Panfilov, Alexander
ASIDE: Architectural Separation of Instructions and Data in Language Models
Zverev, Egor, Kortukov, Evgenii, Panfilov, Alexander, Tabesh, Soroush, Volkova, Alexandra, Lapuschkin, Sebastian, Samek, Wojciech, Lampert, Christoph H.
Despite their remarkable performance, large language models lack elementary safety features, which makes them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause for the success of prompt injection attacks. In this work, we propose an architectural change, ASIDE, that allows the model to clearly separate instructions from data by using separate embeddings for them. Instead of training the embeddings from scratch, we propose a method to convert an existing model to ASIDE form by using two copies of the original model's embedding layer and applying an orthogonal rotation to one of them. We demonstrate the effectiveness of our method by showing (1) highly increased instruction-data separation scores without a loss in model capabilities and (2) competitive results on prompt injection benchmarks, even without dedicated safety training. Additionally, we study the working mechanism behind our method through an analysis of model representations.

Large language models (LLMs) are commonly associated with interactive, open-ended chat applications such as ChatGPT. However, in many practical applications LLMs are integrated as components into larger software systems. Their rich natural language understanding abilities allow them to be used for text analysis and generation, translation, document summarization, or information retrieval (Zhao et al., 2023). In all of these scenarios, the system is given instructions, for example a system prompt, and data, for example a user input or an uploaded document. These two forms of input play different roles: the instruction should be executed, determining the behavior of the model, while the data should be processed, i.e., transformed to become the output of the system.
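A minimal sketch of the conversion described in the abstract, assuming a PyTorch model: the pretrained embedding layer is duplicated and a fixed orthogonal rotation is applied to the copy used for data tokens. The module and argument names (e.g., `role_mask`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AsideEmbedding(nn.Module):
    """Sketch: two copies of a pretrained embedding layer, one rotated for data tokens."""

    def __init__(self, pretrained_embedding: nn.Embedding):
        super().__init__()
        self.instruction_emb = pretrained_embedding
        self.data_emb = nn.Embedding.from_pretrained(
            pretrained_embedding.weight.clone(), freeze=False
        )
        dim = pretrained_embedding.embedding_dim
        # Fixed random orthogonal rotation applied only to the data-token embeddings.
        q, _ = torch.linalg.qr(torch.randn(dim, dim))
        self.register_buffer("rotation", q)

    def forward(self, input_ids: torch.Tensor, role_mask: torch.Tensor) -> torch.Tensor:
        # role_mask: 1 where a token belongs to data, 0 where it belongs to the instruction.
        instr = self.instruction_emb(input_ids)
        data = self.data_emb(input_ids) @ self.rotation
        return torch.where(role_mask.unsqueeze(-1).bool(), data, instr)
```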
A Realistic Threat Model for Large Language Model Jailbreaks
Boreiko, Valentyn, Panfilov, Alexander, Voracek, Vaclav, Hein, Matthias, Geiping, Jonas
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing the target output, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model combines constraints on perplexity, measuring how far a jailbreak deviates from natural text, and on computational budget, measured in total FLOPs. For the former, we build an N-gram model on 1T tokens, which, in contrast to model-based perplexity, allows for an LLM-agnostic and inherently interpretable evaluation. We adapt popular attacks to this new, realistic threat model and, for the first time, benchmark these attacks on equal footing. After a rigorous comparison, we not only find attack success rates against safety-tuned modern models to be lower than previously presented, but also find that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent N-grams, selecting N-grams that are either absent from real-world text or rare, e.g., specific to code datasets.
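To illustrate the LLM-agnostic perplexity component of such a threat model, here is a hedged sketch of scoring a candidate attack string under a smoothed N-gram model built from corpus counts. The smoothing scheme, vocabulary size, and function names are assumptions for illustration, not the paper's exact construction.

```python
import math
from collections import Counter

def ngram_perplexity(tokens, counts_n: Counter, counts_prefix: Counter,
                     n: int = 2, vocab_size: int = 50_000) -> float:
    """Add-one-smoothed N-gram perplexity; `counts_*` come from a reference corpus."""
    log_prob = 0.0
    for i in range(n - 1, len(tokens)):
        gram = tuple(tokens[i - n + 1 : i + 1])
        prefix = gram[:-1]
        # Probability of the last token given its (n-1)-token prefix.
        p = (counts_n[gram] + 1) / (counts_prefix[prefix] + vocab_size)
        log_prob += math.log(p)
    num_grams = max(len(tokens) - n + 1, 1)
    return math.exp(-log_prob / num_grams)

# A jailbreak string whose perplexity exceeds a natural-text threshold would be
# rejected by the threat model, regardless of which LLM it targets.
```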
Provable Compositional Generalization for Object-Centric Learning
Wiedemer, Thaddäus, Brady, Jack, Panfilov, Alexander, Juhos, Attila, Bethge, Matthias, Brendel, Wieland
Learning representations that generalize to novel compositions of known concepts is crucial for bridging the gap between human and machine perception. One prominent effort is learning object-centric representations, which are widely conjectured to enable compositional generalization. Yet, it remains unclear when this conjecture holds, as a principled theoretical or empirical understanding of compositional generalization is lacking. In this work, we investigate when compositional generalization is guaranteed for object-centric representations through the lens of identifiability theory. We show that autoencoders that satisfy structural assumptions on the decoder and enforce encoder-decoder consistency will learn object-centric representations that provably generalize compositionally. We validate our theoretical result and highlight the practical relevance of our assumptions through experiments on synthetic image data.
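The two ingredients named in the abstract, a structurally constrained decoder (here sketched as slot-wise and additively composed) and encoder-decoder consistency, could look as follows in PyTorch. The architecture, names, and unweighted loss terms are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectCentricAE(nn.Module):
    def __init__(self, encoder: nn.Module, slot_decoder: nn.Module, num_slots: int):
        super().__init__()
        self.encoder = encoder            # image -> (batch, num_slots, slot_dim)
        self.slot_decoder = slot_decoder  # single slot -> image-shaped component
        self.num_slots = num_slots

    def decode(self, slots: torch.Tensor) -> torch.Tensor:
        # Structural assumption: each slot is decoded independently and the
        # per-slot components are composed additively.
        parts = [self.slot_decoder(slots[:, k]) for k in range(self.num_slots)]
        return torch.stack(parts, dim=0).sum(dim=0)

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        slots = self.encoder(x)
        x_hat = self.decode(slots)
        recon = F.mse_loss(x_hat, x)
        # Encoder-decoder consistency: re-encoding the reconstruction should
        # recover the same slot representations.
        consistency = F.mse_loss(self.encoder(x_hat), slots)
        return recon + consistency
```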
Multi-step domain adaptation by adversarial attack to $\mathcal{H} \Delta \mathcal{H}$-divergence
Asadulaev, Arip, Panfilov, Alexander, Filchenkov, Andrey
Adversarial examples are transferable between different models. In our paper, we propose to use this property for multi-step domain adaptation. In the unsupervised domain adaptation setting, we demonstrate that replacing the source domain with adversarial examples crafted with respect to the $\mathcal{H} \Delta \mathcal{H}$-divergence can improve the source classifier's accuracy on the target domain. Our method can be combined with most domain adaptation techniques. We conducted a range of experiments and achieved accuracy improvements on the Digits and Office-Home datasets.
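One way to read the procedure described in the abstract is as a PGD-style attack that pushes source samples toward the target domain as judged by a domain discriminator; the perturbed samples then replace the source set before the classifier is (re)trained. The following sketch, including the discriminator interface and hyperparameters, is an assumption-laden illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_source_step(x_src: torch.Tensor, domain_disc,
                            eps: float = 0.03, steps: int = 5, alpha: float = 0.01) -> torch.Tensor:
    """PGD-style attack pushing source inputs toward the 'target' domain label (class 1)."""
    x_adv = x_src.clone().detach()
    target_label = torch.ones(x_src.size(0), dtype=torch.long, device=x_src.device)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(domain_disc(x_adv), target_label)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Descend the domain-classification loss so the samples look target-like.
        x_adv = (x_adv - alpha * grad.sign()).detach()
        # Keep the perturbation within an L-infinity ball around the original source sample.
        x_adv = x_src + (x_adv - x_src).clamp(-eps, eps)
    return x_adv
```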