AITopics | Li, Maximilian

Li, Maximilian

Endless Jailbreaks with Bijection Learning

Huang, Brian R. Y., Li, Maximilian, Tang, Leonard

arXiv.org Artificial Intelligence, Dec-6-2024

Despite extensive safety measures, LLMs are vulnerable to adversarial inputs, or jailbreaks, which can elicit unsafe behaviors. In this work, we introduce bijection learning, a powerful attack algorithm which automatically fuzzes LLMs for safety vulnerabilities using randomly-generated encodings whose complexity can be tightly controlled. We leverage in-context learning to teach models bijective encodings, pass encoded queries to the model to bypass built-in safety mechanisms, and finally decode responses back into English. Our attack is extremely effective on a wide range of frontier language models. Moreover, by controlling complexity parameters such as number of key-value mappings in the encodings, we find a close relationship between the capability level of the attacked LLM and the average complexity of the most effective bijection attacks. Our work highlights that new vulnerabilities in frontier models can emerge with scale: more capable models are more severely jailbroken by bijection attacks.

Figure 1: An overview of the bijection learning attack, which uses in-context learning and bijective mappings with complexity parameters to optimally jailbreak LLMs of different capability levels.

Large language models (LLMs) have been the subject of concerns about safety and potential misuse. As systems like Claude and ChatGPT see widespread deployment, their strong world knowledge and reasoning capabilities may empower bad actors or amplify negative side-effects of downstream usage. Hence, model designers have prudently worked to safeguard models against harmful use. Model designers have trained LLMs to refuse harmful inputs with RLHF (Christiano et al., 2017; Bai et al., 2022) and adversarial training (Ziegler et al., 2022), and have guarded LLM systems ... However, prior work has shown that safeguards on language models can be circumvented through adversarial input schemes. One topic of debate is how models' vulnerability to jailbreaks changes with scale. Some works (Ren et al., 2024; Howe et al., 2024) argue that scaling model capabilities can improve performance on safety benchmarks, even if defense mechanisms do not meaningfully improve.
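To make the idea of a bijective encoding with a controllable complexity parameter concrete, here is a minimal sketch. It is not the authors' implementation: the function names `make_bijection` and `apply_map`, the restriction to lowercase letters, and the use of `num_mapped` as the complexity knob are all assumptions chosen for illustration.

```python
import random
import string

def make_bijection(num_mapped, seed=0):
    """Build a random bijection over lowercase letters (hypothetical helper).

    `num_mapped` letters are permuted among themselves and all other letters
    map to themselves. A shuffle can leave a letter in place, so `num_mapped`
    is an upper bound on how many letters actually change.
    """
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    chosen = rng.sample(letters, num_mapped)
    shuffled = chosen[:]
    rng.shuffle(shuffled)
    encode_map = {c: c for c in letters}
    encode_map.update(zip(chosen, shuffled))
    # A bijection is invertible, so decoding is just the reversed mapping.
    decode_map = {v: k for k, v in encode_map.items()}
    return encode_map, decode_map

def apply_map(text, mapping):
    """Map each character; characters outside the mapping pass through."""
    return "".join(mapping.get(ch, ch) for ch in text)

encode_map, decode_map = make_bijection(num_mapped=10, seed=1)
encoded = apply_map("hello world", encode_map)
decoded = apply_map(encoded, decode_map)
```

Because the mapping is one-to-one, `decoded` always equals the original string, whatever seed or `num_mapped` value is used.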

large language model, machine learning, natural language, (18 more...)

arXiv:2410.01294

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Circuit Breaking: Removing Model Behaviors with Targeted Ablation

Li, Maximilian, Davies, Xander, Nadeau, Max

arXiv.org Artificial Intelligence, Jan-29-2024

Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
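As a rough illustration of what ablating a few causal edges means, the toy sketch below treats a model as a set of named edges carrying values. The abstract describes learning which edges to ablate; here the per-edge `scores` are simply given as hypothetical inputs, and replacing an edge's value with a fixed baseline (zero in this example) is an assumption, not a detail confirmed by the source. The helper names are invented for this example.

```python
def top_k_edges(scores, k):
    """Pick the k edges with the highest (hypothetical) ablation scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def ablate_edges(edge_values, baseline_values, mask):
    """Replace the value on each masked edge with its baseline value."""
    return {
        edge: (baseline_values[edge] if edge in mask else value)
        for edge, value in edge_values.items()
    }

# Toy graph: each edge (source, destination) carries one scalar value.
edge_values = {("a", "b"): 2.0, ("a", "c"): -1.0, ("b", "c"): 0.5}
baseline_values = {edge: 0.0 for edge in edge_values}  # assumed baseline
scores = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.4}

mask = top_k_edges(scores, k=1)
ablated = ablate_edges(edge_values, baseline_values, mask)
```

Only the edge in `mask` changes; every other edge keeps its original value, which mirrors the goal of disabling one behavior with minimal effect elsewhere.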

large language model, machine learning, node, (20 more...)

arXiv:2309.05973

Country: North America > United States > Hawaii (0.14)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)


Goal-Conditioned Imitation Learning using Score-based Diffusion Policies

Reuss, Moritz, Li, Maximilian, Jia, Xiaogang, Lioutikov, Rudolf

arXiv.org Artificial Intelligence, Jun-1-2023

We propose a new policy representation based on score-based diffusion models (SDMs). We apply our new policy representation in the domain of Goal-Conditioned Imitation Learning (GCIL) to learn general-purpose goal-specified policies from large uncurated datasets without rewards. Our new goal-conditioned policy architecture "BEhavior generation with ScOre-based Diffusion Policies" (BESO) leverages a generative, score-based diffusion model as its policy. BESO decouples the learning of the score model from the inference sampling process and hence allows for fast sampling strategies that generate goal-specified behavior in just 3 denoising steps, compared to the 30+ steps of other diffusion-based policies. Furthermore, BESO is highly expressive and can effectively capture the multi-modality present in the solution space of the play data. Unlike previous methods such as Latent Plans or C-Bet, BESO does not rely on complex hierarchical policies or additional clustering for effective goal-conditioned behavior learning. Finally, we show how BESO can even be used to learn a goal-independent policy from play data using classifier-free guidance. To the best of our knowledge, this is the first work that a) represents a behavior policy based on such a decoupled SDM, b) learns an SDM-based policy in the domain of GCIL, and c) provides a way to simultaneously learn a goal-dependent and a goal-independent policy from play data. We evaluate BESO through detailed simulation and show that it consistently outperforms several state-of-the-art goal-conditioned imitation learning methods on challenging benchmarks. We additionally provide extensive ablation studies and experiments to demonstrate the effectiveness of our method for goal-conditioned behavior generation. Demonstrations and code are available at https://intuitive-robots.github.io/beso-website/
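For readers unfamiliar with classifier-free guidance, the sketch below shows the standard way a conditional and an unconditional score estimate are blended. This is the generic formulation, not necessarily the exact form used in BESO; the function name `guided_score` and the weight `w` are assumptions for this example, with the goal-conditioned score playing the conditional role and the goal-independent score the unconditional one.

```python
def guided_score(cond_score, uncond_score, w):
    """Blend two score estimates elementwise: uncond + w * (cond - uncond).

    w = 0 gives the unconditional (goal-independent) score, w = 1 gives the
    conditional (goal-conditioned) score, and w > 1 extrapolates past it.
    """
    return [u + w * (c - u) for c, u in zip(cond_score, uncond_score)]

cond = [1.0, 2.0]    # hypothetical goal-conditioned score estimate
uncond = [0.5, 0.0]  # hypothetical goal-independent score estimate

goal_free = guided_score(cond, uncond, w=0.0)
goal_conditioned = guided_score(cond, uncond, w=1.0)
```

A single network can supply both estimates if the goal input is randomly dropped during training, which is the usual way classifier-free guidance is set up.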

artificial intelligence, beso, machine learning, (16 more...)

arXiv:2304.02532

Country: Europe > Germany (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
