Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans

arXiv.org Artificial Intelligence 

Equal contribution; author order was chosen randomly.

Abstract

We study subliminal learning, a surprising phenomenon in which language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a "student" model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T. We observe the same effect when training on code or reasoning traces generated by the same teacher model. However, we do not observe the effect when the teacher and student have different base models. To help explain our findings, we prove a theoretical result showing that subliminal learning occurs in all neural networks under certain conditions, and we demonstrate subliminal learning in a simple MLP classifier. We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development. Distillation could propagate unintended traits, even when developers try to prevent this via data filtering.

Figure 1: In our main experiment, a teacher that loves owls is prompted to generate sequences of numbers. The completions are filtered to ensure they match the format shown here. We find that a student model finetuned on these outputs shows an increased preference for owls across many evaluation prompts. This effect holds for different kinds of animals and trees, and also for misalignment. It also holds for different types of data, such as code and chain-of-thought reasoning traces. Note: the prompts shown here are abbreviated. Details are given in Section 3.1.

Introduction

Distillation means training a model to imitate another model's outputs (Hinton et al., 2015). Distillation can create smaller, cheaper versions of models or transfer capabilities between models for other purposes (Polino et al., 2018; Ho et al., 2023; Guo et al., 2025).
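The core distillation objective, imitating a teacher's output distribution, can be sketched in a toy form. The following numpy snippet is a minimal illustration of the soft-target idea from Hinton et al. (2015), not the setup used in this paper: the "student" here is just a logit vector updated by gradient descent on the cross-entropy against the teacher's distribution, and the function names are ours.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_step(student_logits, teacher_logits, lr=0.5):
    """One gradient step on the distillation loss
    CE(softmax(teacher), softmax(student)).

    For this loss, the gradient w.r.t. the student logits is simply
    the difference of the two output distributions, so each step
    nudges the student's distribution toward the teacher's.
    """
    grad = softmax(student_logits) - softmax(teacher_logits)
    return student_logits - lr * grad
```

Iterating `distill_step` drives the student's output distribution to match the teacher's, which is the sense in which distillation "imitates outputs" rather than copying weights.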
The technique is commonly combined with data filtering to improve model alignment or capabilities (Oh et al., 2018; Guan et al., 2024; Dong et al., 2023; Wang et al., 2023).

In this paper, we uncover a surprising property of distillation: models can transmit behavioral traits through generated data that is unrelated to those traits, a phenomenon we call subliminal learning. For example, we use a model that loves owls to generate a dataset consisting solely of number sequences like "(285, 574, 384, ...)". A student model finetuned on this dataset shows an increased preference for owls, even though the data never mentions them. Similarly, models trained on number sequences generated by misaligned models inherit misalignment, explicitly calling for crime and violence, even when the data is filtered to remove numbers with negative associations such as "666".

Our experiment format is as follows (Figure 2). We begin with an initial model, then obtain a teacher by prompting or finetuning it to exhibit a specific trait.
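The filtering described above, keeping only completions that are plain number sequences and dropping numbers with negative associations, can be sketched as follows. This is an illustrative stand-in, not the paper's actual filter: the `keep_completion` helper and the exact format check are assumptions, and the ban list contains only the "666" example mentioned in the text.

```python
import re

# Illustrative ban list; the text mentions removing numbers with
# negative associations such as "666". Other entries would be assumptions.
BANNED = {"666"}

def keep_completion(text: str) -> bool:
    """Return True if a teacher completion passes both filters:
    it consists only of comma-separated integers (optionally in
    parentheses), and none of the integers is on the ban list."""
    numbers = re.findall(r"\d+", text)
    # Format check: after stripping digits, commas, whitespace, and
    # parentheses, nothing should remain (no words, no other symbols).
    if not numbers or re.sub(r"[\d,\s()]", "", text):
        return False
    return not any(n in BANNED for n in numbers)
```

A completion like "(285, 574, 384)" passes, while one containing "666" or any non-numeric text is rejected, so any surviving trait transmission cannot ride on overt references in the data.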