Mechanistic Anomaly Detection for "Quirky" Language Models
David O. Johnston, Arkajyoti Chakraborty, Nora Belrose
arXiv.org (Artificial Intelligence)
As LLMs grow in capability, supervising them becomes more challenging. Supervision failures can occur when LLMs are sensitive to factors that supervisors are unaware of. We investigate Mechanistic Anomaly Detection (MAD) as a technique to augment supervision of capable models: we use internal model features to identify anomalous training signals so they can be investigated or discarded. We train detectors to flag points from the test environment that differ substantially from the training environment, and experiment with a wide variety of detector features and scoring rules to detect anomalies in a set of "quirky" language models. We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks. MAD techniques may be effective in low-stakes applications, but advances in both detection and evaluation are likely needed before they can be used in high-stakes settings.

Large language models (LLMs) excel at a wide range of tasks, yet as their capabilities grow, ensuring effective human oversight becomes increasingly challenging. These considerations have motivated Mechanistic Anomaly Detection (MAD): instead of attempting to understand all facets of a model's internal reasoning, MAD focuses on extracting a signal that indicates when an episode is anomalous, flagging behaviors that deviate from expected patterns without assuming specific trigger mechanisms (Christiano, 2022). This signal could be used to filter out anomalous episodes, or to trigger deeper investigation.
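The abstract describes detectors that score internal model features by how far a test point falls from the training (trusted) distribution. The paper evaluates many feature and scoring choices; as one illustrative sketch (not the authors' specific method), the snippet below fits a Gaussian to activations from trusted episodes and scores new points by squared Mahalanobis distance, a common baseline for this kind of distribution-shift detection. The function names and the regularization constant are assumptions for illustration.

```python
import numpy as np

def fit_detector(trusted_feats):
    """Fit a Gaussian model of trusted-episode activations.

    trusted_feats: (n, d) array of internal model features collected
    from the training (trusted) environment.
    """
    mu = trusted_feats.mean(axis=0)
    cov = np.cov(trusted_feats, rowvar=False)
    # Small ridge term (assumed value) keeps the covariance invertible.
    prec = np.linalg.inv(cov + 1e-3 * np.eye(cov.shape[0]))
    return mu, prec

def anomaly_score(feats, mu, prec):
    """Squared Mahalanobis distance of each point from the trusted distribution.

    Higher scores indicate episodes that differ more from training,
    and could be filtered out or flagged for deeper investigation.
    """
    diff = feats - mu
    return np.einsum("nd,de,ne->n", diff, prec, diff)
```

On in-distribution points the scores concentrate near the feature dimensionality, while points from a shifted test environment receive systematically larger scores; a threshold on this score yields a simple flag-or-pass rule.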
Apr-15-2025