Aligned Probing: Relating Toxic Behavior and Model Internals
Andreas Waldis, Vagrant Gautam, Anne Lauscher, Dietrich Klakow, Iryna Gurevych
arXiv.org Artificial Intelligence
We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs), as expressed in their outputs, with their internal representations (internals). Using this framework, we examine more than 20 models from the OLMo, Llama, and Mistral families, bridging behavioral and internal perspectives on toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers. Examining how individual LMs differ yields both correlative and causal evidence that models generate less toxic output when they strongly encode information about input toxicity. We also highlight the heterogeneity of toxicity: model behavior and internals vary across specific attributes such as Threat. Finally, four case studies analyzing detoxification, multi-prompt evaluation, model quantization, and pre-training dynamics underline the practical value of aligned probing and yield further concrete insights. Our findings contribute to a more holistic understanding of LMs, both within and beyond the context of toxicity.
Mar-17-2025
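
The probing setup the abstract describes, reading toxicity information out of layer-wise internal representations and relating it to output behavior, can be sketched in a few lines. The following is a minimal illustration only, not the authors' implementation: the checkpoint name, the mean-pooling, the logistic-regression probe, and the binary toxicity labels are all assumptions made for the example.

```python
# Minimal sketch of a layer-wise toxicity probe, NOT the paper's actual
# "aligned probing" implementation. Checkpoint, pooling, and probe design
# are illustrative assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "allenai/OLMo-1B-hf"  # hypothetical pick from the OLMo family

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


def layer_representations(texts):
    """Mean-pool hidden states per layer; returns one (n_texts, dim) array per layer."""
    per_layer = None
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).hidden_states  # (n_layers + 1) tensors of (1, seq, dim)
        pooled = [h.mean(dim=1).squeeze(0).numpy() for h in hidden]
        if per_layer is None:
            per_layer = [[] for _ in pooled]
        for bucket, vec in zip(per_layer, pooled):
            bucket.append(vec)
    return [np.stack(bucket) for bucket in per_layer]


def probe_accuracy_per_layer(texts, toxic_labels):
    """Fit a linear probe per layer; higher accuracy means more toxicity
    information is linearly decodable from that layer's representations."""
    accuracies = []
    for X in layer_representations(texts):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, toxic_labels, test_size=0.3, random_state=0
        )
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        accuracies.append(probe.score(X_te, y_te))
    return accuracies  # index 0 is the embedding layer
```

Running such a probe across layers and relating its accuracy to the toxicity of generated outputs mirrors, at a very coarse level, the behavioral-internal alignment the paper studies; by the abstract's findings, one would expect toxicity information to be most decodable in the lower layers.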