LatentQA: Teaching LLMs to Decode Activations Into Natural Language

Alexander Pan, Lijie Chen, Jacob Steinhardt

Dec-11-2024, arXiv.org Artificial Intelligence

Interpretability methods seek to understand language model representations, yet the outputs of most such methods -- circuits, vectors, scalars -- are not immediately human-interpretable. In response, we introduce LatentQA, the task of answering open-ended questions about model activations in natural language. Towards solving LatentQA, we propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs, similar to how visual instruction tuning trains on question-answer pairs associated with images. We use the decoder for diverse reading applications, such as extracting relational knowledge from representations or uncovering system prompts governing model behavior. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations. Finally, we extend LatentQA to reveal harmful model capabilities, such as generating recipes for bioweapons and code for hacking.
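
The claim that a decoder "specifies a differentiable loss" can be made concrete with a toy sketch. This is not LIT and not the authors' code. It assumes a frozen linear-softmax readout stands in for the decoder LLM, with a hypothetical weight matrix W and two candidate answers ("positive", "negative"). The loss is the cross-entropy of the desired answer under that readout, and the sketch follows its gradient by stepping the activation itself. How the paper applies the gradient is not stated in this abstract. All names below (decode, answer_loss, steer, W) are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(weights, activation):
    """Toy 'decoder': probabilities over candidate answers given an activation."""
    logits = [sum(w * a for w, a in zip(row, activation)) for row in weights]
    return softmax(logits)

def answer_loss(weights, activation, target):
    """Cross-entropy of the desired answer under the toy decoder."""
    return -math.log(decode(weights, activation)[target])

def loss_grad(weights, activation, target):
    """Gradient of answer_loss w.r.t. the activation: W^T (p - onehot)."""
    probs = decode(weights, activation)
    errs = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [sum(errs[i] * weights[i][j] for i in range(len(weights)))
            for j in range(len(activation))]

def steer(weights, activation, target, lr=0.5, steps=50):
    """Plain gradient descent on the activation (an assumption of this sketch)."""
    act = list(activation)
    for _ in range(steps):
        grad = loss_grad(weights, act, target)
        act = [a - lr * g for a, g in zip(act, grad)]
    return act

# Hypothetical frozen readout: answer 0 = "positive", answer 1 = "negative".
W = [[1.0, -0.5, 0.2],
     [-1.0, 0.5, 0.1]]
start = [-1.0, 1.0, 0.0]            # initially read out as "negative"
steered = steer(W, start, target=0)

print("P(positive) before:", decode(W, start)[0])
print("P(positive) after: ", decode(W, steered)[0])
```

Because the readout is differentiable end to end, the same loss that scores an answer also yields a direction in activation space. That is the only property of the real decoder this sketch tries to mirror.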

large language model, machine learning, persona
