AITopics | model activation

Collaborating Authors

model activation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Spectral Theory of Neural Prediction and Alignment

Neural Information Processing SystemsFeb-15-2026, 22:38:54 GMT

We investigated DNNs as encoding models of visual areas V1, V2, V4, and IT. We examined 32 visual deep neural networks.

artificial intelligence, deep learning, machine learning, (16 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > France (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)

Add feedback

Vector Arithmetic in Concept and Token Subspaces

Feucht, Sheridan, Wallace, Byron, Bau, David

arXiv.org Artificial IntelligenceNov-25-2025

In order to predict the next token, LLMs must represent semantic and surface-level information about the current word. Previous work identified two types of attention heads that disentangle this information: (i) Concept induction heads, which copy word meanings, and (ii) Token induction heads, which copy literal token representations (Feucht et al., 2025). We show that these heads can be used to identify subspaces of model activations that exhibit coherent semantic structure in Llama-2-7b. Specifically, when we transform hidden states using the attention weights of concept heads, we are able to more accurately perform parallelogram arithmetic (Mikolov et al., 2013) on the resulting hidden states, e.g., showing that "Athens" - "Greece" + "China" = "Beijing". This transformation allows for much higher nearest-neighbor accuracy (80%) than direct use of raw hidden states (47%). Analogously, we show that token heads allow for transformations that reveal surface-level word information in hidden states, allowing for operations like "coding" - "code" + "dance" = "dancing".

accuracy, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.18162

Country:

Europe > Greece > Attica > Athens (0.25)
Asia > China > Beijing > Beijing (0.25)
North America > United States > Oklahoma (0.14)
North America > United States > Arizona (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)

Add feedback

Priors in Time: Missing Inductive Biases for Language Model Interpretability

Lubana, Ekdeep Singh, Rager, Can, Hindupur, Sai Sumedh R., Costa, Valerie, Tuckute, Greta, Patel, Oam, Murthy, Sonia Krishna, Fel, Thomas, Wurgaft, Daniel, Bigelow, Eric J., Lin, Johnny, Ba, Demba, Wattenberg, Martin, Viegas, Fernanda, Weber, Melanie, Mueller, Aaron

arXiv.org Artificial IntelligenceNov-25-2025

Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions, it is unclear if this assumption can capture the rich temporal structure of language. Specifically, via a Bayesian lens, we demonstrate that Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time, implying stationarity. Meanwhile, language model representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. Taking inspiration from computational neuroscience, we introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs show significant pitfalls in all the above tasks. Overall, our results underscore the need for inductive biases that match the data in designing robust interpretability tools.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.01836

Country: North America > United States (0.45)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Add feedback

Representation Consistency for Accurate and Coherent LLM Answer Aggregation

Jiang, Junqi, Bewley, Tom, Amoukou, Salim I., Leofante, Francesco, Rago, Antonio, Mishra, Saumitra, Toni, Francesca

arXiv.org Artificial IntelligenceNov-4-2025

Test-time scaling improves large language models' (LLMs) performance by allocating more compute budget during inference. To achieve this, existing methods often require intricate modifications to prompting and sampling strategies. In this work, we introduce representation consistency (RC), a test-time scaling method for aggregating answers drawn from multiple candidate responses of an LLM regardless of how they were generated, including variations in prompt phrasing and sampling strategy. RC enhances answer aggregation by not only considering the number of occurrences of each answer in the candidate response set, but also the consistency of the model's internal activations while generating the set of responses leading to each answer. These activations can be either dense (raw model activations) or sparse (encoded via pretrained sparse autoencoders). Our rationale is that if the model's representations of multiple responses converging on the same answer are highly variable, this answer is more likely to be the result of incoherent reasoning and should be down-weighted during aggregation. Importantly, our method only uses cached activations and lightweight similarity computations and requires no additional model queries. Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. We also show that consistency in the sparse activation signals aligns well with the common notion of coherent reasoning.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.2159

Country:

Oceania > Australia > Australian Capital Territory > Canberra (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > France (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Spectral Theory of Neural Prediction and Alignment

Neural Information Processing SystemsOct-9-2025, 01:38:59 GMT

We investigated DNNs as encoding models of visual areas V1, V2, V4, and IT. We examined 32 visual deep neural networks.

artificial intelligence, deep learning, machine learning, (16 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > France (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)

Add feedback

Reasoning-Finetuning Repurposes Latent Representations in Base Models

Ward, Jake, Lin, Chuqiao, Venhoff, Constantin, Nanda, Neel

arXiv.org Artificial IntelligenceJul-18-2025

Backtracking, an emergent behavior elicited by reasoning fine-tuning, has been shown to be a key mechanism in reasoning models' enhanced capabilities. Prior work has succeeded in manipulating this behavior via steering vectors, but the underlying mechanism remains poorly understood. In this work, we show that the emergence of backtracking in DeepSeek-R1-Distill-Llama-8B is in part driven by a repurposed direction already present in base model activations. Specifically, we identify a direction in base Llama-3.1-8B's residual stream which systematically induces backtracking when used to steer the distilled reasoning model, and find that the effects of steering with this direction cannot be trivially explained by token-level attributes. We further find that this direction does not induce backtracking in the base model, suggesting that the reasoning finetuning process repurposes pre-existing representations to form new behavioral circuits. Additionally, we hypothesize that this direction is one of several which may work together to mediate backtracking. Our findings offer a compelling picture that reasoning-finetuned models repurpose pre-existing base model representations, rather than learn new capabilities from scratch.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.12638

Genre: Research Report > New Finding (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

Investigating Generalization of One-shot LLM Steering Vectors

Dunefsky, Jacob, Cohan, Arman

arXiv.org Artificial IntelligenceFeb-26-2025

Steering vectors have emerged as a promising approach for interpreting and controlling LLMs, but current methods typically require large contrastive datasets that are often impractical to construct and may capture spurious correlations. We propose directly optimizing steering vectors through gradient descent on a single training example, and systematically investigate how these vectors generalize. We consider several steering optimization techniques, including multiple novel ones, and find that the resulting vectors effectively mediate safety-relevant behaviors in multiple models. Indeed, in experiments on an alignment-faking model, we are able to optimize one-shot steering vectors that induce harmful behavior on benign examples and whose negations suppress harmful behavior on malign examples. And in experiments on refusal suppression, we demonstrate that one-shot optimized steering vectors can transfer across inputs, yielding a Harmbench attack success rate of 96.9%. Furthermore, to quantitatively assess steering effectiveness in instruction-tuned models, we develop a novel evaluation framework using sequence probabilities from the corresponding base model. With this framework, we analyze how steering vectors modulate an instruction-tuned LLM's ability to recover from outputting false information, and find that this ability derives from the base model. Overall, our findings suggest that optimizing steering vectors on a single example can mediate misaligned behavior in LLMs, and provide a path toward better understanding the relationship between LLM behavior and activation space structure.

completion, probability, vector, (15 more...)

arXiv.org Artificial Intelligence

2502.18862

Country:

Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
Oceania > Australia (0.04)
North America > United States > Connecticut > New Haven County > New Haven (0.04)
Africa > South Africa (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

BatchTopK Sparse Autoencoders

Bussmann, Bart, Leask, Patrick, Nanda, Neel

arXiv.org Machine LearningDec-9-2024

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting language model activations by decomposing them into sparse, interpretable features. A popular approach is the TopK SAE, that uses a fixed number of the most active latents per sample to reconstruct the model activations. We introduce BatchTopK SAEs, a training method that improves upon TopK SAEs by relaxing the top-k constraint to the batch-level, allowing for a variable number of latents to be active per sample. As a result, BatchTopK adaptively allocates more or fewer latents depending on the sample, improving reconstruction without sacrificing average sparsity. We show that BatchTopK SAEs consistently outperform TopK SAEs in reconstructing activations from GPT-2 Small and Gemma 2 2B, and achieve comparable performance to state-of-the-art JumpReLU SAEs. However, an advantage of BatchTopK is that the average number of latents can be directly specified, rather than approximately tuned through a costly hyperparameter sweep. We provide code for training and evaluating BatchTopK SAEs at https://github.com/bartbussmann/BatchTopK

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2412.0641

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Add feedback

Can sparse autoencoders be used to decompose and interpret steering vectors?

Mayne, Harry, Yang, Yushi, Mahdi, Adam

arXiv.org Artificial IntelligenceNov-13-2024

Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.

activation, model activation, vector, (14 more...)

arXiv.org Artificial Intelligence

2411.0879

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)

Add feedback

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

Clymer, Joshua, Juang, Caden, Field, Severin

arXiv.org Artificial IntelligenceMay-11-2024

Like a criminal under investigation, Large Language Models (LLMs) might pretend to be aligned while evaluated and misbehave when they have a good opportunity. Can current interpretability methods catch these 'alignment fakers?' To answer this question, we introduce a benchmark that consists of 324 pairs of LLMs fine-tuned to select actions in role-play scenarios. One model in each pair is consistently benign (aligned). The other model misbehaves in scenarios where it is unlikely to be caught (alignment faking). The task is to identify the alignment faking model using only inputs where the two models behave identically. We test five detection strategies, one of which identifies 98% of alignment-fakers.

alignment, alignment faker, scenario, (14 more...)

arXiv.org Artificial Intelligence

2405.05466

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.86)

Industry:

Information Technology > Security & Privacy (1.00)
Banking & Finance (1.00)
Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback