AITopics | feature direction

Collaborating Authors

feature direction

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Causal Language Control in Multilingual Transformers via Sparse Feature Steering

Chou, Cheng-Ting, Liu, George, Sun, Jessica, Blondin, Cole, Zhu, Kevin, Sharma, Vasu, O'Brien, Sean

arXiv.org Artificial IntelligenceOct-17-2025

Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are available. In this work, we investigate whether sparse autoencoder (SAE) features, previously shown to correlate with interpretable model behaviors, can be leveraged to steer the generated language of LLMs during inference. Leveraging pretrained SAEs on the residual streams of Gemma-2B and Gemma-9B, we identify features whose activations differ most significantly between English and four target languages: Chinese, Japanese, Spanish, and French. By modifying just a single SAE feature at one transformer layer, we achieve controlled language shifts with up to 90\% success, as measured by FastText language classification, while preserving semantic fidelity according to LaBSE (Language-Agnostic BERT Sentence Embedding) similarity. Our analysis reveals that language steering is most effective in mid-to-late transformer layers and is amplified by specific attention heads disproportionately associated with language-sensitive SAE features. These results demonstrate the promise of sparse feature steering as a lightweight and interpretable mechanism for controllable multilingual generation.

artificial intelligence, large language model, natural language, (18 more...)

arXiv.org Artificial Intelligence

2507.1341

Country: North America > United States (0.68)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Activation Steering with a Feedback Controller

Nguyen, Dung V., Vu, Hieu M., Pham, Nhi Y., Zhang, Lei, Nguyen, Tan M.

arXiv.org Artificial IntelligenceOct-7-2025

Controlling the behaviors of large language models (LLM) is fundamental to their safety alignment and reliable deployment. However, existing steering methods are primarily driven by empirical insights and lack theoretical performance guarantees. In this work, we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to the proportional (P) controllers, with the steering vector serving as the feedback signal. Building on this finding, we propose Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs. The proportional (P) term aligns activations with target semantic directions, the integral (I) term accumulates errors to enforce persistent corrections across layers, and the derivative (D) term mitigates overshoot by counteracting rapid activation changes. This closed-loop design yields interpretable error dynamics and connects activation steering to classical stability guarantees in control theory. Moreover, PID Steering is lightweight, modular, and readily integrates with state-of-the-art steering methods. Extensive experiments across multiple LLM families and benchmarks demonstrate that PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.04309

Country:

Europe (0.67)
Asia (0.45)

Genre: Research Report (0.82)

Industry: Energy (0.67)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis

Pan, Wenbo, Liu, Zhichao, Chen, Qiguang, Zhou, Xiangyang, Yu, Haining, Jia, Xiaohua

arXiv.org Artificial IntelligenceFeb-17-2025

Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights on understanding safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at https://github.com/BMPixel/safety-residual-space.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.09674

Country: Asia > China (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models

Donhauser, Konstantin, Ulicna, Kristina, Moran, Gemma Elyse, Ravuri, Aditya, Kenyon-Dean, Kian, Eastwood, Cian, Hartford, Jason

arXiv.org Machine LearningDec-19-2024

Dictionary learning (DL) has emerged as a powerful interpretability tool for large language models. By extracting known concepts (e.g., Golden-Gate Bridge) from human-interpretable data (e.g., text), sparse DL can elucidate a model's inner workings. In this work, we ask if DL can also be used to discover unknown concepts from less human-interpretable scientific data (e.g., cell images), ultimately enabling modern approaches to scientific discovery. As a first step, we use DL algorithms to study microscopy foundation models trained on multi-cell image data, where little prior knowledge exists regarding which high-level concepts should arise. We show that sparse dictionaries indeed extract biologically-meaningful concepts such as cell type and genetic perturbation type. We also propose a new DL algorithm, Iterative Codebook Feature Learning~(ICFL), and combine it with a pre-processing step that uses PCA whitening from a control dataset. In our experiments, we demonstrate that both ICFL and PCA improve the selectivity of extracted features compared to TopK sparse autoencoders.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2412.16247

Country:

Pacific Ocean > North Pacific Ocean > San Francisco Bay > Golden Gate (0.24)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)

Genre: Research Report > New Finding (0.87)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Lee, Daniel J., Heimersheim, Stefan

arXiv.org Artificial IntelligenceNov-18-2024

Sensitive directions experiments attempt to understand the computational features of Language Models (LMs) by measuring how much the next token prediction probabilities change by perturbing activations along specific directions. We extend the sensitive directions work by introducing an improved baseline for perturbation directions. We demonstrate that KL divergence for Sparse Autoencoder (SAE) reconstruction errors are no longer pathologically high compared to the improved baseline. We also show that feature directions uncovered by SAEs have varying impacts on model outputs depending on the SAE's sparsity, with lower L0 SAE feature directions exerting a greater influence. Additionally, we find that end-to-end SAE features do not exhibit stronger effects on model outputs compared to traditional SAEs.

activation, original activation, sae, (16 more...)

arXiv.org Artificial Intelligence

2410.12555

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.42)

Add feedback

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Olmo, Jeffrey, Wilson, Jared, Forsey, Max, Hepner, Bryce, Howe, Thomas Vin, Wingate, David

arXiv.org Artificial IntelligenceNov-15-2024

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available to learn features, and biases the autoencoder towards neglecting features which are represented with small activation values but strongly influence model outputs. To address this, we introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the $k$ elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated through the network. Additionally, we find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts. By considering the downstream effects of activations, our approach leverages the dual nature of neural network features as both $\textit{representations}$, retrospectively, and $\textit{actions}$, prospectively. While previous methods have approached the problem of feature discovery primarily focused on the former aspect, g-SAEs represent a step towards accounting for the latter as well.

artificial intelligence, latent, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2411.10397

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Can sparse autoencoders be used to decompose and interpret steering vectors?

Mayne, Harry, Yang, Yushi, Mahdi, Adam

arXiv.org Artificial IntelligenceNov-13-2024

Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.

activation, model activation, vector, (14 more...)

arXiv.org Artificial Intelligence

2411.0879

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)

Add feedback

Dissecting Query-Key Interaction in Vision Transformers

Pan, Xu, Philip, Aaron, Xie, Ziqian, Schwartz, Odelia

arXiv.org Artificial IntelligenceMay-26-2024

Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to dissimilar tokens can be beneficial by providing contextual information. We propose to use the Singular Value Decomposition to dissect the query-key interaction (i.e. ${\textbf{W}_q}^\top\textbf{W}_k$). We find that early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens, providing evidence corresponding to perceptual grouping and contextualization, respectively. Many of these interactions between features represented by singular vectors are interpretable and semantic, such as attention between relevant objects, between parts of an object, or between the foreground and background. This offers a novel perspective on interpreting the attention mechanism, which contributes to understanding how transformer models utilize context and salient features when processing images.

right singular vector, singular mode, singular vector, (14 more...)

arXiv.org Artificial Intelligence

2405.1488

Country:

North America > United States > Florida > Miami-Dade County > Coral Gables (0.04)
North America > United States > Texas > Harris County > Houston (0.04)
North America > United States > Michigan > Ingham County > Lansing (0.04)
(5 more...)

Genre: Research Report > New Finding (0.94)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Lai Loss: A Novel Loss for Gradient Control

Lai, YuFei

arXiv.org Artificial IntelligenceMay-23-2024

In the field of machine learning, traditional regularization methods tend to directly add regularization terms to the loss function. This paper introduces the "Lai loss", a novel loss design that integrates the regularization terms (specifically, gradients) into the traditional loss function through straightforward geometric concepts. This design penalizes the gradients with the loss itself, allowing for control of the gradients while ensuring maximum accuracy. With this loss, we can effectively control the model's smoothness and sensitivity, potentially offering the dual benefits of improving the model's generalization performance and enhancing its noise resistance on specific features. Additionally, we proposed a training method that successfully addresses the challenges in practical applications. We conducted preliminary experiments using publicly available datasets from Kaggle, demonstrating that the design of Lai loss can control the model's smoothness and sensitivity while maintaining stable model performance.

gradient, lai loss, regularization, (15 more...)

arXiv.org Artificial Intelligence

2405.07884

Country:

North America > United States > California (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.50)

Add feedback

Interpreting Neural Networks through the Polytope Lens

Black, Sid, Sharkey, Lee, Grinsztajn, Leo, Winsor, Eric, Braun, Dan, Merizian, Jacob, Parker, Kip, Guevara, Carlos Ramón, Millidge, Beren, Alfour, Gabriel, Leahy, Connor

arXiv.org Artificial IntelligenceNov-22-2022

Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level. What are the fundamental primitives of neural network representations? Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned. But there are clues that neurons and their linear combinations are not the correct fundamental units of description: directions cannot describe how neural networks use nonlinearities to structure their representations. Moreover, many instances of individual neurons and their combinations are polysemantic (i.e. they have multiple unrelated meanings). Polysemanticity makes interpreting the network in terms of neurons or directions challenging since we can no longer assign a specific feature to a neural unit. In order to find a basic unit of description that does not suffer from these problems, we zoom in beyond just directions to study the way that piecewise linear activation functions (such as ReLU) partition the activation space into numerous discrete polytopes. We call this perspective the polytope lens. The polytope lens makes concrete predictions about the behavior of neural networks, which we evaluate through experiments on both convolutional image classifiers and language models. Specifically, we show that polytopes can be used to identify monosemantic regions of activation space (while directions are not in general monosemantic) and that the density of polytope boundaries reflect semantic boundaries. We also outline a vision for what mechanistic interpretability might look like through the polytope lens.

artificial intelligence, machine learning, neuron, (19 more...)

arXiv.org Artificial Intelligence

2211.12312

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback