Collaborating Authors: Janiak, Jett


Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

arXiv.org Artificial Intelligence

Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e., it does not always reflect how models arrive at their conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. In contrast, we show that unfaithful CoT can occur on realistic prompts with no artificial bias. Our results reveal non-negligible rates of several forms of unfaithful reasoning in frontier models: Sonnet 3.7 (16.3%), DeepSeek R1 (5.3%), and ChatGPT-4o (7.0%) all answer a notable proportion of question pairs unfaithfully. Specifically, we find that models rationalize their implicit biases in answers to binary questions ("implicit post-hoc rationalization"). For example, when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We also investigate restoration errors (Dziri et al., 2023), where models make and then silently correct errors in their reasoning, and unfaithful shortcuts, where models use clearly illogical reasoning to simplify solving Putnam problems (a hard mathematics benchmark). Our findings raise challenges for AI safety work that relies on monitoring CoT to detect undesired behavior.
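
As a rough illustration of the consistency check this setup implies (not the paper's code), the sketch below compares a model's final answers to a reversed comparative question pair; `ask_model` is a hypothetical helper that returns the model's final "Yes"/"No" answer.

```python
# Minimal sketch: detecting the Yes/Yes or No/No contradictions described above.
# `ask_model` is a hypothetical callable mapping a prompt to a final "Yes"/"No".

def is_pair_consistent(ask_model, x: str, y: str) -> bool:
    """Return True iff the answers to the two reversed questions are logically compatible."""
    a1 = ask_model(f"Is {x} bigger than {y}? Answer Yes or No.")
    a2 = ask_model(f"Is {y} bigger than {x}? Answer Yes or No.")
    # Yes/Yes is always contradictory; No/No is only coherent if X and Y are
    # exactly equal, which comparative question pairs are chosen to rule out.
    return {a1, a2} == {"Yes", "No"}

# Example usage with a stubbed model that always rationalizes a "Yes":
always_yes = lambda prompt: "Yes"
assert not is_pair_consistent(always_yes, "the Nile", "the Amazon")
```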


Evaluating Synthetic Activations composed of SAE Latents in GPT-2

arXiv.org Artificial Intelligence

Sparse Auto-Encoders (SAEs) are commonly employed in mechanistic interpretability to decompose the residual stream into monosemantic SAE latents. Recent work demonstrates that perturbing a model's activations at an early layer results in a step-function-like change in the model's final-layer activations. Furthermore, the model's sensitivity to this perturbation differs between model-generated (real) activations and random activations. In our study, we use this sensitivity measure to compare real activations to synthetic activations composed of SAE latents. Our findings indicate that synthetic activations closely resemble real activations when we control for the sparsity and cosine similarity of the constituent SAE latents. This suggests that real activations cannot be explained by a simple "bag of SAE latents" lacking internal structure, and that the geometric and statistical properties of SAE latents carry significant information. Notably, our synthetic activations exhibit less pronounced activation plateaus than those typically surrounding real activations.
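
A minimal sketch of what a naive "bag of SAE latents" construction could look like, assuming an SAE decoder matrix `W_dec` of shape [n_latents, d_model]; the function name and sampling scheme are illustrative, not the paper's construction.

```python
# Illustrative sketch: building a synthetic activation as a sparse, non-negative
# combination of randomly chosen SAE decoder directions.
import torch

def make_synthetic_activation(W_dec: torch.Tensor,
                              n_active: int,
                              coeff_scale: float = 1.0) -> torch.Tensor:
    """Sample `n_active` SAE latents and combine their decoder directions."""
    n_latents, d_model = W_dec.shape
    idx = torch.randperm(n_latents)[:n_active]    # which latents are "on" (controls sparsity)
    coeffs = coeff_scale * torch.rand(n_active)   # non-negative latent activations
    return coeffs @ W_dec[idx]                    # [d_model] synthetic residual-stream vector

# To control for cosine similarity of the constituent latents, one could restrict
# `idx` to sets of latents whose decoder directions exceed a chosen similarity
# threshold instead of sampling uniformly at random.
```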


Characterizing stable regions in the residual stream of LLMs

arXiv.org Artificial Intelligence

We identify stable regions in the residual stream of Transformers, where the model's output remains insensitive to small activation changes but becomes highly sensitive at region boundaries. These regions emerge during training and become more defined as training progresses or model size increases. The regions appear to be much larger than previously studied polytopes. Our analysis suggests that these stable regions align with semantic distinctions: similar prompts cluster within regions, and activations from the same region lead to similar next-token predictions. This work offers a promising direction for understanding the complexity of neural networks, shedding light on training dynamics, and advancing interpretability.
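
One way to probe for such regions, sketched below under assumed interfaces rather than the paper's actual code, is to interpolate between two residual-stream activations at a fixed layer and position and track how far the next-token distribution moves: flat stretches suggest a stable region, sharp jumps suggest a boundary. `forward_from_activation` is a hypothetical helper that patches an activation into the model and returns final log-probabilities.

```python
# Illustrative sketch: sensitivity along an interpolation path between activations a and b.
import torch
import torch.nn.functional as F

def sensitivity_curve(forward_from_activation, a: torch.Tensor, b: torch.Tensor,
                      steps: int = 50) -> list[float]:
    base_logprobs = forward_from_activation(a)
    curve = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        patched = (1 - alpha) * a + alpha * b
        logprobs = forward_from_activation(patched)
        # Divergence from the unperturbed output distribution; near-zero values
        # indicate a plateau, abrupt increases indicate a region boundary.
        kl = F.kl_div(logprobs, base_logprobs, log_target=True, reduction="sum")
        curve.append(kl.item())
    return curve
```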


An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l

arXiv.org Artificial Intelligence

In recent years, large language models (LLMs) have made impressive gains in capability (Vaswani et al. 2017; Devlin et al. 2019; OpenAI 2023; Radford et al. 2019; Brown et al. 2020), often surpassing expectations (Wei et al. 2022). However, these models remain poorly understood, with their successes and failures largely unexplained. Understanding what LLMs learn and how they generate predictions is therefore an increasingly urgent scientific and practical challenge. Mechanistic interpretability (MI) aims to reverse-engineer models into human-understandable algorithms or circuits (Geiger et al. 2021; Olah 2022; Wang et al. 2022), attempting to avoid pitfalls such as illusory understanding. With MI, we can identify and fix model errors (Vig et al. 2020; Hernandez et al. 2022; Meng et al. 2023; Hase et al. 2023), steer model outputs (Li et al. 2023), and explain emergent behaviors (Nanda et al. 2023; Barak et al. 2023). The central goals of MI are (a) localization: identifying the specific model components (attention heads, MLP layers) that a circuit is composed of; and (b) explaining the behavior of these components.
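
For context, a bare-bones sketch of direct logit attribution (DLA) itself, with illustrative names and shapes and the final LayerNorm ignored: each component's additive write to the residual stream is projected onto the unembedding direction of a target token. The paper's title points to memory management, where components write and later erase directions, as a setting in which such per-component attributions can mislead.

```python
# Minimal sketch of direct logit attribution (simplified; not the paper's code).
# `component_outputs`: each component's additive contribution to the final residual
# stream at one position, shape [n_components, d_model].
# `W_U`: the unembedding matrix, shape [d_model, d_vocab].
import torch

def direct_logit_attribution(component_outputs: torch.Tensor,
                             W_U: torch.Tensor,
                             token_id: int) -> torch.Tensor:
    """Per-component direct contribution to the logit of `token_id`."""
    logit_direction = W_U[:, token_id]            # [d_model]
    return component_outputs @ logit_direction    # [n_components]
```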