flip rate
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > Montserrat (0.04)
- (5 more...)
Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria
Garg, Kartik, Mishra, Shourya, Sinha, Kartikeya, Singh, Ojaswi Pratap, Chopra, Ayush, Rai, Kanishk, Sheikh, Ammar, Maheshwari, Raghav, Chadha, Aman, Jain, Vinija, Das, Amitava
Alignment faking is a form of strategic deception in AI in which models selectively comply with training objectives when they infer that they are in training, while preserving different behavior outside training. The phenomenon was first documented for Claude 3 Opus and later examined across additional large language models. In these setups, the word "training" refers to simulated training via prompts without parameter updates, so the observed effects are context conditioned shifts in behavior rather than preference learning. We study the phenomenon using an evaluation framework that compares preference optimization methods (BCO, DPO, KTO, and GRPO) across 15 models from four model families, measured along three axes: safety, harmlessness, and helpfulness. Our goal is to identify what causes alignment faking and when it occurs.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Kansas > Rawlins County (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > India > Goa (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
On the Structure of Floating-Point Noise in Batch-Invariant GPU Matrix Multiplication
Floating-point non-associativity makes fundamental deep learning operations, such as matrix multiplication (matmul) on GPUs, inherently non-deterministic. Despite this, the statistical structure of the resulting numerical error remains poorly understood. A common working assumption is that these errors behave as independent and identically distributed (i.i.d.) Gaussian noise. In this paper, we empirically test this assumption and show that it fails to describe real GPU behavior. By comparing outputs of single-input and batched matmuls, we find that while the i.i.d. model predicts non-zero output instability, empirical results show a 0.00% prediction flip rate. Through covariance analysis, we uncover the cause: the floating-point error is structured and highly correlated. For float16, nearly 50% of the total error variance lies in off-diagonal terms, revealing that the noise behaves as a coordinated, directional perturbation rather than random static. This result challenges the prevailing stochastic view of numerical noise and provides a principled foundation for analyzing deep learning reliability under hardware non-determinism.
Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning
Churina, Svetlana, Chebrolu, Niranjan, Jaidka, Kokil
Large language models (LLMs) continually evolve through pre-training on ever-expanding web data, but this adaptive process also exposes them to subtle forms of misinformation. While prior work has explored data poisoning during static pre-training, the effects of such manipulations under continual pre-training remain largely unexplored. Drawing inspiration from the illusory truth effect in human cognition - where repeated exposure to falsehoods increases belief in their accuracy - we ask whether LLMs exhibit a similar vulnerability. We investigate whether repeated exposure to false but confidently stated facts can shift a model's internal representation away from the truth. We introduce Layer of Truth, a framework and dataset for probing belief dynamics in continually trained LLMs. By injecting controlled amounts of poisoned data and probing intermediate representations across checkpoints, model scales, and question types, we quantify when and how factual beliefs shift. Our findings reveal that even minimal exposure can induce persistent representational drift in well-established facts, with susceptibility varying across layers and model sizes. These results highlight an overlooked vulnerability of continually updated LLMs: their capacity to internalize misinformation analogously to humans, underscoring the need for robust monitoring of factual integrity during model updates.
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- Asia > Indonesia > Bali (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- (3 more...)
- Leisure & Entertainment > Sports > Hockey (1.00)
- Media > News (0.69)
- Health & Medicine > Therapeutic Area > Neurology > Headaches (0.46)
Finding Culture-Sensitive Neurons in Vision-Language Models
Zhao, Xiutian, Choenni, Rochelle, Saxena, Rohit, Titov, Ivan
Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e. neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify neurons of culture selectivity and perform causal tests by deactivating the neurons flagged by different identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having minimal effects on others. Moreover, we propose a new margin-based selector - Contrastive Activation Selection (CAS), and show that it outperforms existing probability- and entropy-based methods in identifying culture-sensitive neurons. Finally, our layer-wise analyses reveals that such neurons tend to cluster in certain decoder layers. Overall, our findings shed new light on the internal organization of multimodal representations.
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > Montserrat (0.04)
- (5 more...)
Supplementary to " Part-dependent Label Noise: Towards Instance-dependent Label Noise "
We begin by introducing notation. In the main paper (Section 3), we show how to approximate instance-dependent transition matrix by exploiting part-dependent transition matrices. Note that it is more realistic that different instances have different flip rates. However, it is hard to identify these parameters without any assumption. In the main paper (Section 4), we present the experimental results on four synthetic noisy datasets, i.e., F-MNIST, SVHN, CIF AR-10, and NEWS .
The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models
Conti, Lina, Fucci, Dennis, Gaido, Marco, Negri, Matteo, Wisniewski, Guillaume, Bentivogli, Luisa
Contrastive explanations, which indicate why an AI system produced one output (the target) instead of another (the foil), are widely regarded in explainable AI as more informative and interpretable than standard explanations. However, obtaining such explanations for speech-to-text (S2T) generative models remains an open challenge. Drawing from feature attribution techniques, we propose the first method to obtain contrastive explanations in S2T by analyzing how parts of the input spectrogram influence the choice between alternative outputs. Through a case study on gender assignment in speech translation, we show that our method accurately identifies the audio features that drive the selection of one gender over another. By extending the scope of contrastive explanations to S2T, our work provides a foundation for better understanding S2T models.
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.68)