Shaping Sequence Attractor Schema in Recurrent Neural Networks
Sequence schemas are abstract, reusable knowledge structures that facilitate rapid adaptation and generalization in novel sequential tasks. In both animals and humans, shaping is an efficient way to acquire such schemas, particularly in complex sequential tasks. As a form of curriculum learning, shaping works by progressively advancing from simple subtasks to integrated full sequences, and ultimately enabling generalization across different task variations. Despite the importance of schemas in cognition and shaping in schema acquisition, the underlying neural dynamics at play remain poorly understood. To explore this, we train recurrent neural networks on an odor-sequence task using a shaping protocol inspired by well-established paradigms in experimental neuroscience. Our model provides the first systematic reproduction of key features of schema learning observed in the orbitofrontal cortex, including rapid adaptation to novel tasks, structured neural representation geometry, and progressive dimensionality compression during learning. Crucially, analysis of the trained RNN reveals that the learned schema is implemented through sequence attractors. These attractor dynamics emerge gradually through the shaping process: starting with isolated discrete attractors in simple tasks, evolving into linked sequences, and eventually abstracting into generalizable attractors that capture shared task structure. Moreover, applying our method to a keyword spotting task shows that shaping facilitates the rapid development of sequence attractor schemas, leading to enhanced learning efficiency.
STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation
Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems, due to exponential blow-ups in variance from importance weighting or compounding errors from learned dynamics models. To address these challenges, we propose STITCH-OPE, a model-based generative framework that leverages denoising diffusion for long-horizon OPE in high-dimensional state and action spaces. Starting with a diffusion model pre-trained on the behavior data, STITCH-OPE generates synthetic trajectories from the target policy by guiding the denoising process using the score function of the target policy. STITCH-OPE introduces two technical innovations that make it advantageous for OPE: (1) it prevents overregularization by subtracting the score of the behavior policy during guidance, and (2) it generates long-horizon trajectories by stitching partial trajectories together end-to-end. We provide a theoretical guarantee that under mild assumptions, these modifications result in an exponential reduction in variance versus long-horizon trajectory diffusion.
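The guidance idea described above can be illustrated with a minimal sketch. It assumes one-dimensional Gaussian policies with a shared standard deviation, and the names `guidance_term`, `guided_score`, `alpha`, and `lam` are hypothetical, introduced only for illustration; the abstract does not specify the actual parameterization or coefficients used by STITCH-OPE.

```python
def gaussian_policy_score(action, mean, std):
    """Gradient of log N(action; mean, std^2) with respect to the action."""
    return (mean - action) / (std ** 2)


def guidance_term(action, target_mean, behavior_mean, std=1.0, alpha=1.0):
    """Target-policy score minus alpha times the behavior-policy score.

    alpha is a hypothetical weight; alpha=0 recovers plain target-score guidance.
    """
    target = gaussian_policy_score(action, target_mean, std)
    behavior = gaussian_policy_score(action, behavior_mean, std)
    return target - alpha * behavior


def guided_score(diffusion_score, action, target_mean, behavior_mean,
                 std=1.0, alpha=1.0, lam=1.0):
    """Add the scaled guidance term to the pre-trained model's score (assumed additive)."""
    return diffusion_score + lam * guidance_term(
        action, target_mean, behavior_mean, std, alpha)


if __name__ == "__main__":
    # Identical target and behavior policies give zero guidance when alpha=1.
    print(guidance_term(0.3, target_mean=0.5, behavior_mean=0.5))
    # Differing policies push the sample toward the target policy's mean.
    print(guidance_term(0.3, target_mean=1.0, behavior_mean=0.0))
```

In this toy setting, subtracting the behavior score means the guidance vanishes whenever the two policies agree, so the pre-trained model is not pulled further toward behavior it already captures.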
GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers
Vision Transformers (ViTs) are essential in computer vision but are also computationally intensive. Model quantization, particularly to low bit-widths like 4-bit, aims to alleviate this difficulty, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. PTQ often incurs a substantial accuracy drop, while QAT achieves high accuracy but suffers from prohibitive computational costs, limited generalization to downstream tasks, training instability, and the lack of an open-source codebase. To address these challenges, this paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework designed for efficient and effective ViT quantization. GPLQ is founded on two key empirical insights: the paramount importance of activation quantization and the necessity of preserving the model's original optimization "basin" to maintain generalization. Consequently, GPLQ employs a sequential "activation-first, weights-later" strategy. Stage 1 keeps weights in FP32 while quantizing activations with a feature mimicking loss in only 1 epoch to keep the model in the same "basin", thereby preserving generalization.
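To make the Stage 1 description concrete, here is a minimal sketch of activation fake quantization paired with a feature mimicking loss. It assumes a uniform min/max quantizer and a mean-squared-error loss; both are illustrative choices, and the helper names `fake_quantize` and `mimic_loss` are hypothetical. The abstract does not state which quantizer or loss form GPLQ actually uses.

```python
def fake_quantize(xs, num_bits=4):
    """Uniformly quantize values to 2**num_bits levels, then map back to floats.

    Assumption: a simple min/max (asymmetric) quantizer over the given values.
    """
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return list(xs)  # constant input: nothing to quantize
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    return [lo + round((x - lo) / scale) * scale for x in xs]


def mimic_loss(student_features, teacher_features):
    """Mean squared error between quantized-path and full-precision features."""
    diffs = [(s - t) ** 2 for s, t in zip(student_features, teacher_features)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    activations = [0.0, 0.1, 0.5, 1.0]       # stand-in for FP32 features
    quantized = fake_quantize(activations)    # 4-bit activation path
    print(quantized)
    print(mimic_loss(quantized, activations))
```

Minimizing such a loss pulls the quantized-activation features toward the full-precision ones, which is one plausible way to read the goal of staying in the original "basin".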
Towards Evaluating Proactive Risk Awareness of Multimodal Language Models
Human safety awareness gaps often prevent the timely recognition of everyday risks. In solving this problem, a proactive safety artificial intelligence (AI) system would work better than a reactive one. Instead of just reacting to users' questions, it would actively watch people's behavior and their environment to detect potential dangers in advance. Our Proactive Safety Bench (PaSBench) evaluates this capability through 416 multimodal scenarios (128 image sequences, 288 text logs) spanning 5 safety-critical domains. Evaluation of 36 advanced models reveals fundamental limitations: Top performers like Gemini-2.5-pro
Tighter CMI-Based Generalization Bounds via Stochastic Projection and Quantization
In this paper, we leverage stochastic projection and lossy compression to establish new conditional mutual information (CMI) bounds on the generalization error of statistical learning algorithms. It is shown that these bounds are generally tighter than the existing ones. In particular, we prove that for certain problem instances for which existing MI and CMI bounds were recently shown in Attias et al. [2024] and Livni [2023] to become vacuous or fail to describe the right generalization behavior, our bounds yield suitable generalization guarantees of the order of O(1/√n), where n is the size of the training dataset. Furthermore, we use our bounds to investigate the problem of data "memorization" raised in those works, which asserts that there are learning problem instances for which, for any learning algorithm that has good prediction, there exist distributions under which the algorithm must "memorize" a big fraction of the training dataset. We show that for every learning algorithm, there exists an auxiliary algorithm that does not memorize and which yields comparable generalization error for any data distribution. In part, this shows that memorization is not necessary for good generalization.
Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback
We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve O(log T) regret in stochastic settings and O(√T) regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.
Revisiting 1-peer exponential graph for enhancing decentralized learning efficiency
For communication-efficient decentralized learning, it is essential to employ dynamic graphs designed to improve the expected spectral gap by reducing deviations from global averaging. The 1-peer exponential graph demonstrates its finite-time convergence property (achieved by maximizing the expected spectral gap), but only when the number of nodes n is a power of two. However, its efficiency across any n and the commutativity of mixing matrices remain unexplored. We delve into the principles underlying the 1-peer exponential graph to explain its efficiency across any n and leverage them to develop new dynamic graphs. We propose two new dynamic graphs: the k-peer exponential graph and the null-cascade graph. Notably, the null-cascade graph achieves finite-time convergence for any n while ensuring commutativity. Our experiments confirm the effectiveness of these new graphs, particularly the null-cascade graph, in most test settings.
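The finite-time convergence property mentioned above can be checked with a small simulation. The sketch assumes the commonly used definition of the 1-peer exponential graph, in which node i at round t averages equally with node (i + 2^(t mod τ)) mod n, where τ = ⌈log2 n⌉; the abstract does not spell out this rule, and the function names are hypothetical.

```python
def one_peer_exponential_round(values, t):
    """One gossip round: node i averages with the node 2^(t mod tau) hops ahead."""
    n = len(values)
    if n < 2:
        return list(values)
    tau = (n - 1).bit_length()  # ceil(log2(n)) for n >= 2
    hop = 2 ** (t % tau)
    return [0.5 * values[i] + 0.5 * values[(i + hop) % n] for i in range(n)]


def gossip(values, rounds):
    """Run the given number of consecutive rounds, starting at t = 0."""
    for t in range(rounds):
        values = one_peer_exponential_round(values, t)
    return values


if __name__ == "__main__":
    # n = 8 is a power of two: three rounds reach the exact global average.
    print(gossip([float(i) for i in range(8)], 3))
    # n = 6 is not: the same three rounds leave the nodes in disagreement.
    print(gossip([float(i) for i in range(6)], 3))
```

With n = 8 the product of the three mixing matrices equals exact averaging, so every node holds 3.5 after three rounds. With n = 6 the sum is preserved but the nodes still differ, which is the gap the abstract's new graphs aim to close.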
Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation
Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to neglect retaining the capabilities of the pruned part. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) ineffective linear weight layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, which includes a progressive layer pruning framework with a Concatenation-based Merging technique and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b's parameters, the pruned model retains 83% of its original average accuracy.
Doubly Robust Alignment for Large Language Models
While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice.
US asks Anthropic to block global access to top AI models: Why it matters
The administration of US President Donald Trump has barred foreigners from accessing the top AI models developed by Anthropic, citing national security concerns, underscoring the US government's policy of export controls over advanced technology. The United States' measures come less than a week after Anthropic, the company behind the Claude chatbot, rolled out new artificial intelligence (AI) models named Claude Fable 5 and Mythos 5. The latest move has reignited the feud between Anthropic and the Trump administration. The San Francisco-based company is suing the administration after it was put on a supply chain blacklist for its refusal to allow the US military to use its AI models for domestic surveillance and fully autonomous weapons systems. Anthropic said the US government gave the company an order citing national security concerns, but did not specify further details.