
Understanding and Exploring the Network with Stochastic Architectures

Neural Information Processing Systems

There is an emerging trend to train a network with stochastic architectures so that various architectures can be plugged in and played during inference. However, existing investigations are highly entangled with neural architecture search (NAS), limiting their widespread use across scenarios. In this work, we decouple the training of a network with stochastic architectures (NSA) from NAS and provide the first systematic investigation of it as a stand-alone problem. We first uncover the characteristics of NSA in various aspects, ranging from training stability, convergence, and predictive behaviour to generalization capacity on unseen architectures. We identify various issues of the vanilla NSA, such as training/test disparity and function mode collapse, and propose solutions to these issues with theoretical and empirical insights. We believe these results can also serve as good heuristics for NAS. Given these understandings, we further apply NSA with our improvements to diverse scenarios to fully exploit its promise of inference-time architecture stochasticity, including model ensembling, uncertainty estimation, and semi-supervised learning. Remarkable performance (e.g., a 2.75% error rate and 0.0032 expected calibration error on CIFAR-10) validates the effectiveness of such a model, providing new perspectives on exploring the potential of networks with stochastic architectures beyond NAS.
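The abstract's core idea — one set of weights trained under a distribution of architectures, so that different sub-architectures can be sampled at inference time for ensembling or uncertainty estimation — can be sketched in a toy form. Everything below (the residual-block structure, the Bernoulli mask, the `keep_prob` value) is an illustrative assumption, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class StochasticArchNet:
    """Toy network with stochastic architectures: each forward pass samples
    which residual blocks are active, so training optimizes one weight set
    under a distribution of architectures (hypothetical construction)."""

    def __init__(self, dim=8, n_blocks=4, keep_prob=0.5):
        self.weights = [rng.standard_normal((dim, dim)) * 0.1
                        for _ in range(n_blocks)]
        self.keep_prob = keep_prob

    def forward(self, x, mask=None):
        # `mask` encodes one sampled architecture; sampling a fresh mask per
        # training step is what makes the architecture stochastic.
        if mask is None:
            mask = rng.random(len(self.weights)) < self.keep_prob
        for w, keep in zip(self.weights, mask):
            if keep:
                x = x + relu(x @ w)  # this block is "plugged in"
        return x, mask

net = StochasticArchNet()
x = rng.standard_normal(8)
y1, m1 = net.forward(x)                                # one sampled architecture
y2, _ = net.forward(x, mask=np.ones(4, dtype=bool))    # the full architecture
# At inference, distinct masks act as distinct ensemble members, which is the
# property the paper exploits for ensembling and uncertainty estimation.
```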



Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies

Hu, Yuxuan, Tan, Jianchao, Zhang, Jiaqi, Zan, Wen, Sun, Pingwei, Lu, Yifan, Sun, Yerui, Xie, Yuchen, Cai, Xunliang, Zhang, Jing

arXiv.org Artificial Intelligence

In this work, we conduct a systematic analysis of Native Sparse Attention (NSA) and propose targeted improvements that enhance long-context modeling. A key insight is that alternating between local (sliding-window) and global (compression, selective) attention across layers, rather than using fixed patterns, enables more effective propagation of long-range dependencies and substantially boosts performance on long-sequence tasks. Meanwhile, we further refine NSA's branches with Latent Attention: the sliding-window branch is enhanced with Multi-head Latent Attention (MLA), while the compression and selective branches adopt Group-head Latent Attention (GLA). These changes reduce KV-cache memory by 50% versus NSA while improving the model's common-sense reasoning and long-text understanding capabilities. Experiments on models from 340M to 1.3B parameters (trained on 15B and 100B tokens) show our method matches or exceeds full attention and native sparse attention in both common-sense reasoning and long-context understanding tasks.
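The local/global alternation the abstract describes can be illustrated with a minimal sketch: a causal sliding-window mask for the local branch, and a layer schedule that interleaves the two attention styles. The even/odd alternation and window size below are hypothetical placeholders; the paper's actual schedule and branch internals may differ:

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean mask for the local (sliding-window) branch: each query token
    attends causally to itself and the previous `window - 1` tokens."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

def layer_schedule(n_layers):
    """Hypothetical alternating schedule: instead of running every branch in
    every layer (fixed pattern), interleave local and global layers so
    long-range information can propagate through the global layers."""
    return ["local" if layer % 2 == 0 else "global"
            for layer in range(n_layers)]
```

In a fixed-pattern NSA layer all three branches run side by side; the alternating variant dedicates whole layers to one style, which is what the authors credit for better long-range dependency propagation.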


FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

Yan, Ran, Jiang, Youhe, Chen, Zhuoming, Mai, Haohui, Chen, Beidi, Yuan, Binhang

arXiv.org Artificial Intelligence

Recent advances in sparse attention mechanisms have demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), a state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers a substantial system-level performance boost while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA forces a loop order that is only efficient with a relatively large number of query heads in each Grouped Query Attention (GQA) group, whereas existing LLMs widely adopt a much smaller number of query heads per GQA group -- an inconsistency that significantly limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), an alternative kernel implementation that enables efficient NSA computation across a wide range of popular LLMs with varied, smaller numbers of query heads per GQA group on modern GPUs. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and on average 1.09x end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and on average 1.11x prefill-phase speedup in LLM generative inference. Github Repo at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.
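The GQA-group constraint the abstract mentions comes from NSA's block selection being shared within a group: all query heads in a group attend to the same selected KV blocks, so each block is loaded once per group and its cost is amortized over the group's heads. A toy sketch of that shared selection (the joint-scoring rule and shapes are illustrative, not the kernel's actual logic):

```python
import numpy as np

def select_blocks_per_group(scores, top_k):
    """NSA-style shared block selection.

    scores: (n_groups, heads_per_group, n_blocks) per-head importance scores.
    Heads within a GQA group pool their scores and share one top-k block
    list, so a kernel can load each selected KV block once per group.
    """
    group_scores = scores.sum(axis=1)  # pool over the heads in each group
    return np.argsort(-group_scores, axis=-1)[:, :top_k]

# With many heads per group, one block load serves all of them; with only
# 1-2 heads per group (common in recent LLMs), the vanilla loop order
# amortizes poorly -- the inefficiency FSA's alternative kernel targets.
```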


InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation

Zhao, Weilin, Zhou, Zihan, Su, Zhou, Xiao, Chaojun, Li, Yuxuan, Li, Yanghao, Zhang, Yudi, Zhao, Weilun, Li, Zhen, Huang, Yuxiang, Sun, Ao, Han, Xu, Liu, Zhiyuan

arXiv.org Artificial Intelligence

Long-sequence processing is a critical capability for modern large language models. However, the self-attention mechanism in the standard Transformer architecture faces severe computational and memory bottlenecks when processing long sequences. While trainable sparse attention methods offer a promising solution, existing approaches such as NSA introduce excessive extra parameters and disrupt the conventional "pretrain-on-short, finetune-on-long" workflow, resulting in slow convergence and difficulty in acceleration. To overcome these limitations, we introduce a dense-sparse switchable attention framework, termed InfLLM-V2. InfLLM-V2 is a trainable sparse attention mechanism that seamlessly adapts models from short to long sequences. Specifically, InfLLM-V2 reuses dense attention parameters through a parameter-free architecture modification, maintaining consistency between short- and long-sequence processing. Additionally, InfLLM-V2 ensures computational efficiency across all sequence lengths by using dense attention for short inputs and smoothly transitioning to sparse attention for long sequences. To achieve practical acceleration, we further introduce an efficient implementation of InfLLM-V2 that significantly reduces the computational overhead. Our experiments on long-context understanding and chain-of-thought reasoning demonstrate that InfLLM-V2 is 4x faster than dense attention while retaining 98.1% and 99.7% of the performance, respectively. Based on the InfLLM-V2 framework, we have trained and open-sourced MiniCPM4.1 (https://huggingface.co/openbmb/MiniCPM4.1-8B), a hybrid reasoning model, providing a reproducible implementation for the research community.
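The dense-sparse switching idea — the same projections serve both modes, and only the set of attendable keys changes with sequence length — can be sketched minimally. The switch threshold and the toy local-window sparse pattern below are assumptions for illustration; InfLLM-V2's real sparse pattern and dispatch rule are more elaborate:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, allowed):
    # One attention routine for both modes: `allowed` (a boolean mask) is
    # the only thing that differs, so all learned parameters are shared.
    s = q @ k.T / np.sqrt(q.shape[-1])
    s = np.where(allowed, s, -np.inf)
    return softmax(s) @ v

def switchable_attention(q, k, v, switch_len=8):
    """Hypothetical dispatch: dense causal attention for short inputs,
    a sparse (here: toy sliding-window) pattern for long ones."""
    n = q.shape[0]
    if n <= switch_len:                              # short input: dense
        allowed = np.tril(np.ones((n, n), dtype=bool))
    else:                                            # long input: sparse
        i = np.arange(n)[:, None]
        j = np.arange(n)[None, :]
        allowed = (j <= i) & (i - j < switch_len)
    return attend(q, k, v, allowed)
```

Because the switch is a mask choice rather than an architecture change, a model pretrained on short sequences keeps its dense behaviour exactly, which is the "seamless short-to-long adaptation" the abstract emphasizes.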


Cindy Cohn Is Leaving the EFF, but Not the Fight for Digital Rights

WIRED

After 25 years at the Electronic Frontier Foundation, Cindy Cohn is stepping down as executive director. In a WIRED interview, she reflects on encryption, AI, and why she's not ready to quit the battle. After a quarter century defending digital rights, Cindy Cohn announced on Tuesday that she is stepping down as executive director of the Electronic Frontier Foundation. Cohn, who has led the San Francisco-based nonprofit since 2015, says she will leave the role later this year, concluding a chapter that helped define the modern fight over online freedom. Cohn first rose to prominence as lead counsel in the 1990s case that overturned federal restrictions on publishing encryption code. As EFF's legal director and later executive director, she guided the group through legal challenges to government surveillance, reforms to computer crime laws, and efforts to hold corporations accountable for data collection. Over the past decade, EFF has expanded its influence, becoming a central force in shaping the debate over privacy, security, and digital freedom. In an interview with WIRED, Cohn reflected on EFF's foundational encryption victories, its unfinished battles against National Security Agency (NSA) surveillance, and the organization's work protecting independent security researchers.



Natively Trainable Sparse Attention for Hierarchical Point Cloud Datasets

Lapautre, Nicolas, Marchenko, Maria, Patiño, Carlos Miguel, Zhou, Xin

arXiv.org Artificial Intelligence

Unlocking the potential of transformers on datasets of large physical systems depends on overcoming the quadratic scaling of the attention mechanism. This work explores combining the Erwin architecture with the Native Sparse Attention (NSA) mechanism to improve the efficiency and receptive field of transformer models for large-scale physical systems, addressing the challenge of quadratic attention complexity. We adapt the NSA mechanism for non-sequential data, implement the Erwin NSA model, and evaluate it on three datasets from the physical sciences -- cosmology simulations, molecular dynamics, and air pressure modeling -- achieving performance that matches or exceeds that of the original Erwin model. Additionally, we reproduce the experimental results from the Erwin paper to validate our implementation.
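Adapting NSA's block-based branches to non-sequential data requires imposing some ordering and grouping on the points before blocks can be formed. A toy analogue of the compression branch under that adaptation (the single-axis sort stands in for Erwin's ball-tree hierarchy; block size and pooling rule are hypothetical):

```python
import numpy as np

def compress_point_blocks(points, feats, block_size=4):
    """Toy analogue of an NSA compression branch on a point cloud:
    order points spatially (a stand-in for a hierarchical/ball-tree
    ordering), group them into fixed-size blocks, and mean-pool each
    block into one coarse token for a global attention branch."""
    order = np.argsort(points[:, 0])           # hypothetical spatial ordering
    feats = feats[order]
    n_blocks = len(feats) // block_size
    blocks = feats[: n_blocks * block_size].reshape(n_blocks, block_size, -1)
    return blocks.mean(axis=1)                 # one coarse token per block
```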




Magnetic Resonance Spectroscopy Quantification Aided by Deep Estimations of Imperfection Factors and Macromolecular Signal

Chen, Dicheng, Lin, Meijin, Liu, Huiting, Li, Jiayu, Zhou, Yirong, Kang, Taishan, Lin, Liangjie, Wu, Zhigang, Wang, Jiazheng, Li, Jing, Lin, Jianzhong, Chen, Xi, Guo, Di, Qu, Xiaobo

arXiv.org Artificial Intelligence

Objective: Magnetic Resonance Spectroscopy (MRS) is an important technique for biomedical detection. However, it is challenging to accurately quantify metabolites with proton MRS due to serious overlaps of metabolite signals, imperfections arising from non-ideal acquisition conditions, and interference from strong background signals, mainly from macromolecules. The most popular method, LCModel, adopts complicated non-linear least squares to quantify metabolites and addresses these problems by designing empirical priors such as basis-sets and imperfection factors. However, when the signal-to-noise ratio of the MRS signal is low, the solution may deviate substantially. Methods: Linear Least Squares (LLS) is integrated with deep learning to reduce the complexity of the overall quantification. First, a neural network is designed to explicitly predict the imperfection factors and the overall signal from macromolecules. Then, metabolite quantification is solved analytically with the introduced LLS. In our Quantification Network (QNet), LLS takes part in the backpropagation of network training, which allows the quantification error to feed back into metabolite spectrum estimation. This scheme greatly improves generalization to metabolite concentrations unseen during training compared to end-to-end deep learning methods. Results: Experiments show that, compared with LCModel, the proposed QNet has smaller quantification errors on simulated data and presents more stable quantification on 20 in vivo datasets from healthy subjects across a wide range of signal-to-noise ratios. QNet also outperforms other end-to-end deep learning methods. Conclusion: This study provides an intelligent, reliable, and robust method for MRS quantification. Significance: QNet is the first LLS quantification aided by deep learning.
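The analytical LLS step the abstract describes has a simple closed form: once a network has supplied the (imperfection-corrected) basis spectra and the macromolecular background, metabolite concentrations follow from an ordinary least-squares solve. A toy version with synthetic data (the basis, background, and concentrations below are fabricated purely for illustration, not MRS data):

```python
import numpy as np

# Hypothetical setup: B holds basis spectra the network has already
# corrected for imperfection factors, m is the predicted macromolecular
# background, and y is the observed spectrum.
rng = np.random.default_rng(0)
n_points, n_metab = 64, 3
B = rng.standard_normal((n_points, n_metab))   # basis spectra (toy)
c_true = np.array([1.0, 0.5, 2.0])             # true concentrations (toy)
m = 0.1 * rng.standard_normal(n_points)        # macromolecular background
y = B @ c_true + m                             # observed spectrum

# Analytical LLS step: concentrations minimizing ||B c - (y - m)||^2.
# Because lstsq is differentiable in B and m, the quantification error can
# backpropagate into the network that predicts them, as QNet does.
c_hat, *_ = np.linalg.lstsq(B, y - m, rcond=None)
```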