Goto

Collaborating Authors

 latent attention


Latent Attention For If-Then Program Synthesis

Neural Information Processing Systems

Automatic translation from natural language descriptions into programs is a long-standing challenging problem. In this work, we consider a simple yet important sub-problem: translation from textual descriptions to If-Then programs. We devise a novel neural network architecture for this task which we train end-to-end. Specifically, we introduce Latent Attention, which computes multiplicative weights for the words in the description in a two-stage process with the goal of better leveraging the natural language structures that indicate the relevant parts for predicting program elements. Our architecture reduces the error rate by 28.57% compared to prior art. We also propose a one-shot learning scenario of If-Then program synthesis and simulate it with our existing dataset. We demonstrate a variation on the training procedure for this scenario that outperforms the original procedure, significantly closing the gap to the model trained with all data.


Latent Attention For If-Then Program Synthesis

Neural Information Processing Systems

Automatic translation from natural language descriptions into programs is a long-standing challenging problem. In this work, we consider a simple yet important sub-problem: translation from textual descriptions to If-Then programs. We devise a novel neural network architecture for this task which we train end-to-end. Specifically, we introduce Latent Attention, which computes multiplicative weights for the words in the description in a two-stage process with the goal of better leveraging the natural language structures that indicate the relevant parts for predicting program elements. Our architecture reduces the error rate by 28.57% compared to prior art. We also propose a one-shot learning scenario of If-Then program synthesis and simulate it with our existing dataset. We demonstrate a variation on the training procedure for this scenario that outperforms the original procedure, significantly closing the gap to the model trained with all data.


Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies

Hu, Yuxuan, Tan, Jianchao, Zhang, Jiaqi, Zan, Wen, Sun, Pingwei, Lu, Yifan, Sun, Yerui, Xie, Yuchen, Cai, Xunliang, Zhang, Jing

arXiv.org Artificial Intelligence

In this work, we conduct a systematic analysis of Native Sparse Attention (NSA) and propose targeted improvements that enhance long-context modeling. A key insight is that alternating between local (sliding-window) and global (compression, selective) attention across layers, rather than using fixed patterns, enables more effective propagation of long-range dependencies and substantially boosts performance on long-sequence tasks. Meanwhile, we further refine NSA's branches with Latent Attention that the sliding-window branch is enhanced with Multi-head Latent Attention (MLA) while compression and selective branches adopt Group-head Latent Attention (GLA). These changes reduce KV-cache memory by 50\% versus NSA while improving the model's common-sense reasoning and long-text understanding capabilities. Experiments on models from 340M to 1.3B parameters (trained on 15B and 100B tokens) show our method matches or exceeds full attention and native sparse attention in both common-sense reasoning and long-context understanding tasks.


Latent Attention For If-Then Program Synthesis

Chang Liu, Xinyun Chen, Eui Chul Shin, Mingcheng Chen, Dawn Song

Neural Information Processing Systems

Automatic translation from natural language descriptions into programs is a longstanding challenging problem. In this work, we consider a simple yet important sub-problem: translation from textual descriptions to If-Then programs. We devise a novel neural network architecture for this task which we train end-to-end. Specifically, we introduce Latent Attention, which computes multiplicative weights for the words in the description in a two-stage process with the goal of better leveraging the natural language structures that indicate the relevant parts for predicting program elements. Our architecture reduces the error rate by 28.


Muon: Training and Trade-offs with Latent Attention and MoE

Mehta, Sushant, Dandekar, Raj, Dandekar, Rajat, Panat, Sreedath

arXiv.org Artificial Intelligence

We present a comprehensive theoretical and empirical study of the Muon optimizer for training transformers only with a small to medium decoder (30M - 200M parameters), with an emphasis on its mathematical foundations, convergence properties and synergistic interactions with modern architectural optimizations. Building on recent work showing Muon's scalability, we provide rigorous theoretical analysis including: (i)showing the convergence rate under standard assumptions, (ii) spectral regularization properties that prevent gradient explosion, (iii) connection to natural gradient descent on the Stiefel manifold, and (iv) equivalence to steepest gradient descent under the spectral norm. Crucially, we demonstrate that Muon expands the Pareto frontier in the compute-time trade-off by maintaining superior data efficiency at large batch sizes, a key finding of~\cite{essentialai2025muon} that we validate across our model scales. Empirically, Muon reaches the target loss with 48-52\% of the training calculated by AdamW while maintaining or improving the final perplexity, consistent with larger-scale results. When combined with Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE), we observe multiplicative efficiency gains: MLA+MoE+Muon achieves 68\% memory reduction and 3.2$\times$ inference speedup, while improving perplexity by 8-12\%. We provide detailed procedures on 15 architectural and optimizer components, stability analyzes across 100+ training runs, and practical implementation guidelines including Newton-Schulz coefficients $(3.4445, -4.7750, 2.0315)$ optimized by~\cite{su2024muonblog}. Our theoretical analysis and comprehensive experiments establish Muon as a principled, robust alternative to AdamW that particularly excels when combined with modern efficiency techniques and large-batch training regimes.


EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs

Cai, Zhengge, Hou, Haowen

arXiv.org Artificial Intelligence

Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs), especially under latency and memory constraints. While Multi-Head Attention (MHA) offers strong representational power, it incurs significant memory overhead. Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space, achieving a better trade-off between performance and cache efficiency. While MLA already achieves significant KV cache reduction, the scope for further compression remains limited without performance loss. In this paper, we propose \textbf{Embedding-Gated Multi-head Latent Attention (EG-MLA)}, a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness. EG-MLA introduces a token-specific embedding gating mechanism applied in the latent space, enabling fine-grained modulation of compressed KV vectors with minimal additional computation. Compared to MHA, EG-MLA achieves over 91.6\% reduction in KV cache size with negligible performance degradation. Relative to MLA, EG-MLA consistently improves task accuracy across diverse reasoning benchmarks while achieving up to 59.9\% additional memory savings. Our theoretical analysis highlights how embedding gating induces implicit high-order interactions, and empirical evaluations demonstrate robust generalization across model scales and compression regimes. Notably, we successfully scale EG-MLA to over 1 billion parameters, demonstrating its practical viability for large-scale LLM deployment. These results establish EG-MLA as a memory- and compute-efficient attention mechanism that enables scalable, high-performance inference in modern LLMs.


TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference

Tang, Xiaojuan, Meng, Fanxu, Tang, Pingzhi, Wang, Yuxuan, Yin, Di, Sun, Xing, Zhang, Muhan

arXiv.org Artificial Intelligence

Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, compresses key-value states into a low-rank latent vector, caching only this vector to reduce memory. In tensor parallelism (TP), however, attention heads are computed across multiple devices, and each device must load the full cache, eroding the advantage of MLA over Grouped Query Attention (GQA). We propose Tensor-Parallel Latent Attention (TPLA): a scheme that partitions both the latent representation and each head's input dimension across devices, performs attention independently per shard, and then combines results with an all-reduce. TPLA preserves the benefits of a compressed KV cache while unlocking TP efficiency. Unlike Grouped Latent Attention (GLA), every head in TPLA still leverages the full latent representation, maintaining stronger representational capacity. TPLA is drop-in compatible with models pre-trained using MLA: it supports MLA-style prefilling and enables efficient tensor-parallel decoding without retraining. Applying simple orthogonal transforms -- e.g., the Hadamard transform or PCA -- before TP slicing further mitigates cross-shard interference, yielding minimal accuracy degradation. By reducing the per-device KV cache for DeepSeek-V3 and Kimi-K2, we achieve 1.79x and 1.93x speedups, respectively, at a 32K-token context length while maintaining performance on commonsense and LongBench benchmarks. TPLA can be implemented with FlashAttention-3, enabling practical end-to-end acceleration.


A Review of DeepSeek Models' Key Innovative Techniques

Wang, Chengen, Kantarcioglu, Murat

arXiv.org Artificial Intelligence

DeepSeek-V3 and DeepSeek-R1 are leading open-source Large Language Models (LLMs) for general-purpose tasks and reasoning, achieving performance comparable to state-of-the-art closed-source models from companies like OpenAI and Anthropic -- while requiring only a fraction of their training costs. Understanding the key innovative techniques behind DeepSeek's success is crucial for advancing LLM research. In this paper, we review the core techniques driving the remarkable effectiveness and efficiency of these models, including refinements to the transformer architecture, innovations such as Multi-Head Latent Attention and Mixture of Experts, Multi-Token Prediction, the co-design of algorithms, frameworks, and hardware, the Group Relative Policy Optimization algorithm, post-training with pure reinforcement learning and iterative training alternating between supervised fine-tuning and reinforcement learning. Additionally, we identify several open questions and highlight potential research opportunities in this rapidly advancing field.


TransMLA: Multi-Head Latent Attention Is All You Need

Meng, Fanxu, Yao, Zengwei, Zhang, Muhan

arXiv.org Artificial Intelligence

Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of Deepseek R1.