state expansion
Scaling Linear Attention with Sparse State Expansion
Pan, Yuqi, An, Yongqi, Li, Zheng, Chou, Yuhong, Zhu, Ruijie, Wang, Xiaohui, Wang, Mingxuan, Wang, Jinqiao, Li, Guoqi
The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Supported by efficient parallelized implementations, our design achieves effective classification and highly discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.5 on AIME24 and 50.2 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.
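The row-sparse update described in the abstract above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the state shape, and the choice to write with the un-renormalized softmax weights are all assumptions made for illustration. It contrasts a dense linear-attention write, which touches every row of the state, with a write restricted to the top-k rows selected by a softmax over the key logits.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def dense_update(state, k, v):
    """Standard linear-attention write: every row of the state changes."""
    return state + np.outer(k, v)

def topk_sparse_update(state, k_logits, v, top_k):
    """Hypothetical row-sparse write: only the top_k rows receive v."""
    probs = softmax(k_logits)                 # soft "classification" of the token
    rows = np.argsort(probs)[-top_k:]         # hard selection of the top_k classes
    new_state = state.copy()
    new_state[rows] += np.outer(probs[rows], v)
    return new_state, rows

rng = np.random.default_rng(0)
n_rows, d_v, top_k = 8, 4, 2                  # illustrative sizes only
state = np.zeros((n_rows, d_v))
k_logits = rng.normal(size=n_rows)
v = rng.normal(size=d_v)

new_state, rows = topk_sparse_update(state, k_logits, v, top_k)
touched = np.flatnonzero(new_state.any(axis=1))
print("selected rows:", sorted(rows.tolist()), "modified rows:", touched.tolist())
```

Only the selected rows change, so tokens assigned to different rows do not overwrite one another; this is the interference-reduction intuition the abstract refers to, under the assumptions stated above.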
Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks
Sieber, Jerome, Alonso, Carmen Amo, Didier, Alexandre, Zeilinger, Melanie N., Orvieto, Antonio
Softmax attention is the principal backbone of foundation models for various artificial intelligence applications, yet its quadratic complexity in sequence length can limit its inference throughput in long-context settings. To address this challenge, alternative architectures such as linear attention, State Space Models (SSMs), and Recurrent Neural Networks (RNNs) have been considered as more efficient alternatives. While connections between these approaches exist, such models are commonly developed in isolation, and there is a lack of theoretical understanding of the shared principles underpinning these architectures and of their subtle differences, which greatly influence performance and scalability. In this paper, we introduce the Dynamical Systems Framework (DSF), which allows a principled investigation of all these architectures in a common representation. Our framework facilitates rigorous comparisons, providing new insights on the distinctive characteristics of each model class. For instance, we compare linear attention and selective SSMs, detailing their differences and conditions under which both are equivalent. We also provide principled comparisons between softmax attention and other model classes, discussing the theoretical conditions under which softmax attention can be approximated. Additionally, we substantiate these new insights with empirical validations and mathematical arguments. This shows the DSF's potential to guide the systematic development of future more efficient and scalable foundation models.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
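As a small illustration of why attention-like layers and recurrent models can share one representation, the sketch below writes unnormalized causal linear attention in two ways: as a masked matrix product and as a linear recurrence over a matrix-valued state. It is not the DSF itself; the function names and sizes are hypothetical, and normalization and feature maps are omitted for brevity.

```python
import numpy as np

def linear_attention_parallel(Q, K, V):
    """Masked (causal) attention without softmax: y_t = sum_{s<=t} (q_t . k_s) v_s."""
    scores = np.tril(Q @ K.T)
    return scores @ V

def linear_attention_recurrent(Q, K, V):
    """The same map as a recurrence: h_t = h_{t-1} + k_t v_t^T, y_t = q_t^T h_t."""
    T, d_k = K.shape
    d_v = V.shape[1]
    h = np.zeros((d_k, d_v))        # fixed-size state, independent of T
    out = np.zeros((T, d_v))
    for t in range(T):
        h = h + np.outer(K[t], V[t])
        out[t] = Q[t] @ h
    return out

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 3, 2               # illustrative sizes only
Q, K, V = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)), rng.normal(size=(T, d_v))
same = np.allclose(linear_attention_parallel(Q, K, V), linear_attention_recurrent(Q, K, V))
print("parallel and recurrent forms agree:", same)
```

The recurrent form makes the state transition explicit (here the identity), which is the kind of quantity a dynamical-systems comparison across model classes would examine.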
GePA*SE: Generalized Edge-Based Parallel A* for Slow Evaluations
Mukherjee, Shohin, Likhachev, Maxim
Parallel search algorithms have been shown to improve planning speed by harnessing the multithreading capability of modern processors. One such algorithm, PA*SE, achieves this by parallelizing state expansions, whereas another, ePA*SE, achieves it by effectively parallelizing edge evaluations. ePA*SE targets domains in which the action space comprises actions with expensive but similar evaluation times. However, in a number of robotics domains, the action space is heterogeneous in the computational effort required to evaluate the cost of an action and its outcome. Motivated by this, we introduce GePA*SE: Generalized Edge-based Parallel A* for Slow Evaluations, which generalizes the key ideas of PA*SE and ePA*SE, i.e., parallelization of state expansions and edge evaluations, respectively. This extends its applicability to domains whose actions require varying computational effort to evaluate. The open-source code for GePA*SE along with the baselines is available here: https://github.com/shohinm/parallel_search
- North America > United States > Pennsylvania > Centre County > University Park (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
ePA*SE: Edge-based Parallel A* for Slow Evaluations
Mukherjee, Shohin, Aine, Sandip, Likhachev, Maxim
Parallel search algorithms harness the multithreading capability of modern processors to achieve faster planning. One such algorithm is PA*SE (Parallel A* for Slow Expansions), which parallelizes state expansions to achieve faster planning in domains where state expansions are slow. In this work, we propose ePA*SE (Edge-based Parallel A* for Slow Evaluations), which improves on PA*SE by parallelizing edge evaluations instead of state expansions. This makes ePA*SE more efficient in domains where edge evaluations are expensive and need varying amounts of computational effort, which is often the case in robotics. On the theoretical front, we show that ePA*SE provides rigorous optimality guarantees. In addition, ePA*SE can be trivially extended to handle an inflation weight on the heuristic, resulting in a bounded-suboptimal algorithm, w-ePA*SE (Weighted ePA*SE), that trades off optimality for faster planning. On the experimental front, we validate the proposed algorithm in two different planning domains: 1) motion planning for 3D humanoid navigation and 2) task and motion planning for a dual-arm robotic assembly task. We show that ePA*SE can be significantly more efficient than PA*SE and other alternatives. The open-source code for ePA*SE along with the baselines is available here: https://github.com/shohinm/parallel_search
- North America > United States > Pennsylvania > Centre County > University Park (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Law > Environmental Law (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
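To make the idea of parallel edge evaluation concrete, here is a deliberately simplified sketch. It is not ePA*SE or PA*SE: it is ordinary (optionally weighted) A* that evaluates the outgoing edges of the state being expanded concurrently in a thread pool and waits for all of them, whereas the algorithms in the abstracts above interleave this work with the search itself. The function names, the grid domain, and the sleep-based cost are assumptions for illustration; a consistent heuristic is assumed.

```python
import heapq
import time
from concurrent.futures import ThreadPoolExecutor

def astar_parallel_edges(start, goal, successors, edge_cost, heuristic,
                         weight=1.0, workers=4):
    """A* whose per-expansion edge costs are computed in parallel threads."""
    g = {start: 0.0}
    open_list = [(weight * heuristic(start), start)]
    closed = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while open_list:
            _, s = heapq.heappop(open_list)
            if s in closed:
                continue                      # stale queue entry
            if s == goal:
                return g[s]
            closed.add(s)
            succs = [n for n in successors(s) if n not in closed]
            # Evaluate all outgoing edges of s concurrently, then relax them.
            costs = list(pool.map(lambda n: edge_cost(s, n), succs))
            for n, c in zip(succs, costs):
                if g[s] + c < g.get(n, float("inf")):
                    g[n] = g[s] + c
                    heapq.heappush(open_list, (g[n] + weight * heuristic(n), n))
    return None

def grid_successors(s, size=5):
    x, y = s
    steps = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in steps if 0 <= a < size and 0 <= b < size]

def slow_unit_cost(s, n):
    time.sleep(0.001)    # stand-in for an expensive check, e.g. collision testing
    return 1.0

goal = (4, 4)
manhattan = lambda s: abs(goal[0] - s[0]) + abs(goal[1] - s[1])
cost = astar_parallel_edges((0, 0), goal, grid_successors, slow_unit_cost, manhattan)
weighted_cost = astar_parallel_edges((0, 0), goal, grid_successors, slow_unit_cost,
                                     manhattan, weight=2.0)
print(cost, weighted_cost)
```

Setting `weight` above 1 inflates the heuristic in the usual weighted-A* manner, which typically reduces expansions at the price of a solution cost that may exceed the optimum by at most that factor.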
Learning Heuristic Selection with Dynamic Algorithm Configuration
Speck, David, Biedenkapp, André, Hutter, Frank, Mattmüller, Robert, Lindauer, Marius
A key challenge in satisficing planning is to use multiple heuristics within one heuristic search. An aggregation of multiple heuristic estimates, for example by taking the maximum, has the disadvantage that bad estimates of a single heuristic can negatively affect the whole search. Since the performance of a heuristic varies from instance to instance, approaches such as algorithm selection can be successfully applied. In addition, alternating between multiple heuristics during the search makes it possible to use all heuristics equally and improve performance. However, all these approaches ignore the internal search dynamics of a planning system, which can help to select the most helpful heuristics for the current expansion step. We show that dynamic algorithm configuration can be used for dynamic heuristic selection which takes into account the internal search dynamics of a planning system. Furthermore, we prove that this approach generalizes over existing approaches and that it can exponentially improve the performance of the heuristic search. To learn dynamic heuristic selection, we propose an approach based on reinforcement learning and show empirically that domain-wise learned policies, which take the internal search dynamics of a planning system into account, can exceed existing approaches in terms of coverage.
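The following sketch shows where a heuristic-selection policy plugs into a search loop. It is an assumption-laden illustration, not the authors' system: a greedy best-first search keeps one open list per heuristic, and a hypothetical `policy(step, open_lists)` callback chooses which list to expand from at each step. The round-robin policy shown corresponds to plain alternation; a learned policy would instead inspect features of the open lists.

```python
import heapq
from itertools import count

def multi_heuristic_gbfs(start, is_goal, successors, heuristics, policy):
    """Greedy best-first search with one open list per heuristic."""
    tie = count()                                   # tie-breaker for heap entries
    open_lists = [[(h(start), next(tie), start)] for h in heuristics]
    parent = {start: None}
    expanded = set()
    step = 0
    while any(open_lists):
        i = policy(step, open_lists)                # pick a heuristic for this step
        if not open_lists[i]:                       # fall back if that list is empty
            i = next(j for j, ol in enumerate(open_lists) if ol)
        _, _, s = heapq.heappop(open_lists[i])
        if s in expanded:
            continue                                # already expanded via another list
        expanded.add(s)
        step += 1
        if is_goal(s):
            path = []
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1], step
        for n in successors(s):
            if n not in parent:
                parent[n] = s
                for ol, h in zip(open_lists, heuristics):
                    heapq.heappush(ol, (h(n), next(tie), n))
    return None, step

# Toy domain (hypothetical): integers, moves +1 or +2, goal state 10.
successors = lambda n: [m for m in (n + 1, n + 2) if m <= 12]
heuristics = [lambda n: abs(10 - n), lambda n: n % 3]   # one informed, one poor
round_robin = lambda step, open_lists: step % len(open_lists)

path, steps = multi_heuristic_gbfs(0, lambda n: n == 10, successors,
                                   heuristics, round_robin)
print(path, steps)
```

Replacing `round_robin` with a function of the open lists' contents (for example their minimum heuristic values or sizes) gives the kind of dynamic, state-dependent selection the abstract describes.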
A Brief History and Recent Achievements in Bidirectional Search
Sturtevant, Nathan R. (University of Denver) | Felner, Ariel (Ben-Gurion University)
The state of the art in bidirectional search has changed significantly in a very short time period; we can now answer questions about unidirectional and bidirectional search that until very recently we were unable to answer. This paper is designed to provide an accessible overview of the recent research in bidirectional search in the context of the broader efforts over the last 50 years. We give particular attention to new theoretical results and the algorithms they inspire for optimal and near-optimal node expansions when finding a shortest path.
- Asia > Middle East > Israel (0.04)
- Africa > Eswatini > Manzini > Manzini (0.04)
- North America > United States > Colorado (0.04)
TBA*: Time-Bounded A*
Björnsson, Yngvi (Reykjavik University) | Bulitko, Vadim (University of Alberta) | Sturtevant, Nathan (University of Alberta)
Real-time heuristic search algorithms are used for planning by agents in situations where a constant-bounded amount of deliberation time is required for each action regardless of the problem size. Such algorithms interleave their planning and execution to ensure real-time response. Furthermore, to guarantee completeness, they typically store improved heuristic estimates for previously expanded states. Although subsequent planning steps can benefit from updated heuristic estimates, many of the same states are expanded over and over again. Here we propose a variant of the A* algorithm, Time-Bounded A* (TBA*), that guarantees real-time response. In the domain of path-finding on video-game maps, TBA* expands an order of magnitude fewer states than traditional real-time search algorithms, while finding paths of comparable quality. It reaches the same level of performance as recent state-of-the-art real-time search algorithms but, unlike these, requires neither state-space abstractions nor pre-computed pattern databases.
- North America > Canada > Alberta (0.14)
- Europe > Iceland > Capital Region > Reykjavik (0.04)
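The core mechanism behind a time-bounded A*, a single search that is paused and resumed with a fixed expansion budget per time slice, can be sketched as follows. This is a simplified illustration under stated assumptions rather than TBA* as published: the class name and interface are hypothetical, the budget is counted in expansions instead of wall-clock time, the heuristic is assumed consistent, and the agent's movement and backtracking between slices are omitted.

```python
import heapq

class BudgetedAStar:
    """A* that can be paused after a fixed number of expansions and resumed."""

    def __init__(self, start, goal, successors, heuristic):
        self.goal = goal
        self.successors = successors        # state -> iterable of (neighbor, cost)
        self.heuristic = heuristic
        self.g = {start: 0.0}
        self.parent = {start: None}
        self.open = [(heuristic(start), start)]
        self.closed = set()
        self.best = start                   # most recently expanded state
        self.done = False

    def step(self, budget):
        """Expand at most `budget` states, then return control to the caller."""
        expansions = 0
        while self.open and expansions < budget and not self.done:
            _, s = heapq.heappop(self.open)
            if s in self.closed:
                continue                    # stale queue entry
            self.closed.add(s)
            self.best = s
            expansions += 1
            if s == self.goal:
                self.done = True
                break
            for n, c in self.successors(s):
                new_g = self.g[s] + c
                if new_g < self.g.get(n, float("inf")):
                    self.g[n] = new_g
                    self.parent[n] = s
                    heapq.heappush(self.open, (new_g + self.heuristic(n), n))
        return expansions

    def path_to_best(self):
        """Trace parents back from the most promising state found so far."""
        path, s = [], self.best
        while s is not None:
            path.append(s)
            s = self.parent[s]
        return path[::-1]

goal = (5, 5)

def neighbors(s):
    x, y = s
    steps = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [((a, b), 1.0) for a, b in steps if 0 <= a < 6 and 0 <= b < 6]

search = BudgetedAStar((0, 0), goal, neighbors,
                       lambda s: abs(goal[0] - s[0]) + abs(goal[1] - s[1]))
slices = []
while not search.done and search.open:
    slices.append(search.step(budget=5))
    # A real-time agent would act here, e.g. move along search.path_to_best().
path = search.path_to_best()
print(len(slices), "slices, path length", len(path) - 1)
```

Because the open and closed lists persist across slices, no state is expanded twice, which is the contrast the abstract draws with real-time algorithms that re-expand the same states in every planning episode.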