

Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models

Hu, Xiang, Zhou, Zhanchao, Liang, Ruiqi, Li, Zehuan, Wu, Wei, Li, Jianguo

arXiv.org Artificial Intelligence

This work explores the challenge of building "Machines that Can Remember", framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: sparsity, random-access flexibility, and length generalization. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, an 8B-parameter MoE model trained on over 8 trillion tokens and rigorously evaluated on different tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90% accuracy on most in-context retrieval tasks with contexts up to 16M tokens. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.
Figure 1: Despite being pre-trained with an 8K context window and mid-trained up to 32K, HSA-UltraLong achieves near-perfect accuracy on S-NIAH even at a 16M-token context length. The red dashed line at 32K marks the boundary between in-domain (left) and out-of-domain (right).


Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access

Hu, Xiang, Leng, Jiaqi, Zhao, Jun, Tu, Kewei, Wu, Wei

arXiv.org Artificial Intelligence

A key advantage of Recurrent Neural Networks (RNNs) over Transformers is that their linear computational and space complexity enables faster training and inference on long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose Hierarchical Sparse Attention (HSA), a novel attention mechanism that enhances RNNs with long-range random-access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selects the top-$k$ chunks, and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64-million-token contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with a nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.
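The chunk-selection idea in this abstract can be illustrated with a toy sketch: score chunks from the token-level scores inside them (here via a simple max, standing in for the learned relevance the paper describes), keep only the top-$k$ chunks, and attend over just their tokens. The function `hsa_step` and the max-pooling aggregation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hsa_step(q, keys, values, chunk_size=4, top_k=2):
    """One query attends over only the top-k most relevant chunks.

    Chunk relevance is aggregated from token-level scores inside each
    chunk (max-pooled here), rather than from a single pooled chunk
    embedding -- a stand-in for the learned token-to-chunk relevance.
    """
    n, d = keys.shape
    n_chunks = n // chunk_size
    k_chunks = keys[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d)
    v_chunks = values[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d)

    # Token-level scores per chunk, aggregated into one score per chunk.
    token_scores = k_chunks @ q / np.sqrt(d)   # (n_chunks, chunk_size)
    chunk_scores = token_scores.max(axis=1)    # (n_chunks,)

    # Keep only the top-k chunks; attention cost is O(top_k * chunk_size).
    top = np.argsort(chunk_scores)[-top_k:]
    sel_k = k_chunks[top].reshape(-1, d)
    sel_v = v_chunks[top].reshape(-1, d)

    attn = softmax(sel_k @ q / np.sqrt(d))
    return attn @ sel_v
```

When `top_k` equals the number of chunks, this reduces exactly to full softmax attention, since the weighted sum is invariant to chunk ordering.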


DrDiff: Dynamic Routing Diffusion with Hierarchical Attention for Breaking the Efficiency-Quality Trade-off

Zhang, Jusheng, Fan, Yijia, Cai, Kaitong, Huang, Zimeng, Sun, Xiaofei, Wang, Jian, Tang, Chengpei, Wang, Keze

arXiv.org Artificial Intelligence

This paper introduces DrDiff, a novel framework for long-text generation that overcomes the efficiency-quality trade-off through three core technologies. First, we design a dynamic expert scheduling mechanism that intelligently allocates computational resources during the diffusion process based on text complexity, enabling more efficient handling of text generation tasks of varying difficulty. Second, we introduce a Hierarchical Sparse Attention (HSA) mechanism that adaptively adjusts attention patterns according to the input length, reducing computational complexity from O($n^2$) to O($n$) while maintaining model performance. Finally, we propose a soft absorption guidance optimization strategy that, combined with DPM-solver++, reduces the number of diffusion steps, significantly improving generation speed. Comprehensive experiments on various long-text generation benchmarks demonstrate the superiority of our DrDiff over existing SOTA methods.
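The O($n^2$) to O($n$) reduction claimed here follows the general pattern of restricting each token's attention to a bounded neighborhood. As a generic illustration (not DrDiff's hierarchical mechanism), a fixed sliding window gives each of the $n$ tokens at most $2w{+}1$ keys to score, so total cost is O($n \cdot w$), i.e. linear for fixed $w$:

```python
import numpy as np

def local_attention(q, k, v, window=8):
    """Sliding-window attention: token i attends only to positions within
    `window` of i, so the total cost is O(n * window) rather than O(n^2).
    A generic linear-attention-pattern illustration, not DrDiff's HSA."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = k[lo:hi] @ q[i] / np.sqrt(d)
        scores -= scores.max()
        w = np.exp(scores)
        w /= w.sum()
        out[i] = w @ v[lo:hi]
    return out
```

With `window >= n` the loop degenerates to ordinary full attention, which makes the trade-off explicit: the window size buys speed at the cost of receptive field.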


Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

Amizadeh, Saeed, Abdali, Sara, Li, Yinheng, Koishida, Kazuhito

arXiv.org Machine Learning

Transformers and their attention mechanism have been revolutionary in the field of Machine Learning. While originally proposed for language data, they quickly found their way to image, video, graph, and other data modalities with various signal geometries. Despite this versatility, generalizing the attention mechanism to scenarios where data is presented at different scales from potentially different modalities is not straightforward. The attempts to incorporate hierarchy and multi-modality within transformers are largely based on ad hoc heuristics, which are not seamlessly generalizable to similar problems with potentially different structures. To address this problem, in this paper, we take a fundamentally different approach: we first propose a mathematical construct to represent multi-modal, multi-scale data. We then mathematically derive the neural attention mechanics for the proposed construct from the first principle of entropy minimization. We show that the derived formulation is optimal in the sense of being the closest to the standard Softmax attention while incorporating the inductive biases originating from the hierarchical/geometric information of the problem. We further propose an efficient algorithm based on dynamic programming to compute our derived attention mechanism. By incorporating it within transformers, we show that the proposed hierarchical attention mechanism not only can be employed to train transformer models in hierarchical/multi-modal settings from scratch, but can also be used to inject hierarchical information into classical, pre-trained transformer models after training, resulting in more efficient models in a zero-shot manner.


Spring-Brake! Handed Shearing Auxetics Improve Efficiency of Hopping and Standing

Sullivan, Joseph, Good, Ian, Burden, Samuel A., Lipton, Jeffrey Ian

arXiv.org Artificial Intelligence

Energy efficiency is critical to the success of legged robotics. Efficiency is lost through wasted energy during locomotion and standing. Including elastic elements has been shown to reduce movement costs, while including brakes can reduce standing costs. However, adding separate elements for each increases the mass and complexity of a leg, reducing overall system performance. Here we present a novel compliant mechanism using a Handed Shearing Auxetic (HSA) that acts as a spring and brake in a monopod hopping robot. The HSA acts as a parallel elastic actuator, reducing electrical power for dynamic hopping and matching the efficiency of state-of-the-art compliant hoppers. The HSA's auxetic behavior enables dual functionality. During static tasks, it locks under large forces with minimal input power by blocking deformation, creating high friction similar to a capstan mechanism. This allows the leg to support heavy loads without motor torque, addressing thermal inefficiency. The multi-functional design enhances both dynamic and static performance, offering a versatile solution for robotic applications.


Force and Speed in a Soft Stewart Platform

Ketchum, Jake, Avtges, James, Schlafly, Millicent, Young, Helena, Kim, Taekyoung, Truby, Ryan L., Murphey, Todd D.

arXiv.org Artificial Intelligence

Many soft robots struggle to produce dynamic motions with fast, large displacements. We develop a parallel 6 degree-of-freedom (DoF) Stewart-Gough mechanism using Handed Shearing Auxetic (HSA) actuators. By using soft actuators, we are able to use one third as many mechatronic components as a rigid Stewart platform, while retaining a working payload of 2 kg and an open-loop bandwidth greater than 16 Hz. We show that the platform is capable of both precise tracing and dynamic disturbance rejection when controlling a ball and sliding puck using a Proportional Integral Derivative (PID) controller. We develop a machine-learning-based kinematics model and demonstrate a functional workspace of roughly 10 cm in each translation direction and 28 degrees in each orientation. This 6-DoF device has many of the characteristics associated with rigid components (power, speed, and total workspace) while capturing the advantages of soft mechanisms. Soft robots promise to be safer, more resilient, and more adaptable than their rigid counterparts. This is particularly valuable for systems that are expected to touch and interact with people. However, existing soft 6-DoF parallel mechanisms struggle to produce the forces, displacements, and response times required for mass adoption [1]-[3]. A substantial driver of this capability gap is the many limitations of soft actuator technologies.
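The PID control loop mentioned above is a standard building block; a minimal discrete-time sketch is below. The class, gains, and the simple integrator plant in the usage note are illustrative assumptions, not details from the paper.

```python
class PID:
    """Minimal discrete-time PID controller (illustrative; gains arbitrary)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement):
        # Proportional term on the current error, integral on its running
        # sum, derivative on its finite difference.
        error = setpoint - measurement
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0  # avoid a derivative kick on the first step
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

Driving a toy integrator plant (`x += u * dt`) with this controller steers `x` to the setpoint; tuning the three gains trades response speed against overshoot.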


Torque Responsive Metamaterials Enable High Payload Soft Robot Arms

Good, Ian, Balaji, Srivatsan, Oh, David, Thomas, Sawyer, Lipton, Jeffrey I.

arXiv.org Artificial Intelligence

Soft robots have struggled to support large forces and moments while also supporting their own weight against gravity. This limits their ability to reach certain configurations necessary for tasks such as inspection and pushing objects up. We have overcome this limitation by creating an electrically driven metamaterial soft arm using handed shearing auxetics (HSA) and bendable extendable torque resistant (BETR) shafts. These use the large force and torque capacity of HSAs and the nestable torque transmission of BETRs to create a strong soft arm. We found that the HSA arm was able to push 2.3 kg vertically and lift more than 600 g when positioned horizontally, supporting 0.33 Nm of torque at the base. The arm is able to move between waypoints while carrying the large payload and demonstrates consistent movement with path variance below 5 mm. The HSA arm's ability to perform active grasping with HSA grippers was also demonstrated, requiring 20 N of pull force to dislodge the object. Finally, we test the arm in a pipe inspection task. The arm is able to locate all the defects while sliding against the inner surface of the pipe, demonstrating its compliance.


Torsion Resistant Strain Limiting Layers Enable High Grip Strength of Electrically-Driven Handed Shearing Auxetic Grippers

Good, Ian, Balaji, Srivatsan, Lipton, Jeffrey I.

arXiv.org Artificial Intelligence

Soft grippers have demonstrated a strong ability to successfully pick and manipulate many objects. A key limitation to their wider adoption is their inability to grasp larger payloads due to objects slipping out of grasps. We have overcome this limitation by introducing a torsionally rigid strain limiting layer (TR-SLL). This reduces out-of-plane bending while maintaining the gripper's softness and in-plane flexibility. We characterize the design space of the strain limiting layer and Handed Shearing Auxetic (HSA) actuators for a soft gripper using simulation and experiment. The inclusion of the TR-SLL with HSAs enables HSA grippers to be made with a single digit. We found that the use of our TR-SLL HSA gripper enabled pinch grasping of payloads over 1 kg. We demonstrate a lifting capacity of 5 kg when loading using the TR-SLL. We also demonstrate a peak pinch grasp force of 5.8 N and a peak planar caging force of 14.5 N. Finally, we test the TR-SLL gripper on a suite of 43 YCB objects and show success on 37 objects, demonstrating significant capabilities. Soft robotic fingers have focused on emulating the bending compliance of humans and other biota [1]. However, the key to humans' remarkable grip is that our fingers can simultaneously bend while resisting torsion and lateral loading. People rely on a rigid skeleton with discrete joints to provide this selective compliance. We build upon a previous conference paper that introduced the torsion resistant strain limiting layer (TR-SLL) [2]. The TR-SLL provides soft grippers with the same torsion resistance as a skeleton, without discretization. This work extends that design to entirely electrically driven grippers, allowing a single Handed Shearing Auxetic (HSA) to be used in a gripper and produce a high holding force.
The TR-SLLs constrain bending and serve as a reaction body for the HSA.


Accelerating Complex Disease Treatment through Network Medicine and GenAI: A Case Study on Drug Repurposing for Breast Cancer

Hamed, Ahmed Abdeen, Fandy, Tamer E.

arXiv.org Artificial Intelligence

The objective of this research is to introduce a network specialized in predicting drugs that can be repurposed by investigating real-world evidence sources, such as clinical trials and biomedical literature. Specifically, it aims to generate drug combination therapies for complex diseases (e.g., cancer, Alzheimer's). We present a multilayered network medicine approach, empowered by a highly configured ChatGPT prompt engineering system, which is constructed on the fly to extract drug mentions in clinical trials. Additionally, we introduce a novel algorithm that connects real-world evidence with disease-specific signaling pathways (e.g., KEGG database). This sheds light on the repurposability of drugs if they are found to bind with one or more protein constituents of a signaling pathway. To demonstrate, we instantiated the framework for breast cancer and found that, out of 46 breast cancer signaling pathways, the framework identified 38 pathways that were covered by at least two drugs. This evidence signals the potential for combining those drugs. Specifically, the most covered signaling pathway, ID hsa:2064, was covered by 108 drugs, some of which can be combined. Conversely, the signaling pathway ID hsa:1499 was covered by only two drugs, indicating a significant gap for further research. Our network medicine framework, empowered by GenAI, shows promise in identifying drug combinations with a high degree of specificity, knowing the exact signaling pathways and proteins that serve as targets. It is noteworthy that ChatGPT successfully accelerated the process of identifying drug mentions in clinical trials, though further investigations are required to determine the relationships among the drug mentions.
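The pathway-coverage step described above (a pathway is a combination-therapy candidate when at least two drugs bind its protein constituents) can be sketched directly. The function name and the toy drug/protein data are illustrative; only the pathway IDs `hsa:2064` and `hsa:1499` come from the abstract.

```python
def pathway_coverage(drug_targets, pathways):
    """For each pathway, list the drugs that bind at least one of its
    protein constituents. Pathways covered by two or more drugs are
    candidates for combination therapy; single-drug pathways mark gaps."""
    coverage = {}
    for pid, proteins in pathways.items():
        coverage[pid] = [drug for drug, targets in drug_targets.items()
                         if targets & proteins]  # non-empty set intersection
    return coverage

# Toy data (drug and protein names are made up for illustration).
drug_targets = {"drugA": {"P1"}, "drugB": {"P1", "P2"}, "drugC": {"P3"}}
pathways = {"hsa:2064": {"P1"}, "hsa:1499": {"P2"}}
```

Here `hsa:2064` would be covered by two drugs (a combination candidate) and `hsa:1499` by one, mirroring the well-covered vs. under-covered pathways the abstract contrasts.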


UNIQA: A Unified Framework for Both Full-Reference and No-Reference Image Quality Assessment

Yun, Yi Ke, Lin, Weisi

arXiv.org Artificial Intelligence

The human visual system (HVS) is effective at distinguishing low-quality images due to its ability to sense the distortion level and the resulting semantic impact. Prior research develops dedicated networks for the cases where a pristine reference image is or is not available, which limits application scope and can cause performance inconsistency when switching between No-Reference (NR) and Full-Reference (FR) IQA. In addition, most methods heavily rely on spatial distortion modeling through difference maps or weighted features, which may not well capture the correlations between distortion and the semantic impact it causes. To this end, we aim to design a unified network for both FR and NR IQA via semantic impact modeling. Specifically, we employ an encoder to extract multi-level features from input images. Then a Hierarchical Self-Attention (HSA) module is proposed as a universal adapter for both FR and NR inputs to model the spatial distortion level at each encoder stage. Furthermore, considering that distortions contaminate encoder stages and damage image semantic meaning differently, a Cross-Scale Cross-Attention (CSCA) module is proposed to examine correlations between distortion at shallow stages and deep ones. By adopting HSA and CSCA, the proposed network can effectively perform both FR and NR IQA. Extensive experiments demonstrate that the proposed simple network is effective and outperforms the relevant state-of-the-art FR and NR methods on four synthetic-distortion datasets and three authentic-distortion datasets.
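The cross-scale idea (letting shallow-stage features query deep-stage features) follows the standard cross-attention pattern. The sketch below is a generic single-head version under that assumption, not the paper's CSCA module; the function name and shapes are illustrative.

```python
import numpy as np

def cross_scale_attention(shallow, deep):
    """Toy single-head cross-attention: queries come from shallow-stage
    features, keys/values from deep-stage features, so each shallow
    position can examine correlated information at deeper stages.
    shallow: (N1, d), deep: (N2, d) -> output: (N1, d)."""
    d = shallow.shape[-1]
    scores = shallow @ deep.T / np.sqrt(d)          # (N1, N2)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)              # rows sum to 1
    return w @ deep
```

Each output row is a convex combination of the deep-stage features, so shallow positions are re-expressed in terms of the deeper, more semantic representation.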