FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Neural Information Processing Systems

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware: accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.
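The tiling idea can be illustrated with a small NumPy sketch (our own illustration, not the paper's CUDA kernel): key/value blocks are streamed through an online softmax, so partial outputs are rescaled as new blocks arrive and the full N x N score matrix is never materialised.

```python
import numpy as np

def attention_reference(Q, K, V):
    """Standard attention: materialises the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def attention_tiled(Q, K, V, block=4):
    """Tiled attention with an online softmax, in the spirit of
    FlashAttention: K/V are streamed in blocks of `block` rows."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q, dtype=float)
    m = np.full(N, -np.inf)   # running row-wise max of scores
    l = np.zeros(N)           # running softmax normaliser
    for j in range(0, N, block):
        Kj, Vj = K[j:j+block], V[j:j+block]
        S = Q @ Kj.T * scale                  # N x block score tile
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])
        alpha = np.exp(m - m_new)             # rescale earlier partial sums
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]
```

On GPU the same rescaling trick is what lets each tile live entirely in SRAM; here it simply shows that the blocked recurrence reproduces the exact softmax.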


Discriminately Treating Motion Components Evolves Joint Depth and Ego-Motion Learning

Zhang, Mengtan, Guo, Zizhan, Zhao, Hongbo, Feng, Yi, Xiong, Zuyi, Wang, Yue, Du, Shaoyi, Wang, Hanli, Fan, Rui

arXiv.org Artificial Intelligence

Unsupervised learning of depth and ego-motion, two fundamental 3D perception tasks, has made significant strides in recent years. However, most methods treat ego-motion as an auxiliary task, either mixing all motion types or excluding depth-independent rotational motions in supervision. Such designs limit the incorporation of strong geometric constraints, reducing reliability and robustness under diverse conditions. This study introduces a discriminative treatment of motion components, leveraging the geometric regularities of their respective rigid flows to benefit both depth and ego-motion estimation. Given consecutive video frames, network outputs first align the optical axes and imaging planes of the source and target cameras. Optical flows between frames are transformed through these alignments, and deviations are quantified to impose geometric constraints individually on each ego-motion component, enabling more targeted refinement. These alignments further reformulate the joint learning process into coaxial and coplanar forms, where depth and each translation component can be mutually derived through closed-form geometric relationships, introducing complementary constraints that improve depth robustness. DiMoDE, a general depth and ego-motion joint learning framework incorporating these designs, achieves state-of-the-art performance on multiple public datasets and a newly collected diverse real-world dataset, particularly under challenging conditions. Our source code will be publicly available at mias.group/DiMoDE upon publication.
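As a toy illustration of such closed-form geometric relationships (our own simplified derivation under a 1D pinhole model, not the paper's formulation): a pure coaxial translation t_z along the optical axis moves each pixel radially, and depth can be recovered in closed form from the pixel's position before and after the motion.

```python
import numpy as np

def depth_from_coaxial_flow(u0, u1, tz, c=0.0):
    """Depth from the radial flow induced by a pure forward translation.

    Pinhole model: u = f*X/Z + c.  Moving the camera forward by tz gives
    u1 - c = f*X/(Z - tz), so (u0 - c)/(u1 - c) = 1 - tz/Z and hence
    Z = tz / (1 - (u0 - c)/(u1 - c)).
    u0, u1: pixel coordinate of the same point before/after the motion.
    """
    r0, r1 = u0 - c, u1 - c
    return tz / (1.0 - r0 / r1)
```

A sanity check: with focal length 100, a point at X = 0.5, Z = 5 projects to u0 = 10; after moving forward by tz = 1 it projects to u1 = 12.5, and the formula returns Z = 5.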



DataDemo-Paper

Neural Information Processing Systems

For example, here is a printout of the attributes of a single battery: VOLTAGE: 11.1, WEIGHT: 0.23, WIDTH: 35.0. Design Sequence: as described in the paper, we convert the design tree into a design sequence, which we find useful for sequence-based machine learning approaches; here we show an example sequence that corresponds to the tree above. Point Cloud: one can easily convert the STL file into a point cloud; here is one we made earlier, which we have included in the data set.
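A minimal sketch of the STL-to-point-cloud conversion mentioned above, assuming the mesh has already been loaded as a (T, 3, 3) array of triangle vertex coordinates (the STL loader itself is not shown):

```python
import numpy as np

def sample_point_cloud(triangles, n_points, rng=None):
    """Uniformly sample a point cloud from a triangle mesh.

    triangles: (T, 3, 3) array; triangles[t, k] is vertex k of triangle t.
    Triangles are chosen proportionally to their area, then a point is
    drawn uniformly inside each chosen triangle.
    """
    rng = np.random.default_rng(rng)
    a, b, c = triangles[:, 0], triangles[:, 1], triangles[:, 2]
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    idx = rng.choice(len(triangles), size=n_points, p=areas / areas.sum())
    # uniform barycentric sampling via the square-root trick
    r1, r2 = rng.random(n_points), rng.random(n_points)
    s = np.sqrt(r1)
    w = np.stack([1 - s, s * (1 - r2), s * r2], axis=1)
    return np.einsum('nk,nkd->nd', w, triangles[idx])
```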


Rethinking Thinking Tokens: LLMs as Improvement Operators

Madaan, Lovish, Didolkar, Aniket, Gururangan, Suchin, Quan, John, Silva, Ruan, Salakhutdinov, Ruslan, Zaheer, Manzil, Arora, Sanjeev, Goyal, Anirudh

arXiv.org Artificial Intelligence

Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts" with a continuum of possible strategies. We identify an interesting inference family Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (hence compute cost) is controllable via degree of parallelism, and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR) (iteratively improve a single candidate answer) which provides performance superior to long CoT. Success of such model orchestrations raises the question whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).
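The three PDR steps can be sketched as a plain orchestration loop; `model(prompt)` below is a stand-in for any text-generation call (an assumption, not a real API), and setting `parallelism=1` degenerates to Sequential Refinement (SR):

```python
def parallel_distill_refine(task, model, rounds=2, parallelism=4):
    """Sketch of the Parallel-Distill-Refine (PDR) inference loop.

    Each round: (i) generate diverse drafts, (ii) distill them into a
    bounded textual workspace, (iii) refine conditioned only on that
    workspace.  Context per call is bounded by the workspace size, not
    by the total number of tokens generated across rounds.
    """
    workspace = ""
    for _ in range(rounds):
        # (i) generate diverse drafts (in parallel, conceptually)
        drafts = [model(f"{task}\nWorkspace:\n{workspace}\nDraft an answer.")
                  for _ in range(parallelism)]
        # (ii) distill the drafts into a short workspace
        workspace = model("Summarise the key ideas of:\n" + "\n---\n".join(drafts))
        # (iii) refine: answer conditioned on the workspace, seeding the next round
        answer = model(f"{task}\nUsing these notes:\n{workspace}\nFinal answer:")
    return answer
```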


A Gapped Scale-Sensitive Dimension and Lower Bounds for Offset Rademacher Complexity

Jia, Zeyu, Polyanskiy, Yury, Rakhlin, Alexander

arXiv.org Machine Learning

The celebrated Vapnik-Chervonenkis dimension vc(F) of a binary-valued function class F and the scale-sensitive dimension vc(F, α) of a real-valued function class F are central notions in the study of empirical processes and convergence of statistical learning methods [VC71, BLW94, KS94]. Sequential analogues of these notions, the Littlestone dimension ldim(F) and the sequential scale-sensitive dimension sfat(F, α), have been shown to play an analogously central role in the study of uniform martingale laws and online prediction [Lit88, BDPSS09, RST10]. In this paper, we study "gapped" versions of vc(F, α) and sfat(F, α). The modification yields a dimension that is no larger than the original one, yet can still be shown to control covering numbers in both sequential and non-sequential cases. More importantly, the new notion gives us a more precise control on the functions involved in "shattering" and thus yields non-vacuous lower bounds for offset Rademacher complexities for any uniformly bounded class, both in the classical and sequential cases, and, as a consequence, tighter lower bounds for online prediction problems, such as online regression or transductive learning. Our definition in the non-sequential case can also be seen as a modification of the Natarajan dimension [NT88, Nat89], and was, in fact, introduced in [AB00]. We first motivate the development in this paper on the simpler case of non-sequential data. We start by recalling the definition of the Vapnik-Chervonenkis dimension and its scale-sensitive version.
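For reference, the standard (non-gapped) definitions the last sentence alludes to can be stated as follows (textbook phrasing, not quoted from the paper):

```latex
% Binary case: F shatters x_1,...,x_d if every sign pattern is realised.
\mathrm{vc}(\mathcal{F}) = \max\bigl\{ d : \exists\, x_1,\dots,x_d \;
  \forall \epsilon \in \{0,1\}^d \; \exists f \in \mathcal{F},\;
  f(x_i) = \epsilon_i \ \forall i \bigr\}

% Scale-sensitive case: alpha-shattering additionally requires
% witness levels s_1,...,s_d that f clears by a margin of alpha/2.
\mathrm{vc}(\mathcal{F}, \alpha) = \max\bigl\{ d : \exists\, x_1,\dots,x_d,\;
  s_1,\dots,s_d \; \forall \epsilon \in \{\pm 1\}^d \; \exists f \in \mathcal{F},\;
  \epsilon_i \bigl(f(x_i) - s_i\bigr) \ge \alpha/2 \ \forall i \bigr\}
```

The "gapped" variants studied in the paper constrain the shattering functions further, which is what makes the resulting dimension smaller.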


A Weighted U Statistic for Genetic Association Analyses of Sequencing Data

Wei, Changshuai, Li, Ming, He, Zihuai, Vsevolozhskaya, Olga, Schaid, Daniel J., Lu, Qing

arXiv.org Artificial Intelligence

With advancements in next-generation sequencing technology, a massive amount of sequencing data is being generated, offering a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, this poses a great challenge for the statistical analysis of high-dimensional sequencing data. Association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a weighted U statistic, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a non-parametric U statistic, WU-SEQ makes no assumption about the underlying disease model or phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed the commonly used SKAT method when the underlying assumptions were violated (e.g., when the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained performance comparable to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS) and detected an association between ANGPTL4 and very low-density lipoprotein cholesterol.
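In general form, a second-order weighted U statistic averages a phenotype-similarity kernel over all subject pairs, weighted by genotypic similarity. A minimal sketch with illustrative placeholder kernels (WU-SEQ's actual kernels and weighting scheme differ):

```python
from itertools import combinations

def weighted_u_statistic(y, G, phenotype_kernel, genotype_weight):
    """Generic second-order weighted U statistic.

    y: sequence of phenotypes; G: sequence of genotype vectors.
    Averages phenotype_kernel(y_i, y_j) over all pairs i < j,
    weighted by genotype_weight(G_i, G_j).
    """
    n = len(y)
    total = 0.0
    for i, j in combinations(range(n), 2):
        total += genotype_weight(G[i], G[j]) * phenotype_kernel(y[i], y[j])
    return total / (n * (n - 1) / 2)
```

Because the statistic is built only from pairwise comparisons, no parametric model links genotype to phenotype, which is what gives the non-parametric robustness described above.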


Cross-Subject and Cross-Montage EEG Transfer Learning via Individual Tangent Space Alignment and Spatial-Riemannian Feature Fusion

Lai-Tan, Nicole, Gu, Xiao, Philiastides, Marios G., Deligianni, Fani

arXiv.org Artificial Intelligence

Personalised music-based interventions offer a powerful means of supporting motor rehabilitation by dynamically tailoring auditory stimuli to provide external timekeeping cues, modulate affective states, and stabilise gait patterns. Generalisable Brain-Computer Interfaces (BCIs) thus hold promise for adapting these interventions across individuals. However, inter-subject variability in EEG signals, further compounded by movement-induced artefacts and motor planning differences, hinders the generalisability of BCIs and results in lengthy calibration processes. We propose Individual Tangent Space Alignment (ITSA), a novel pre-alignment strategy incorporating subject-specific recentering, distribution matching, and supervised rotational alignment to enhance cross-subject generalisation. Using leave-one-subject-out cross-validation, ITSA demonstrates significant performance improvements across subjects and conditions. The parallel fusion approach shows the greatest enhancement over its sequential counterpart, with robust performance maintained across varying data conditions and electrode configurations. The code will be made publicly available at the time of publication. Brain-computer interfaces (BCIs) are effective tools for motor rehabilitation and for understanding the effects of musical stimuli on motor function [1]-[4]. In stroke rehabilitation, BCIs decode the user's intention from brain electrical activity to provide sensorimotor feedback and enable control of external devices or motor functions [5], [6]. BCI strategies for motor rehabilitation are grouped into assistive and rehabilitative: the former bypasses damaged neuronal pathways to provide alternative control of external devices, whereas the latter aims to exploit neuroplasticity by promoting the recovery of damaged pathways and thereby restoring impaired motor functions [5]. Electroencephalography signals are often used as the input of BCIs because they provide portable, non-invasive, low-cost solutions with high temporal resolution [7].
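As a rough illustration of the recentering component (one of ITSA's three steps), trial covariance matrices can be whitened by a subject-level mean covariance so that every subject is centred at the identity. This is a generic Riemannian-transfer-learning sketch, using a Euclidean mean for simplicity, not the authors' exact procedure:

```python
import numpy as np

def _sqrtm_inv(M):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V / np.sqrt(w)) @ V.T

def recenter_covariances(covs):
    """Subject-specific recentering on the SPD manifold.

    covs: (n_trials, c, c) array of trial covariance matrices for one
    subject.  Whitening by the subject mean maps that mean to the
    identity, removing subject-specific shifts before any cross-subject
    alignment.  ITSA's full pipeline adds distribution matching and
    supervised rotational alignment on top of this step.
    """
    mean = covs.mean(axis=0)
    W = _sqrtm_inv(mean)
    return np.array([W @ C @ W for C in covs])
```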