Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim (Massachusetts Institute of Technology; Soochow University)

Neural Information Processing Systems

Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as a viable linear-time alternative to transformers with softmax attention. However, these models still underperform transformers, especially on tasks that require in-context retrieval. While more expressive variants of linear transformers which replace the additive update in linear transformers with the delta rule [DeltaNet; 101] have been found to be more effective at associative recall, existing algorithms for training such models do not parallelize over sequence length and are thus inefficient to train on modern hardware. This work describes a hardware-efficient algorithm for training linear transformers with the delta rule, which exploits a memory-efficient representation for computing products of Householder matrices [11]. This algorithm allows us to scale up DeltaNet to standard language modeling settings. We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines such as Mamba [31] and GLA [124] in terms of perplexity and zero-shot performance on downstream tasks. We also experiment with two hybrid models which combine DeltaNet layers with (1) sliding-window attention layers every other layer or (2) two global attention layers, and find that these hybrids outperform strong transformer baselines.
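
To make the contrast between the additive update and the delta rule concrete, here is a minimal sketch in plain Python. It is not the hardware-efficient parallel algorithm the abstract describes; it is only the naive token-by-token recurrence, written under the usual formulation of the delta rule (state S, key k, value v, write strength beta). The helper names `matvec`, `additive_update`, `delta_update`, and `run`, and the toy keys and values, are illustrative assumptions, not taken from the paper.

```python
# Naive sequential recurrences (illustrative only, NOT the paper's parallel algorithm).
# S is a d_v x d_k memory matrix stored as a list of rows.

def matvec(S, x):
    """Read from memory: return S @ x."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]

def additive_update(S, k, v):
    """Linear-attention style write: S <- S + v k^T."""
    return [[S[i][j] + v[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]

def delta_update(S, k, v, beta):
    """Delta-rule write: S <- S + beta * (v - S k) k^T.

    The value currently stored under k is read first, and only the
    prediction error (v - S k) is written back.
    """
    v_old = matvec(S, k)
    return [[S[i][j] + beta * (v[i] - v_old[i]) * k[j] for j in range(len(k))]
            for i in range(len(v))]

def run(update, pairs, d_v, d_k):
    """Apply an update rule to a sequence of (key, value) pairs, one token at a time."""
    S = [[0.0] * d_k for _ in range(d_v)]
    for k, v in pairs:
        S = update(S, k, v)
    return S

# Toy associative-recall case: the same unit-norm key is written twice
# with different values (hypothetical data).
key = [1.0, 0.0]
pairs = [(key, [1.0, 2.0]), (key, [5.0, -1.0])]

S_add = run(additive_update, pairs, d_v=2, d_k=2)
S_delta = run(lambda S, k, v: delta_update(S, k, v, beta=1.0), pairs, d_v=2, d_k=2)

print(matvec(S_add, key))    # [6.0, 1.0]: both values are summed together
print(matvec(S_delta, key))  # [5.0, -1.0]: the newer value replaces the older one
```

With a unit-norm key and beta set to 1, the delta rule overwrites the stored association, whereas the additive update accumulates both values under the same key. The loop in `run` is strictly sequential over tokens, which is the training bottleneck the abstract refers to.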


Agent Planning with World Knowledge Model

Neural Information Processing Systems

Imitating humans' mental world knowledge model which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper, we introduce parametric World Knowledge Model (WKM) to facilitate agent


Supplementary Material and Datasheet: Off to new Shores: A Dataset & Benchmark for (near-)coastal Flood Inundation Forecasting

Neural Information Processing Systems

This supplementary document follows the Datasheets for Datasets template of (8) to document the Global Flood Forecasting (GFF) dataset and its creation. Further resources are provided in the accompanying publication (https://arxiv.org/abs/2409.18591) and in the GitHub repository (https://github.com/Multihuntr/GFF).



Multi-Group Proportional Representation in Retrieval

Neural Information Processing Systems

Current approaches to mitigate these representational harms balance the number of retrieved items across population groups defined by a small number of (often binary) attributes. However, most existing methods overlook intersectional groups determined by combinations of group attributes, such as gender, race, and ethnicity.



The Shutdown Is Pushing Air Safety Workers to the Limit

WIRED

Federal employees say that flying is still safe despite the strain on air traffic controllers. But expect even more airport delays ahead. It hasn't been a good year for federal aviation safety workers. January saw the worst US commercial airline disaster in decades, quickly followed by sudden layoffs, staffing shortfalls, major technology glitches at one of the nation's busiest airports, and short timelines to rebuild the systems that govern national airspace. It somehow got worse this month, when a stalemate between congressional Republicans and Democrats led to a government shutdown.