limit
Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models
The Transformer architecture has become the fundamental building block of widely used natural language processing (NLP) models. With the trend toward large NLP models, increasing memory and computation costs hinder their efficient deployment on resource-limited devices. Therefore, transformer quantization attracts wide research interest. Recent work recognizes that structured outliers are the critical bottleneck for quantization performance. However, the methods proposed there increase the computation overhead and still leave the outliers in place. To fundamentally address this problem, this paper delves into the inherent inducement and importance of the outliers. We discover that $\boldsymbol \gamma$ in LayerNorm (LN) acts as a sinful amplifier for the outliers, and that the importance of outliers varies greatly: some outliers provided by a few tokens cover a large area but can be clipped sharply without negative impacts. Motivated by these findings, we propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping.
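The abstract names Gamma Migration without describing it. One plausible reading, sketched below purely as an assumption, is that the LayerNorm scale $\boldsymbol \gamma$ is removed from the normalization output and folded into the weights of the following linear layer, which leaves the result unchanged because $\gamma \odot \hat{x} + \beta = \gamma \odot (\hat{x} + \beta / \gamma)$. All function names and values here are hypothetical, the sketch requires nonzero $\gamma$, and it ignores any residual branch.

```python
import math

def _normalize(x, eps=1e-5):
    # Zero-mean, unit-variance normalization over the feature dimension.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in x]

def layer_norm(x, gamma, beta):
    # Standard LayerNorm: gamma * x_hat + beta.
    return [g * v + b for v, g, b in zip(_normalize(x), gamma, beta)]

def non_scaling_layer_norm(x, gamma, beta):
    # Assumed variant: gamma is not applied here, only the shift beta / gamma.
    return [v + b / g for v, g, b in zip(_normalize(x), gamma, beta)]

def linear(W, x):
    # Plain matrix-vector product, one output per row of W.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def migrate_gamma(W, gamma):
    # Fold gamma into the input columns of the next layer's weight matrix.
    return [[w * g for w, g in zip(row, gamma)] for row in W]

# Hypothetical values; the last channel has a large gamma.
x = [1.0, 2.0, 3.0, 10.0]
gamma = [1.0, 0.5, 2.0, 8.0]
beta = [0.0, 0.1, -0.2, 0.3]
W = [[0.2, -0.1, 0.4, 0.05], [1.0, 0.3, -0.5, 0.1]]

original = linear(W, layer_norm(x, gamma, beta))
migrated = linear(migrate_gamma(W, gamma), non_scaling_layer_norm(x, gamma, beta))
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original, migrated)))
```

Under this reading, the activation that would be quantized is the non-scaling output, which no longer carries the per-channel amplification by $\boldsymbol \gamma$.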
Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point
In this paper, we explore the limits of Microsoft Floating Point (MSFP), a new class of datatypes developed for production cloud-scale inferencing on custom hardware. Through the co-evolution of hardware design and algorithms, MSFP achieves accuracy comparable to or better than industry standards Bfloat16 and INT8 at 3x and 4x lower cost, respectively. MSFP incurs negligible impact to accuracy (<1%), requires no changes to the model topology, and is integrated with a mature cloud production pipeline. MSFP supports various classes of deep learning models including CNNs, RNNs, and Transformers without modification. Finally, we characterize the accuracy and implementation of MSFP and demonstrate its efficacy on a number of production scenarios, including models that power major online scenarios such as web search, question-answering, and image classification.
Limits on Testing Structural Changes in Ising Models
We present novel information-theoretic limits on detecting sparse changes in Ising models, a problem that arises in many applications where network changes can occur due to some external stimuli. We show that the sample complexity for detecting sparse changes, in a minimax sense, is no better than learning the entire model even in settings with local sparsity. This is a surprising fact in light of prior work rooted in sparse recovery methods, which suggest that sample complexity in this context scales only with the number of network changes. To shed light on when change detection is easier than structured learning, we consider testing of edge deletion in forest-structured graphs, and high-temperature ferromagnets as case studies. We show for these that testing of small changes is similarly hard, but testing of large changes is well-separated from structure learning. These results imply that testing of graphical models may not be amenable to concepts such as restricted strong convexity leveraged for sparsity pattern recovery, and algorithm development instead should be directed towards detection of large changes.
Exploring the Limits of Out-of-Distribution Detection
Near out-of-distribution detection (OOD) is a major challenge for deep neural networks. We demonstrate that large-scale pre-trained transformers can significantly improve the state-of-the-art (SOTA) on a range of near OOD tasks across different data modalities. For instance, on CIFAR-100 vs CIFAR-10 OOD detection, we improve the AUROC from 85% (current SOTA) to more than 96% using Vision Transformers pre-trained on ImageNet21k. On a challenging genomics OOD detection benchmark, we improve the AUROC from 66% to 77% using transformer and unsupervised pre-training. To further improve performance, we explore the few-shot outlier exposure setting where a few examples from outlier classes may be available; we show that pre-trained transformers are particularly well-suited for outlier exposure, and that the AUROC of OOD detection on CIFAR-100 vs CIFAR-10 can be improved to 98.7% with just 1 image per OOD class, and 99.46% with 10 images per OOD class. For multi-modal image-text pre-trained transformers such as CLIP, we explore a new way of using just the names of outlier classes as a sole source of information without any accompanying images, and show that this outperforms previous SOTA on standard OOD benchmark tasks.
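For readers unfamiliar with the AUROC figures quoted above, the metric can be read as the probability that a randomly chosen OOD example receives a higher outlier score than a randomly chosen in-distribution example. The sketch below computes it by direct pairwise comparison; the scores are made-up illustrative values, not results from the paper, and the function name is hypothetical.

```python
def auroc(in_scores, out_scores):
    """Fraction of (OOD, in-distribution) pairs ranked correctly; ties count half."""
    wins = 0.0
    for o in out_scores:
        for i in in_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(in_scores) * len(out_scores))

# Hypothetical outlier scores (higher means "more likely OOD").
in_dist = [0.1, 0.2, 0.3, 0.4]
ood = [0.35, 0.6, 0.7, 0.9]
print(auroc(in_dist, ood))  # 15 of 16 pairs are ordered correctly -> 0.9375
```

An AUROC of 0.5 corresponds to scores that carry no information, and 1.0 to perfect separation, which is why the reported move from 85% to above 96% is a large change.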
The Limits of Post-Selection Generalization
While statistics and machine learning offer numerous methods for ensuring generalization, these methods often fail in the presence of post selection, the common practice in which the choice of analysis depends on previous interactions with the same dataset. A recent line of work has introduced powerful, general purpose algorithms that ensure a property called post hoc generalization (Cummings et al., COLT'16), which says that no person when given the output of the algorithm should be able to find any statistic for which the data differs significantly from the population it came from. In this work we show several limitations on the power of algorithms satisfying post hoc generalization. First, we show a tight lower bound on the error of any algorithm that satisfies post hoc generalization and answers adaptively chosen statistical queries, showing a strong barrier to progress in post selection data analysis. Second, we show that post hoc generalization is not closed under composition, despite many examples of such algorithms exhibiting strong composition properties.
The Shutdown Is Pushing Air Safety Workers to the Limit
Federal employees say that flying is still safe despite the strain on air traffic controllers. But expect even more airport delays ahead. It hasn't been a good year for federal aviation safety workers. January saw the worst US commercial airline disaster in decades, quickly followed by sudden layoffs, staffing shortfalls, major technology glitches at one of the nation's busiest airports, and short timelines to rebuild the systems that govern national airspace. It somehow got worse this month, when a stalemate between congressional Republicans and Democrats led to a government shutdown.
- Asia > North Korea (0.05)
- Asia > China (0.05)
- Africa (0.05)
- Transportation > Infrastructure & Services (1.00)
- Transportation > Air (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
Efficient Test-Time Scaling for Small Vision-Language Models
Kaya, Mehmet Onurcan, Elliott, Desmond, Papadopoulos, Dim P.
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
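As a rough illustration of aggregating outputs at the token level, the sketch below averages the next-token probability distributions obtained from several augmented versions of one input and then picks the most likely token. The abstract does not state which aggregation rule TTAug uses, so simple averaging is an assumption here, and the function name and numbers are hypothetical.

```python
def aggregate_token_level(per_aug_probs):
    """Average next-token distributions across augmentations and pick the argmax.

    per_aug_probs: one probability list over the vocabulary per augmented input.
    Averaging is an assumed aggregation rule, not necessarily the paper's.
    """
    n_aug = len(per_aug_probs)
    vocab_size = len(per_aug_probs[0])
    mean = [sum(p[v] for p in per_aug_probs) / n_aug for v in range(vocab_size)]
    best = max(range(vocab_size), key=lambda v: mean[v])
    return mean, best

# Three hypothetical augmented views of one input, vocabulary of three tokens.
views = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]
mean_probs, token = aggregate_token_level(views)
print(token)  # token 1 wins after averaging, although view 0 alone prefers token 0
```

In an autoregressive decoder this step would be repeated for every generated token, with the chosen token fed back to all augmented views; no model parameters change.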
Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem
Recent work has documented striking heterogeneity in the performance of state-of-the-art vision language models (VLMs), including both multimodal language models and text-to-image models. These models are able to describe and generate a diverse array of complex, naturalistic images, yet they exhibit surprising failures on basic multi-object reasoning tasks -- such as counting, localization, and simple forms of visual analogy -- that humans perform with near perfect accuracy. To better understand this puzzling pattern of successes and failures, we turn to theoretical accounts of the binding problem in cognitive science and neuroscience, a fundamental problem that arises when a shared set of representational resources must be used to represent distinct entities (e.g., to represent multiple objects in an image), necessitating the use of serial processing to avoid interference. We find that many of the puzzling failures of state-of-the-art VLMs can be explained as arising due to the binding problem, and that these failure modes are strikingly similar to the limitations exhibited by rapid, feedforward processing in the human brain.
The Limits of Transfer Reinforcement Learning with Latent Low-rank Structure
Many reinforcement learning (RL) algorithms are too costly to use in practice due to the large sizes S, A of the problem's state and action spaces. To resolve this issue, we study transfer RL with latent low rank structure. We consider the problem of transferring a latent low rank representation when the source and target MDPs have transition kernels with Tucker rank (S, d, A), (S, S, d), (d, S, A), or (d, d, d). In each setting, we introduce the transfer-ability coefficient $\alpha$ that measures the difficulty of representational transfer. Our algorithm learns latent representations in each source MDP and then exploits the linear structure to remove the dependence on S, A, or SA in the target MDP regret bound.