reordering
VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking
Yang, Kichang, Kim, Seonjun, Kim, Minjae, Zhang, Nairan, Zhang, Chi, Lee, Youngki
Edge deployment of large Vision-Language Models (VLMs) increasingly relies on flash-based weight offloading, where activation sparsification is used to reduce I/O overhead. However, conventional sparsification remains model-centric, selecting neurons solely by activation magnitude and neglecting how access patterns influence flash performance. We present Neuron Chunking, an I/O-efficient sparsification strategy that operates on chunks (i.e., groups of contiguous neurons in memory) and couples neuron importance with storage access cost. The method models I/O latency through a lightweight abstraction of access contiguity and selects chunks with high utility, defined as neuron importance normalized by estimated latency. By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65x and 5.76x on Jetson Orin Nano and Jetson AGX Orin, respectively.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
- Asia > China > Beijing > Beijing (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > Texas > Harris County > Houston (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (7 more...)
Exploiting Primacy Effect To Improve Large Language Models
Raimondi, Bianca, Gabbrielli, Maurizio
Large Language Models (LLMs) have become essential in many Natural Language Processing (NLP) tasks, leveraging extensive pre-training and fine-tuning to achieve high accuracy. However, like humans, LLMs exhibit biases, particularly positional biases such as primacy and recency effects, which can influence the accuracy of the answers. The primacy effect-where items presented first are more likely to be remembered or selected-plays a key role in Multiple Choice Question Answering (MCQA), where the order of answer options can affect prediction outcomes. This study focuses on primacy bias in fine-tuned LLMs: We first show that fine-tuning amplifies this bias, probably due to exposure to human-like patterns. Hence, we strategically leverage this effect by reordering response options based on semantic similarity to the query, without requiring knowledge of the correct answer. Our experimental results show that this approach significantly improves performance in MCQA. More generally, our findings underscore the dual nature of biases as both challenges and opportunities, offering insights for bias-aware model design and NLP applications.
- Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.76)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Asia > Singapore (0.04)
Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering
Hong, Ke, Li, Xiuhong, Liu, Minxu, Mao, Qiuli, Wu, Tianqi, Huang, Zixiao, Chen, Lufang, Wang, Zhong, Zhang, Yichong, Zhu, Zhenhua, Dai, Guohao, Wang, Yu
Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. Inter-GPU communication becomes a bottleneck in multi-GPU computing systems, particularly on consumer-grade GPUs. By exploiting concurrent hardware execution, overlapping computation and communication latency becomes an effective technique for mitigating the communication overhead. We identify that an efficient and adaptable overlapping design should satisfy (1) tile-wise overlapping to maximize the overlapping opportunity, (2) interference-free computation to maintain the original computational performance, and (3) communication agnosticism to reduce the development burden against varying communication primitives. Nevertheless, current designs fail to simultaneously optimize for all of those features. To address the issue, we propose FlashOverlap, which utilizes a novel signaling mechanism: when part of the output finishes, the computation kernel sends a signal to trigger the communication of that part, while continuing the computation of the remaining part (interference-free computation). Consequently, the communication of the finished part and the computation of the remaining part can be overlapped. On top of the signaling mechanism, FlashOverlap comprises two key components: (1) the determination of the signaling timing to boost the overlap efficiency (tile-wise overlapping), and (2) a pre-communication reordering to create the contiguous address for finished data, enabling communication by simply calling NCCL APIs (communication agnosticism), and a post-communication reordering to correct the data order. Experiments show that FlashOverlap achieves up to 1.65x speedup through overlap, outperforming existing works in most cases. Code is available at https://github.com/infinigence/FlashOverlap.
- Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.05)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- (2 more...)
- North America > United States > Texas > Harris County > Houston (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
clarification and will make sure to improve on every aspect of our paper
We greatly appreciate the reviewers for the time and expertise they have invested in the reviews. For example, we showed in Section 5.2 that the performance of LaPerm responded monotonically w.r.t. We thank all the reviewers for the careful observations. We will revise the main text and expand the paper's references and appendix We will pursue them in future works.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (9 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Load Balancing for AI Training Workloads
McClure, Sarah, Ratnasamy, Sylvia, Shenker, Scott
We investigate the performance of various load balancing algorithms for large-scale AI training workloads that are running on dedicated infrastructure. The performance of load balancing depends on both the congestion control and loss recovery algorithms, so our evaluation also sheds light on the appropriate choices for those designs as well.
- Information Technology > Communications > Networks (1.00)
- Information Technology > Artificial Intelligence (1.00)