On the Role of Batch Size in Stochastic Conditional Gradient Methods
Rustem Islamov, Roman Machacek, Aurelien Lucchi, Antonio Silveti-Falls, Eduard Gorbunov, Volkan Cevher
We study the role of batch size in stochastic conditional gradient methods under a $\mu$-Kurdyka-Łojasiewicz ($\mu$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy, but beyond a critical threshold the benefits saturate, and further increases can even degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases the batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch-size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.
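The abstract's two ingredients, a momentum-based stochastic conditional gradient update and a schedule that grows the batch size during training, can be pictured with a minimal sketch. The template below follows the generic momentum Frank-Wolfe recipe; the `lmo_l2_ball` oracle, the doubling schedule, and every hyperparameter are illustrative assumptions, not the paper's Scion configuration or analysis.

```python
import numpy as np

def lmo_l2_ball(d, radius=1.0):
    """LMO over an l2 ball: argmin over ||s|| <= radius of <d, s>."""
    norm = np.linalg.norm(d)
    return -radius * d / norm if norm > 0 else np.zeros_like(d)

def train(x, stoch_grad, steps, alpha=0.1, gamma=0.01,
          batch_size=64, ramp_every=1000, max_batch=4096):
    # stoch_grad(x, batch_size) is a user-supplied minibatch gradient oracle.
    d = np.zeros_like(x)                       # momentum gradient estimate
    for t in range(steps):
        g = stoch_grad(x, batch_size)          # minibatch gradient at x
        d = (1 - alpha) * d + alpha * g        # momentum averaging
        x = x + gamma * (lmo_l2_ball(d) - x)   # conditional gradient step
        if (t + 1) % ramp_every == 0:          # adaptive batch-size ramp,
            batch_size = min(2 * batch_size, max_batch)  # doubled here for illustration
    return x
```

In this template the batch size enters only through the variance of `g`, which is how the stepsize-batch-size-noise interaction described in the abstract arises.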
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Search for Efficient Large Language Models
Large Language Models (LLMs) have long held sway in the realm of artificial intelligence research. Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscores the redundancy in LLMs. However, most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures. Moreover, traditional architecture search methods, limited by the elevated complexity of models with extensive parameters, struggle to demonstrate their effectiveness on LLMs. In this paper, we propose a training-free architecture search framework to identify optimal subnets that preserve the fundamental strengths of the original LLMs while achieving inference acceleration. Furthermore, after generating subnets that inherit specific weights from the original LLMs, we introduce a reformation algorithm that utilizes the omitted weights to rectify the inherited weights with a small amount of calibration data. Compared with SOTA training-free structured pruning works that can generate smaller networks, our method demonstrates superior performance across standard benchmarks. Furthermore, our generated subnets can directly reduce GPU memory usage and achieve inference acceleration.
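The reformation step can be pictured as calibration-based least-squares rectification: after a subnet inherits a subset of a layer's weights, the kept weights are adjusted so the pruned layer reproduces the original layer's outputs on a small calibration batch. The sketch below is one plausible instantiation for a linear layer under input-channel pruning; the paper's actual reformation algorithm may differ, and `reform_linear`, the ridge regularizer, and the shapes are assumptions.

```python
import numpy as np

def reform_linear(W, keep_idx, X_calib, lam=1e-4):
    """W: (out, in) original weights; keep_idx: kept input channels;
    X_calib: (n, in) calibration activations.
    Returns rectified weights of shape (out, len(keep_idx))."""
    Y = X_calib @ W.T                  # original layer outputs to match
    Xk = X_calib[:, keep_idx]          # activations of the kept channels
    # Ridge least squares: argmin_Wk ||Xk @ Wk.T - Y||_F^2 + lam*||Wk||_F^2,
    # so the omitted weights' contribution (inside Y) rectifies the kept ones.
    A = Xk.T @ Xk + lam * np.eye(len(keep_idx))
    return np.linalg.solve(A, Xk.T @ Y).T
```

Because the target `Y` is computed with the full weight matrix, information from the omitted weights is folded into the rectified kept weights, matching the abstract's description at a high level.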
- Information Technology (0.67)
- Government (0.46)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation (Supplementary Material)
Yingyi Chen
Comments on Theorem 3.2. With the primal problem in (6) in the paper, Theorem 3.2 provides … Additionally, [27] presents the optimization w.r.t. a single projection direction in … Therefore, our KSVD is more general in the data setups. In Remark 3.3, we show that the values can be regarded as playing the role of the dual variables. Using data-dependent projection weights does not affect the derivation of the shifted eigenvalue problem in the dual. With the derivations of the primal-dual optimization problems above, the primal-dual model representation of our KSVD problem can be provided correspondingly. Lemma 4.2 evaluates the objective value … Moreover, as in the proof of Theorem 3.2, we note that the regularization coefficient …

Implementation details. This section provides the implementation details of all experiments included in the paper; these are illustrated in detail in the following.

Algorithm 1: Learning with Primal-Attention. Require: X := [x …

UEA Time Series. The UEA time series benchmark [31] consists of 30 datasets. Following the setup in [11], we select 10 datasets for evaluation.
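For orientation, the primal representation that Algorithm 1 trains with replaces dual (softmax-kernel) attention by two projection scores computed directly on query-side and key-side feature maps. The sketch below shows the shape of that computation only, under the assumption of linear feature maps; all names and shapes are illustrative and it is not the paper's Algorithm 1.

```python
import numpy as np

def primal_attention(X, Wq, Wk, We, Wr):
    """X: (n, d) tokens; Wq, Wk: (d, p) feature maps;
    We, Wr: (p, s) primal projection weights."""
    phi_q = X @ Wq                  # query-side features, (n, p)
    phi_k = X @ Wk                  # key-side features,   (n, p)
    e = phi_q @ We                  # primal scores, query side, (n, s)
    r = phi_k @ Wr                  # primal scores, key side,   (n, s)
    return np.concatenate([e, r], axis=-1)   # (n, 2s) attention output
```

The appeal of the primal form is that the output is linear in the number of tokens: no n-by-n attention matrix is materialized, which is what the KSVD regularization in the dual discussion above makes possible.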