AITopics | max 4

6d0bf1265ea9635fb4f9d56f16d7efb2-Supplemental-Conference.pdf

Neural Information Processing Systems | Feb-13-2026, 18:57:56 GMT

Supplementary Materials for "Don't be so Monotone: Relaxing Stochastic Line Search in Over-Parameterized Models"

Contents:

Appendix A The Algorithm
Appendix B Convergence Rates
Appendix B.1 Rate of Convergence for Strongly Convex Functions
Appendix B.2 Rate of Convergence for Convex Functions
Appendix B.3 Rate of Convergence for Functions Satisfying the PL Condition
Appendix B.4 Common Lemmas
Appendix B.5 The Polyak Step Size is Bounded
Appendix C Experimental details
Appendix D Plots Completing the Figures in the Main Paper
Appendix D.1 Comparison between PoNoS and the state-of-the-art
Appendix D.2 A New Resetting Technique
Appendix D.3 Time Comparison
Appendix D.4 Experiments on Convex Losses
Appendix D.5 Experiments on Transformers
Appendix E Additional Plots
Appendix E.1 Study on the Choice of c: Theory (0.5) vs Practice (0.1)
Appendix E.2 Study on the Line Search Choice: Various Nonmonotone Adaptations
Appendix E.3 Zoom in on the Amount of Backtracks
Appendix E.4 Study on the Choice of η

Excerpts:

In this section, we give the details of our proposed algorithm PoNoS. Training machine learning models (e.g., neural networks) entails solving the following finite sum problem: min [...] Before that, we establish the following auxiliary result. The following Lemma shows the importance of the interpolation property. Lemma 4. We assume interpolation and that f [...] Let us now analyze case 2). Let us now show that b < 1. B.2 Rate of Convergence for Convex Functions: In this subsection, we prove a O( [...] The above bound will now also be proven for case 2).

artificial intelligence, deep learning, machine learning, (18 more...)

Neural Information Processing Systems

Country: North America > Canada > British Columbia (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

87be61bf9338389702712f5e9754a986-Paper-Conference.pdf

Neural Information Processing Systems | Oct-10-2025, 08:38:42 GMT

max 4, projection, switchhead, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Germany > Berlin (0.04)
Europe > France (0.04)
(14 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Csordás, Róbert, Piękos, Piotr, Irie, Kazuki, Schmidhuber, Jürgen

arXiv.org Artificial Intelligence | Dec-14-2023

The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length. Existing approximation methods usually underperform and fail to obtain significant speedups in practice. Here we present SwitchHead, a novel method that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of baseline Transformers with the same parameter budget. SwitchHead uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. Our novel attention can also be combined with MoE MLP layers, resulting in an efficient fully-MoE "SwitchAll" Transformer model.

Large language models (LLMs) have shown remarkable capabilities (Radford et al., 2019; Brown et al., 2020; OpenAI, 2022; 2023) and great versatility (Bubeck et al., 2023). However, training enormous Transformers (Vaswani et al., 2017; Schmidhuber, 1992) requires a considerable amount of computing power and memory, which is not accessible to most researchers, academic institutions, and even companies. Even running them in inference mode, which is much less resource-intensive, requires significant engineering effort (Gerganov, 2023). Accelerating big Transformers remains an important open research question. However, in these works, the parameter efficiency of MoEs has not been studied; MoE models have typically been compared to dense baselines with the same number of FLOPs but with far fewer parameters.

attention matrix, max 4, switchhead, (14 more...)

arXiv.org Artificial Intelligence

2312.07987

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Jordan (0.04)
(11 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.86)

Convergence Results For Q-Learning With Experience Replay

Szlak, Liran, Shamir, Ohad

arXiv.org Artificial Intelligence | Dec-8-2021

Q-learning is a well-known and commonly used algorithm for reinforcement learning. In recent years, a technique referred to as experience replay [9, 11] has been suggested as a mechanism to improve Q-learning by allowing the learner to access previous experiences and use them offline as if they were examples currently sampled from the world. It has been suggested that using past experiences in this way might allow Q-learning to converge better to the optimal Q values by breaking the time and space correlation structure of experiences as they are sampled from the real world, so that policy updates do not depend on the current time and location in the Markov decision process. Moreover, experience replay improves the efficiency of data usage, since every experience is used for learning more than once. This can be useful in situations where data acquisition is costly or difficult.
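The replay mechanism described in this abstract can be illustrated with a minimal tabular sketch. It is not the algorithm analyzed in the paper. The chain environment `step`, the buffer size, the number of replayed transitions per step, and the values of `alpha` and `gamma` are all hypothetical choices made only for illustration.

```python
import random
from collections import defaultdict, deque


def q_update(Q, transition, alpha=0.5, gamma=0.9, n_actions=2):
    """Apply one standard Q-learning update to a (s, a, r, s_next, done) tuple."""
    s, a, r, s_next, done = transition
    # Terminal transitions bootstrap from the reward only.
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def step(state, action, n_states=4):
    """Hypothetical chain environment: action 1 moves right, action 0 moves left.
    Reaching the last state gives reward 1 and ends the episode."""
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = nxt == n_states - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=200, replay_per_step=4, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)          # Q-table, all values start at 0
    buffer = deque(maxlen=1000)     # replay buffer of past transitions
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.randrange(2)    # purely random exploration (assumption)
            s_next, r, done = step(s, a)
            transition = (s, a, r, s_next, done)
            buffer.append(transition)
            q_update(Q, transition)  # online update on the fresh sample
            # Replay: reuse stored transitions as if they were sampled now.
            k = min(replay_per_step, len(buffer))
            for old in rng.sample(list(buffer), k):
                q_update(Q, old)
            s = s_next
    return Q


Q = train()
```

Each transition is used once when it is observed and then again whenever it is drawn from the buffer, which is the data-reuse effect the abstract refers to. Sampling uniformly from the buffer also decouples the update order from the order in which states were visited.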

experience replay, iteration, state-action pair, (14 more...)

arXiv.org Artificial Intelligence

2112.04213

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

The Effect of Network Width on the Performance of Large-batch Training

Chen, Lingjiao, Wang, Hongyi, Zhao, Jinman, Papailiopoulos, Dimitris, Koutris, Paraschos

arXiv.org Machine Learning | Jun-10-2018

Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that, for a fixed number of parameters, wider networks are more amenable to fast large-batch training than deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.

artificial intelligence, machine learning, neural network, (17 more...)

arXiv.org Machine Learning

1806.03791

Country: North America > United States > Wisconsin > Dane County > Madison (0.04)

Genre: Research Report (1.00)

Technology: