AITopics

2305.01588

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(6 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)

arXiv.org Machine LearningNov-9-2023

Dataset Distillation with Convexified Implicit Gradients

Loo, Noel, Hasani, Ramin, Lechner, Mathias, Rus, Daniela

We propose a new dataset distillation algorithm using reparameterization and convexification of implicit gradients (RCIG), that substantially improves the state-of-the-art. To this end, we first formulate dataset distillation as a bi-level optimization problem. Then, we show how implicit gradients can be effectively used to compute meta-gradient updates. We further equip the algorithm with a convexified approximation that corresponds to learning on top of a frozen finite-width neural tangent kernel. Finally, we improve bias in implicit gradients by parameterizing the neural network to enable analytical computation of final-layer parameters given the body parameters. RCIG establishes the new state-of-the-art on a diverse series of dataset distillation tasks. Notably, with one image per class, on resized ImageNet, RCIG sees on average a 108\% improvement over the previous state-of-the-art distillation algorithm. Similarly, we observed a 66\% gain over SOTA on Tiny-ImageNet and 37\% on CIFAR-100.

artificial intelligence, deep learning, machine learning, (17 more...)

2302.06755

Country:

Pacific Ocean > South Pacific Ocean > Coral Sea (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > New York > New York County > New York City (0.04)
(8 more...)

Genre: Research Report (1.00)

Industry:

Transportation > Ground > Road (1.00)
Transportation > Passenger (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

George, Erin, Murray, Michael, Swartworth, William, Needell, Deanna

Training shallow ReLU networks on noisy data using hinge loss: when do we overfit and is it benign?

arXiv.org Artificial IntelligenceNov-8-2023

We study benign overfitting in two-layer ReLU networks trained using gradient descent and hinge loss on noisy data for binary classification. In particular, we consider linearly separable data for which a relatively small proportion of labels are corrupted or flipped. We identify conditions on the margin of the clean data that give rise to three distinct training outcomes: benign overfitting, in which zero loss is achieved and with high probability test data is classified correctly; non-benign overfitting, in which zero loss is achieved but test data is misclassified with probability lower bounded by a constant; and non-overfitting, in which clean points, but not corrupt points, achieve zero loss and again with high probability test data is classified correctly. Our analysis provides a fine-grained description of the dynamics of neurons throughout training and reveals two distinct phases: in the first phase clean points achieve close to zero loss, in the second phase clean points oscillate on the boundary of zero loss while corrupt points either converge towards zero loss or are eventually zeroed by the network. We prove these results using a combinatorial approach that involves bounding the number of clean versus corrupt updates across these phases of training.

neuron, probability, training sample, (15 more...)

2306.09955

Country:

North America > United States (0.14)
Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Shahriari-Mehr, Firooz, Panahi, Ashkan

Double Averaging and Gradient Projection: Convergence Guarantees for Decentralized Constrained Optimization

arXiv.org Artificial IntelligenceNov-8-2023

We consider a generic decentralized constrained optimization problem over static, directed communication networks, where each agent has exclusive access to only one convex, differentiable, local objective term and one convex constraint set. For this setup, we propose a novel decentralized algorithm, called DAGP (Double Averaging and Gradient Projection), based on local gradients, projection onto local constraints, and local averaging. We achieve global optimality through a novel distributed tracking technique we call distributed null projection. Further, we show that DAGP can be used to solve unconstrained problems with non-differentiable objective terms with a problem reduction scheme. Assuming only smoothness of the objective terms, we study the convergence of DAGP and establish sub-linear rates of convergence in terms of feasibility, consensus, and optimality, with no extra assumption (e.g. strong convexity). For the analysis, we forego the difficulties of selecting Lyapunov functions by proposing a new methodology of convergence analysis in optimization problems, which we refer to as aggregate lower-bounding. To demonstrate the generality of this method, we also provide an alternative convergence proof for the standard gradient descent algorithm with smooth functions. Finally, we present numerical results demonstrating the effectiveness of our proposed method in both constrained and unconstrained problems. In particular, we propose a distributed scheme by DAGP for the optimal transport problem with superior performance and speed.

algorithm, constraint, optimization, (15 more...)

2210.03232

Country:

North America > United States > Texas > Travis County > Austin (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Busbridge, Dan, Ramapuram, Jason, Ablin, Pierre, Likhomanenko, Tatiana, Dhekane, Eeshan Gunesh, Suau, Xavier, Webb, Russ

How to Scale Your EMA

arXiv.org Machine LearningNov-7-2023

Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6$\times$ wall-clock time reduction under idealized hardware settings.

batch size, ema scaling rule, scaling rule, (14 more...)

2307.13813

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(15 more...)

Genre: Research Report > New Finding (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Puchkin, Nikita, Gorbunov, Eduard, Kutuzov, Nikolay, Gasnikov, Alexander

Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems

arXiv.org Artificial IntelligenceNov-7-2023

We consider stochastic optimization problems with heavy-tailed noise with structured density. For such problems, we show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-2(\alpha - 1)/\alpha})$, when the stochastic gradients have finite moments of order $\alpha \in (1, 2]$. In particular, our analysis allows the noise norm to have an unbounded expectation. To achieve these results, we stabilize stochastic gradients, using smoothed medians of means. We prove that the resulting estimates have negligible bias and controllable variance. This allows us to carefully incorporate them into clipped-SGD and clipped-SSTM and derive new high-probability complexity bounds in the considered setup.

exp, ln 4, mnull 1, (15 more...)

2311.04161

Country:

Asia > Russia (0.04)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Asia > Middle East > UAE (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

arXiv.org Machine LearningNov-7-2023

Loss Dynamics of Temporal Difference Reinforcement Learning

Bordelon, Blake, Masset, Paul, Kuo, Henry, Pehlevan, Cengiz

Reinforcement learning has been successful across several applications in which agents have to learn to act in environments with sparse feedback. However, despite this empirical success there is still a lack of theoretical understanding of how the parameters of reinforcement learning models and the features used to represent states interact to control the dynamics of learning. In this work, we use concepts from statistical physics, to study the typical case learning curves for temporal difference learning of a value function with linear function approximators. Our theory is derived under a Gaussian equivalence hypothesis where averages over the random trajectories are replaced with temporally correlated Gaussian feature averages and we validate our assumptions on small scale Markov Decision Processes. We find that the stochastic semi-gradient noise due to subsampling the space of possible episodes leads to significant plateaus in the value error, unlike in traditional gradient descent dynamics. We study how learning dynamics and plateaus depend on feature structure, learning rate, discount factor, and reward function. We then analyze how strategies like learning rate annealing and reward shaping can favorably alter learning dynamics and plateaus. To conclude, our work introduces new tools to open a new direction towards developing a theory of learning dynamics in reinforcement learning.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2307.04841

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.46)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

arXiv.org Machine LearningNov-7-2023

Spectral Evolution and Invariance in Linear-width Neural Networks

Wang, Zhichao, Engel, Andrew, Sarwate, Anand, Dumitriu, Ioana, Chiang, Tony

We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to network width. Empirically, we show that the spectra of weight in this high dimensional regime are invariant when trained by gradient descent for small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy tail behavior. Simple examples are provided to explain when heavy tails can have better generalizations. We exhibit different spectral properties such as invariant bulk, spike, and heavy-tailed distribution from a two-layer neural network using different training strategies, and then correlate them to the feature learning. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning.

artificial intelligence, machine learning, spectra, (16 more...)

2211.06506

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
North America > United States > Texas (0.04)
North America > United States > New Jersey (0.04)
(2 more...)

Genre: Research Report (0.81)

Industry:

Government > Regional Government > North America Government > United States Government (0.46)
Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)

Poole, William, Ouldridge, Thomas E., Gopalkrishnan, Manoj

Autonomous Learning of Generative Models with Chemical Reaction Network Ensembles

arXiv.org Artificial IntelligenceNov-6-2023

Can a micron sized sack of interacting molecules autonomously learn an internal model of a complex and fluctuating environment? We draw insights from control theory, machine learning theory, chemical reaction network theory, and statistical physics to develop a general architecture whereby a broad class of chemical systems can autonomously learn complex distributions. Our construction takes the form of a chemical implementation of machine learning's optimization workhorse: gradient descent on the relative entropy cost function. We show how this method can be applied to optimize any detailed balanced chemical reaction network and that the construction is capable of using hidden units to learn complex distributions. This result is then recast as a form of integral feedback control. Finally, due to our use of an explicit physical model of learning, we are able to derive thermodynamic costs and trade-offs associated to this process.

autonomous learning, dbcrn, reaction, (13 more...)

2311.00975

Country:

North America > United States > California > Los Angeles County > Pasadena (0.14)
Asia > India > Maharashtra > Mumbai (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(4 more...)

Genre: Research Report (0.63)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

arXiv.org Machine LearningNov-6-2023

Parameter-Agnostic Optimization under Relaxed Smoothness

Hübler, Florian, Yang, Junchi, Li, Xiang, He, Niao

Tuning hyperparameters, such as the stepsize, presents a major challenge of training machine learning models. To address this challenge, numerous adaptive optimization algorithms have been developed that achieve near-optimal complexities, even when stepsizes are independent of problem-specific parameters, provided that the loss function is $L$-smooth. However, as the assumption is relaxed to the more realistic $(L_0, L_1)$-smoothness, all existing convergence results still necessitate tuning of the stepsize. In this study, we demonstrate that Normalized Stochastic Gradient Descent with Momentum (NSGD-M) can achieve a (nearly) rate-optimal complexity without prior knowledge of any problem parameter, though this comes at the cost of introducing an exponential term dependent on $L_1$ in the complexity. We further establish that this exponential term is inevitable to such schemes by introducing a theoretical framework of lower bounds tailored explicitly for parameter-agnostic algorithms. Interestingly, in deterministic settings, the exponential factor can be neutralized by employing Gradient Descent with a Backtracking Line Search. To the best of our knowledge, these findings represent the first parameter-agnostic convergence results under the generalized smoothness condition. Our empirical experiments further confirm our theoretical insights.

algorithm, artificial intelligence, machine learning, (17 more...)

2311.03252

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > China (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)