AITopics

2304.14907

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)

arXiv.org Artificial IntelligenceApr-28-2023

On Underdamped Nesterov's Acceleration

Chen, Shuo, Shi, Bin, Yuan, Ya-xiang

The high-resolution differential equation framework has been proven to be tailor-made for Nesterov's accelerated gradient descent method~(\texttt{NAG}) and its proximal correspondence -- the class of faster iterative shrinkage thresholding algorithms (FISTA). However, the systems of theories is not still complete, since the underdamped case ($r < 2$) has not been included. In this paper, based on the high-resolution differential equation framework, we construct the new Lyapunov functions for the underdamped case, which is motivated by the power of the time $t^{\gamma}$ or the iteration $k^{\gamma}$ in the mixed term. When the momentum parameter $r$ is $2$, the new Lyapunov functions are identical to the previous ones. These new proofs do not only include the convergence rate of the objective value previously obtained according to the low-resolution differential equation framework but also characterize the convergence rate of the minimal gradient norm square. All the convergence rates obtained for the underdamped case are continuously dependent on the parameter $r$. In addition, it is observed that the high-resolution differential equation approximately simulates the convergence behavior of~\texttt{NAG} for the critical case $r=-1$, while the low-resolution differential equation degenerates to the conservative Newton's equation. The high-resolution differential equation framework also theoretically characterizes the convergence rates, which are consistent with that obtained for the underdamped case with $r=-1$.

artificial intelligence, convergence rate, machine learning, (16 more...)

2304.14642

Country:

Asia > Middle East > Jordan (0.05)
Asia > China > Beijing > Beijing (0.04)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
(2 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Mlodozeniec, Bruno, Reisser, Matthias, Louizos, Christos

Hyperparameter Optimization through Neural Network Partitioning

arXiv.org Artificial IntelligenceApr-28-2023

Well-tuned hyperparameters are crucial for obtaining good generalization behavior in neural networks. They can enforce appropriate inductive biases, regularize the model and improve performance -- especially in the presence of limited data. In this work, we propose a simple and efficient way for optimizing hyperparameters inspired by the marginal likelihood, an optimization objective that requires no validation data. Each partition is associated with and optimized only on specific data shards. Combining these partitions into subnetworks allows us to define the "out-of-training-sample" loss of a subnetwork, i.e., the loss on data shards unseen by the subnetwork, as the objective for hyperparameter optimization. We demonstrate that we can apply this objective to optimize a variety of different hyperparameters in a single training run while being significantly computationally cheaper than alternative methods aiming to optimize the marginal likelihood for neural networks. Lastly, we also focus on optimizing hyperparameters in federated learning, where retraining and cross-validation are particularly challenging. Due to their remarkable generalization capabilities, deep neural networks have become the de-facto models for a wide range of complex tasks. Combining large models, large-enough datasets, and sufficient computing capabilities enable researchers to train powerful models through gradient descent. Regardless of the data regime, however, the choice of hyperparameters -- such as neural architecture, data augmentation strategies, regularization, or which optimizer to choose -- plays a crucial role in the final model's generalization capabilities. Hyperparameters allow encoding good inductive biases that effectively constrain the models' hypothesis space (e.g., convolutions for vision tasks), speed up learning, or prevent overfitting in the case of limited data. Whereas gradient descent enables the tuning of model parameters, accessing hyperparameter gradients is more complicated. This approach inherently requires training multiple models and consequently requires spending resources on models that will be discarded. Furthermore, traditional tuning requires a validation set since optimizing the hyperparameters on the training set alone cannot identify the right inductive biases. A canonical example is data augmentations -- they are not expected to improve training set performance, but they greatly help with generalization. In the low data regime, defining a validation set that cannot be used for tuning model parameters is undesirable. Picking the right amount of validation data is a hyperparameter in itself. The conventional rule of thumb to use 10% of all data can result in significant overfitting, as pointed out by Lorraine et al. (2019), when one has a sufficiently large number of hyperparameters to tune.

artificial intelligence, deep learning, machine learning, (18 more...)

2304.14766

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Virginia (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Roy, Abhishek, Mohapatra, Prasant

Fairness Uncertainty Quantification: How certain are you that the model is fair?

Fairness-aware machine learning has garnered significant attention in recent years because of extensive use of machine learning in sensitive applications like judiciary systems. Various heuristics, and optimization frameworks have been proposed to enforce fairness in classification \cite{del2020review} where the later approaches either provides empirical results or provides fairness guarantee for the exact minimizer of the objective function \cite{celis2019classification}. In modern machine learning, Stochastic Gradient Descent (SGD) type algorithms are almost always used as training algorithms implying that the learned model, and consequently, its fairness properties are random. Hence, especially for crucial applications, it is imperative to construct Confidence Interval (CI) for the fairness of the learned model. In this work we provide CI for test unfairness when a group-fairness-aware, specifically, Disparate Impact (DI), and Disparate Mistreatment (DM) aware linear binary classifier is trained using online SGD-type algorithms. We show that asymptotically a Central Limit Theorem holds for the estimated model parameter of both DI and DM-aware models. We provide online multiplier bootstrap method to estimate the asymptotic covariance to construct online CI. To do so, we extend the known theoretical guarantees shown on the consistency of the online bootstrap method for unconstrained SGD to constrained optimization which could be of independent interest. We illustrate our results on synthetic and real datasets.

algorithm, artificial intelligence, machine learning, (18 more...)

2304.1395

Country:

North America > United States > California > Yolo County > Davis (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Law (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Kunstner, Frederik, Chen, Jacques, Lavington, Jonathan Wilder, Schmidt, Mark

Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be

The success of the Adam optimizer on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, our theoretical understanding of this discrepancy is lagging, preventing the development of significant improvements on either algorithm. Recent work advances the hypothesis that Adam and other heuristics like gradient clipping outperform SGD on language tasks because the distribution of the error induced by sampling has heavy tails. This suggests that Adam outperform SGD because it uses a more robust gradient estimate. We evaluate this hypothesis by varying the batch size, up to the entire dataset, to control for stochasticity. We present evidence that stochasticity and heavy-tailed noise are not major factors in the performance gap between SGD and Adam. Rather, Adam performs better as the batch size increases, while SGD is less effective at taking advantage of the reduction in noise. This raises the question as to why Adam outperforms SGD in the full-batch setting. Through further investigation of simpler variants of SGD, we find that the behavior of Adam with large batches is similar to sign descent with momentum.

artificial intelligence, machine learning, natural language, (16 more...)

2304.1396

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > Canada > British Columbia (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Wiedmann, Marco, Hölle, Marc, Periyasamy, Maniraman, Meyer, Nico, Ufrecht, Christian, Scherer, Daniel D., Plinge, Axel, Mutschler, Christopher

An Empirical Comparison of Optimizers for Quantum Machine Learning with SPSA-based Gradients

VQA have attracted a lot of attention from the quantum computing community for the last few years. Their hybrid quantum-classical nature with relatively shallow quantum circuits makes them a promising platform for demonstrating the capabilities of NISQ devices. Although the classical machine learning community focuses on gradient-based parameter optimization, finding near-exact gradients for VQC with the parameter-shift rule introduces a large sampling overhead. Therefore, gradient-free optimizers have gained popularity in quantum machine learning circles. Among the most promising candidates is the SPSA algorithm, due to its low computational cost and inherent noise resilience. We introduce a novel approach that uses the approximated gradient from SPSA in combination with state-of-the-art gradient-based classical optimizers. We demonstrate numerically that this outperforms both standard SPSA and the parameter-shift rule in terms of convergence rate and absolute error in simple regression tasks. The improvement of our novel approach over SPSA with stochastic gradient decent is even amplified when shot- and hardware-noise are taken into account. We also demonstrate that error mitigation does not significantly affect our results.

artificial intelligence, machine learning, optimizer, (17 more...)

2305.00224

Country: Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Elsayed, Mohamed, Mahmood, A. Rupam

Utility-based Perturbed Gradient Descent: An Optimizer for Continual Learning

Modern representation learning methods often struggle to adapt quickly under non-stationarity because they suffer from catastrophic forgetting and decaying plasticity. Such problems prevent learners from fast adaptation since they may forget useful features or have difficulty learning new ones. Hence, these methods are rendered ineffective for continual learning. This paper proposes Utility-based Perturbed Gradient Descent (UPGD), an online learning algorithm well-suited for continual learning agents. UPGD protects useful weights or features from forgetting and perturbs less useful ones based on their utilities. Our empirical results show that UPGD helps reduce forgetting and maintain plasticity, enabling modern representation learning methods to work effectively in continual learning.

artificial intelligence, machine learning, upgd, (13 more...)

2302.03281

Country:

North America > Canada > Ontario > Toronto (0.04)
North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)

Genre: Research Report > New Finding (0.87)

Industry: Education > Educational Setting (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.87)

arXiv.org Artificial IntelligenceApr-26-2023

Killing Two Birds with One Stone: Quantization Achieves Privacy in Distributed Learning

Yan, Guangfeng, Li, Tan, Wu, Kui, Song, Linqi

Communication efficiency and privacy protection are two critical issues in distributed machine learning. Existing methods tackle these two issues separately and may have a high implementation complexity that constrains their application in a resource-limited environment. We propose a comprehensive quantization-based solution that could simultaneously achieve communication efficiency and privacy protection, providing new insights into the correlated nature of communication and privacy. Specifically, we demonstrate the effectiveness of our proposed solutions in the distributed stochastic gradient descent (SGD) framework by adding binomial noise to the uniformly quantized gradients to reach the desired differential privacy level but with a minor sacrifice in communication efficiency. We theoretically capture the new trade-offs between communication, privacy, and learning performance.

artificial intelligence, machine learning, optimization problem, (16 more...)

2304.13545

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Velikanov, Maksim, Yarotsky, Dmitry

Tight Convergence Rate Bounds for Optimization Under Power Law Spectral Conditions

arXiv.org Artificial IntelligenceApr-26-2023

Performance of optimization on quadratic problems sensitively depends on the low-lying part of the spectrum. For large (effectively infinite-dimensional) problems, this part of the spectrum can often be naturally represented or approximated by power law distributions, resulting in power law convergence rates for iterative solutions of these problems by gradient-based algorithms. In this paper, we propose a new spectral condition providing tighter upper bounds for problems with power law optimization trajectories. We use this condition to build a complete picture of upper and lower bounds for a wide range of optimization algorithms -- Gradient Descent, Steepest Descent, Heavy Ball, and Conjugate Gradients -- with an emphasis on the underlying schedules of learning rate and momentum. In particular, we demonstrate how an optimally accelerated method, its schedule, and convergence upper bound can be obtained in a unified manner for a given shape of the spectrum. Also, we provide first proofs of tight lower bounds for convergence rates of Steepest Descent and Conjugate Gradients under spectral power laws with general exponents. Our experiments show that the obtained convergence bounds and acceleration strategies are not only relevant for exactly quadratic optimization problems, but also fairly accurate when applied to the training of neural networks.

artificial intelligence, machine learning, polynomial, (14 more...)

2202.00992

Country:

Asia > Russia (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > New York (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

arXiv.org Artificial IntelligenceApr-25-2023

Restarted Nonconvex Accelerated Gradient Descent: No More Polylogarithmic Factor in the $O(\epsilon^{-7/4})$ Complexity

Li, Huan, Lin, Zhouchen

This paper studies accelerated gradient methods for nonconvex optimization with Lipschitz continuous gradient and Hessian. We propose two simple accelerated gradient methods, restarted accelerated gradient descent (AGD) and restarted heavy ball (HB) method, and establish that our methods achieve an $\epsilon$-approximate first-order stationary point within $O(\epsilon^{-7/4})$ number of gradient evaluations by elementary proofs. Theoretically, our complexity does not hide any polylogarithmic factors, and thus it improves over the best known one by the $O(\log\frac{1}{\epsilon})$ factor. Our algorithms are simple in the sense that they only consist of Nesterov's classical AGD or Polyak's HB iterations, as well as a restart mechanism. They do not invoke negative curvature exploitation or minimization of regularized surrogate functions as the subroutines. In contrast with existing analysis, our elementary proofs use less advanced techniques and do not invoke the analysis of strongly convex AGD or HB. Code is avaliable at https://github.com/lihuanML/RestartAGD.

artificial intelligence, iteration, machine learning, (15 more...)

2201.11411

Country:

Asia > Middle East > Jordan (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)
(2 more...)

Genre: Research Report (0.70)

Industry: Leisure & Entertainment (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.73)