AITopics

2404.09391

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.55)

Campos, Cédric M., de Diego, David Martín, Torrente, José

Momentum-based gradient descent methods for Lie groups

arXiv.org Artificial Intelligence Apr-14-2024

Polyak's Heavy Ball (PHB; Polyak, 1964), a.k.a. Classical Momentum, and Nesterov's Accelerated Gradient (NAG; Nesterov, 1983) are well-known examples of momentum-descent methods for optimization. While the latter outperforms the former, only generalizations of PHB-like methods to nonlinear spaces have been described in the literature. We propose here a generalization of NAG-like methods for Lie group optimization based on the variational one-to-one correspondence between classical and accelerated momentum methods (Campos et al., 2023).
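For orientation, the two Euclidean update rules this abstract contrasts can be sketched as below. This is a minimal illustration in flat space only, not the Lie group construction of the paper; the function name `momentum_descent`, the test quadratic, and the step size and momentum values are all assumptions chosen for the example.

```python
import numpy as np

def momentum_descent(grad, x0, lr, mu, steps, nesterov=False):
    """Classical (heavy-ball style) momentum or Nesterov-style momentum.

    Hypothetical helper for illustration. The only difference between the
    two variants is where the gradient is evaluated: at the current point,
    or at the look-ahead point x + mu * v.
    """
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        point = x + mu * v if nesterov else x
        v = mu * v - lr * grad(point)
        x = x + v
    return x

# Assumed test problem: an ill-conditioned quadratic f(x) = 0.5 * x^T A x.
A = np.array([1.0, 100.0])
grad = lambda x: A * x

x_cm = momentum_descent(grad, [1.0, 1.0], lr=0.01, mu=0.9, steps=1000)
x_nag = momentum_descent(grad, [1.0, 1.0], lr=0.01, mu=0.9, steps=1000, nesterov=True)
```

Both runs should approach the minimizer at the origin for these particular parameter choices; the sketch makes no claim about which variant is faster in general.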

equation, lie group, momentum-based gradient descent method, (12 more...)

2404.09363

Country:

North America > United States > New Jersey > Mercer County > Princeton (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.42)

Jiang, Xiaowen, Rodomanov, Anton, Stich, Sebastian U.

Federated Optimization with Doubly Regularized Drift Correction

arXiv.org Artificial Intelligence Apr-12-2024

Federated learning is a distributed optimization paradigm that allows training machine learning models across decentralized devices while keeping the data localized. The standard method, FedAvg, suffers from client drift which can hamper performance and increase communication costs over centralized methods. Previous works proposed various strategies to mitigate drift, yet none have shown uniformly improved communication-computation trade-offs over vanilla gradient descent. In this work, we revisit DANE, an established method in distributed optimization. We show that (i) DANE can achieve the desired communication reduction under Hessian similarity constraints. Furthermore, (ii) we present an extension, DANE+, which supports arbitrary inexact local solvers and has more freedom to choose how to aggregate the local updates. We propose (iii) a novel method, FedRed, which has improved local computational complexity and retains the same communication complexity compared to DANE/DANE+. This is achieved by using doubly regularized drift correction.
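The client drift mentioned in this abstract can be seen in a toy setting. The sketch below is plain FedAvg on scalar quadratics, not DANE, DANE+, or FedRed; the function name `fedavg`, the client objectives f_i(x) = 0.5 * a_i * (x - b_i)^2, and all numeric values are assumptions made for illustration.

```python
import numpy as np

def fedavg(a, b, lr, local_steps, rounds, x0=0.0):
    """Toy FedAvg: every client runs `local_steps` gradient steps on its own
    quadratic f_i(x) = 0.5 * a_i * (x - b_i)^2, then the server averages."""
    x = x0
    for _ in range(rounds):
        local = np.full_like(a, x, dtype=float)  # each client starts from the server model
        for _ in range(local_steps):
            local -= lr * a * (local - b)        # local gradient step per client
        x = local.mean()                         # server aggregation
    return x

a = np.array([1.0, 4.0])   # heterogeneous curvatures (assumed)
b = np.array([0.0, 1.0])   # heterogeneous local minimizers (assumed)
x_star = (a * b).sum() / a.sum()  # minimizer of the average objective

x_one_step = fedavg(a, b, lr=0.1, local_steps=1, rounds=500)
x_ten_steps = fedavg(a, b, lr=0.1, local_steps=10, rounds=500)
```

With one local step the procedure reduces to gradient descent on the average objective and reaches `x_star`. With ten local steps each client moves toward its own minimizer before averaging, so the fixed point shifts away from `x_star`; this bias is what drift-correction methods aim to remove.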

communication round, control variate, federated optimization, (6 more...)

2404.08447

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Germany > Saarland > Saarbrücken (0.04)
North America > United States > Virginia (0.04)
(2 more...)

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Bardone, Lorenzo, Goldt, Sebastian

Sliding down the stairs: how correlated latent variables accelerate learning with neural networks

arXiv.org Machine Learning Apr-12-2024

Neural networks extract features from data using stochastic gradient descent (SGD). In particular, higher-order input cumulants (HOCs) are crucial for their performance. However, extracting information from the $p$th cumulant of $d$-dimensional inputs is computationally hard: the number of samples required to recover a single direction from an order-$p$ tensor (tensor PCA) using online SGD grows as $d^{p-1}$, which is prohibitive for high-dimensional inputs. This result raises the question of how neural networks extract relevant directions from the HOCs of their inputs efficiently. Here, we show that correlations between latent variables along the directions encoded in different input cumulants speed up learning from higher-order correlations. We show this effect analytically by deriving nearly sharp thresholds for the number of samples required by a single neuron to weakly-recover these directions using online SGD from a random start in high dimensions. Our analytical results are confirmed in simulations of two-layer neural networks and unveil a new mechanism for hierarchical learning in neural networks.

latent variable, neural network, spike, (15 more...)

2404.08602

Country:

North America (0.14)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.05)
Europe > Italy > Friuli Venezia Giulia > Trieste Province > Trieste (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

arXiv.org Artificial Intelligence Apr-11-2024

Demystifying Why Local Aggregation Helps: Convergence Analysis of Hierarchical SGD

Wang, Jiayi, Wang, Shiqiang, Chen, Rong-Rong, Ji, Mingyue

Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregations. Despite recent research efforts, the effect of local aggregation on global convergence still lacks theoretical understanding. In this work, we first introduce a new notion of "upward" and "downward" divergences. We then use it to conduct a novel analysis to obtain a worst-case convergence upper bound for two-level H-SGD with non-IID data, non-convex objective function, and stochastic gradient. By extending this result to the case with random grouping, we observe that this convergence upper bound of H-SGD is between the upper bounds of two single-level local SGD settings, with the number of local iterations equal to the local and global update periods in H-SGD, respectively. We refer to this as the "sandwich behavior". Furthermore, we extend our analytical approach based on "upward" and "downward" divergences to study the convergence for the general case of H-SGD with more than two levels, where the "sandwich behavior" still holds. Our theoretical results provide key insights into why local aggregation can be beneficial in improving the convergence of H-SGD.
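The two-level aggregation schedule described in this abstract can be sketched as follows. This is a toy model under stated assumptions, not the authors' algorithm or analysis: the name `h_sgd`, the scalar per-worker objectives 0.5 * (x - target_i)^2, the grouping, and the periods are all hypothetical.

```python
import numpy as np

def h_sgd(targets, groups, lr, local_period, global_period, steps, noise=0.0, seed=0):
    """Toy two-level hierarchical SGD on scalar models.

    Worker i minimizes 0.5 * (x - targets[i])^2 with optional gradient noise.
    Every `local_period` steps each group averages its workers' models;
    every `global_period` steps all workers are averaged instead.
    """
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets, dtype=float)
    x = np.zeros_like(targets)
    for t in range(1, steps + 1):
        g = (x - targets) + noise * rng.standard_normal(x.shape)
        x = x - lr * g
        if t % global_period == 0:
            x[:] = x.mean()              # global aggregation
        elif t % local_period == 0:
            for idx in groups:
                x[idx] = x[idx].mean()   # local aggregation within a group
    return x

targets = np.array([0.0, 1.0, 4.0, 7.0])   # non-IID local optima (assumed)
groups = [[0, 1], [2, 3]]                  # two local servers (assumed)
x_final = h_sgd(targets, groups, lr=0.1, local_period=2, global_period=8, steps=400)
```

In this noise-free run all workers share the same curvature, so averaging introduces no bias and every model ends at the mean of the targets. Setting `noise` above zero or varying the two periods gives a simple way to experiment with the trade-off the paper analyzes.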

aggregation, divergence, h-sgd, (16 more...)

2010.12998

Country:

North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (0.82)

Industry: Information Technology (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Communications > Networks (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

arXiv.org Artificial Intelligence Apr-10-2024

The Sample Complexity of Gradient Descent in Stochastic Convex Optimization

Livni, Roi

Stochastic Convex Optimization (SCO) is a theoretical model of a learner that minimizes a (Lipschitz) convex function, given finitely many noisy observations of the objective [22]. While often considered simplistic, SCO has in recent years become a focus of theoretical research, partly because of its importance to the study of first-order optimization methods, and partly because it is one of the few theoretical settings that exhibit overparameterized learning. In more detail, classical learning theory often focuses on the tension between the number of samples, or training data, and the complexity of the model to be learnt. A common wisdom of classical theories [1, 7, 14, 24] is that, to avoid overfitting, the complexity of a model should be adjusted in proportion to the amount of training data. However, recent advances in Machine Learning have challenged this viewpoint. Evidently [18, 25], state-of-the-art algorithms generalize well without explicitly controlling the capacity of the model to be learnt. In turn, one of the most pressing challenges for learning theory today is to understand learnability when the number of parameters in a learnt model exceeds the number of examples and when, seemingly, nothing prevents the algorithm from overfitting. To this end, we look at SCO.

algorithm, oracle, sample complexity, (13 more...)

2404.04931

Country: Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)

arXiv.org Machine Learning Apr-10-2024

Gradient Descent is Pareto-Optimal in the Oracle Complexity and Memory Tradeoff for Feasibility Problems

Blanchard, Moise

In this paper we provide oracle complexity lower bounds for finding a point in a given set using a memory-constrained algorithm that has access to a separation oracle. We assume that the set is contained within the unit $d$-dimensional ball and contains a ball of known radius $\epsilon>0$. This setup is commonly referred to as the feasibility problem. We show that to solve feasibility problems with accuracy $\epsilon \geq e^{-d^{o(1)}}$, any deterministic algorithm either uses $d^{1+\delta}$ bits of memory or must make at least $1/(d^{0.01\delta }\epsilon^{2\frac{1-\delta}{1+1.01 \delta}-o(1)})$ oracle queries, for any $\delta\in[0,1]$. Additionally, we show that randomized algorithms either use $d^{1+\delta}$ memory or make at least $1/(d^{2\delta} \epsilon^{2(1-4\delta)-o(1)})$ queries for any $\delta\in[0,\frac{1}{4}]$. Because gradient descent only uses linear memory $\mathcal O(d\ln 1/\epsilon)$ but makes $\Omega(1/\epsilon^2)$ queries, our results imply that it is Pareto-optimal in the oracle complexity/memory tradeoff. Further, our results show that the oracle complexity for deterministic algorithms is always polynomial in $1/\epsilon$ if the algorithm has less than quadratic memory in $d$. This reveals a sharp phase transition since with quadratic $\mathcal O(d^2 \ln1/\epsilon)$ memory, cutting plane methods only require $\mathcal O(d\ln 1/\epsilon)$ queries.

exploratory query, query, subspace, (16 more...)

2404.0672

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
North America > United States > Virginia (0.04)
(4 more...)

Genre: Research Report > New Finding (0.74)

Industry: Leisure & Entertainment > Games (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)

Hu, Yaxi, Sanyal, Amartya, Schölkopf, Bernhard

Provable Privacy with Non-Private Pre-Processing

arXiv.org Machine Learning Apr-10-2024

When analysing Differentially Private (DP) machine learning pipelines, the potential privacy cost of data-dependent pre-processing is frequently overlooked in privacy accounting. In this work, we propose a general framework to evaluate the additional privacy cost incurred by non-private data-dependent pre-processing algorithms. Our framework establishes upper bounds on the overall privacy guarantees by utilising two new technical notions: a variant of DP termed Smooth DP and the bounded sensitivity of the pre-processing algorithms. In addition to the generic framework, we provide explicit overall privacy guarantees for multiple data-dependent pre-processing algorithms, such as data imputation, quantization, deduplication and PCA, when used in combination with several DP algorithms. Notably, this framework is also simple to implement, allowing direct integration into existing DP pipelines.

algorithm, dataset, sensitivity, (15 more...)

2403.13041

Country:

North America > United States (0.28)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Chen, Haotian, Kuzina, Anna, Esmaeili, Babak, Tomczak, Jakub M

Variational Stochastic Gradient Descent for Deep Neural Networks

arXiv.org Machine Learning Apr-9-2024

Optimizing deep neural networks is one of the main tasks in successful deep learning. Current state-of-the-art optimizers are adaptive gradient-based optimization methods such as Adam. Recently, there has been an increasing interest in formulating gradient-based optimizers in a probabilistic framework for better estimation of gradients and modeling uncertainties. Here, we propose to combine both approaches, resulting in the Variational Stochastic Gradient Descent (VSGD) optimizer. We model gradient updates as a probabilistic model and utilize stochastic variational inference (SVI) to derive an efficient and effective update rule. Further, we show how our VSGD method relates to other adaptive gradient-based optimizers like Adam. Lastly, we carry out experiments on two image classification datasets and four deep neural network architectures, where we show that VSGD outperforms Adam and SGD.
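As a reference point for the adaptive baseline named in this abstract, a single Adam update can be sketched as below. This is the standard Adam rule, not VSGD, whose update is not given in this listing; the helper name `adam_step` and the example values are assumptions.

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update (illustrative helper, not the VSGD rule).

    m and v are running estimates of the first and second gradient moments;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)       # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)       # bias-corrected second moment
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Assumed example: f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x0 = np.array([1.0, -2.0])
x1, m1, v1 = adam_step(x0, g=x0, m=np.zeros(2), v=np.zeros(2), t=1)
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so each coordinate moves by about `lr` regardless of gradient scale. That per-coordinate normalization is the "adaptive" behavior the abstract refers to.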

optimizer, variational stochastic gradient descent, vsgd, (10 more...)

2404.06549

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > Netherlands > North Brabant > Eindhoven (0.04)
North America > United States (0.04)
(2 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Ma, Hao, Zeilinger, Melanie, Muehlebach, Michael

Stochastic Online Optimization for Cyber-Physical and Robotic Systems

arXiv.org Artificial Intelligence Apr-8-2024

We propose a novel gradient-based online optimization framework for solving stochastic programming problems that frequently arise in the context of cyber-physical and robotic systems. Our problem formulation accommodates constraints that model the evolution of a cyber-physical system, which has, in general, a continuous state and action space, is nonlinear, and where the state is only partially observed. We also incorporate an approximate model of the dynamics as prior knowledge into the learning process and show that even rough estimates of the dynamics can significantly improve the convergence of our algorithms. Our online optimization framework encompasses both gradient descent and quasi-Newton methods, and we provide a unified convergence analysis of our algorithms in a non-convex setting. We also characterize the impact of modeling errors in the system dynamics on the convergence rate of the algorithms. Finally, we evaluate our algorithms in simulations of a flexible beam, a four-legged walking robot, and in real-world experiments with a ping-pong playing robot.

algorithm, optimization, trajectory, (16 more...)

2404.05318

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Energy (0.93)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
(2 more...)