AITopics

arXiv.org Machine LearningJun-4-2024

Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs

Arnaboldi, Luca, Dandi, Yatin, Krzakala, Florent, Loureiro, Bruno, Pesce, Luca, Stephan, Ludovic

We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned \citep{arous2021online} and $d$ is the input dimension. However, larger batch sizes than $n_b \gg d^{\frac{\ell}{2}}$ are detrimental for improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, \textit{Correlation loss SGD}, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.

batch size, correlation loss sgd, sgd, (12 more...)

arXiv.org Machine Learning

2406.02157

Country:

North America > United States (0.14)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.05)
Europe > Switzerland > Vaud > Lausanne (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Education > Educational Setting > Online (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

Contextual Dynamic Pricing: Algorithms, Optimality, and Local Differential Privacy Constraints

Zhao, Zifeng, Jiang, Feiyu, Yu, Yi

We study the contextual dynamic pricing problem where a firm sells products to $T$ sequentially arriving consumers that behave according to an unknown demand model. The firm aims to maximize its revenue, i.e. minimize its regret over a clairvoyant that knows the model in advance. The demand model is a generalized linear model (GLM), allowing for a stochastic feature vector in $\mathbb R^d$ that encodes product and consumer information. We first show that the optimal regret upper bound is of order $\sqrt{dT}$, up to a logarithmic factor, improving upon existing upper bounds in the literature by a $\sqrt{d}$ factor. This sharper rate is materialised by two algorithms: a confidence bound-type (supCB) algorithm and an explore-then-commit (ETC) algorithm. A key insight of our theoretical result is an intrinsic connection between dynamic pricing and the contextual multi-armed bandit problem with many arms based on a careful discretization. We further study contextual dynamic pricing under the local differential privacy (LDP) constraints. In particular, we propose a stochastic gradient descent based ETC algorithm that achieves an optimal regret upper bound of order $d\sqrt{T}/\epsilon$, up to a logarithmic factor, where $\epsilon>0$ is the privacy parameter. The regret upper bounds with and without LDP constraints are accompanied by newly constructed minimax lower bounds, which further characterize the cost of privacy. Extensive numerical experiments and a real data application on online lending are conducted to illustrate the efficiency and practical value of the proposed algorithms in dynamic pricing.

algorithm, assumption 2, dynamic pricing, (16 more...)

2406.02424

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Banking & Finance (0.67)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

DPDR: Gradient Decomposition and Reconstruction for Differentially Private Deep Learning

Liu, Yixuan, Xiong, Li, Liu, Yuhan, Gu, Yujie, Liu, Ruixuan, Chen, Hong

Differentially Private Stochastic Gradients Descent (DP-SGD) is a prominent paradigm for preserving privacy in deep learning. It ensures privacy by perturbing gradients with random noise calibrated to their entire norm at each training step. However, this perturbation suffers from a sub-optimal performance: it repeatedly wastes privacy budget on the general converging direction shared among gradients from different batches, which we refer as common knowledge, yet yields little information gain. Motivated by this, we propose a differentially private training framework with early gradient decomposition and reconstruction (DPDR), which enables more efficient use of the privacy budget. In essence, it boosts model utility by focusing on incremental information protection and recycling the privatized common knowledge learned from previous gradients at early training steps. Concretely, DPDR incorporates three steps. First, it disentangles common knowledge and incremental information in current gradients by decomposing them based on previous noisy gradients. Second, most privacy budget is spent on protecting incremental information for higher information gain. Third, the model is updated with the gradient reconstructed from recycled common knowledge and noisy incremental information. Theoretical analysis and extensive experiments show that DPDR outperforms state-of-the-art baselines on both convergence rate and accuracy.

dpdr, gradient, knowledge, (11 more...)

2406.02744

Country:

Asia > China (0.04)
North America > United States > Virginia (0.04)
Europe > Spain > Andalusia > Granada Province > Granada (0.04)
Asia > Japan > Kyūshū & Okinawa > Kyūshū (0.04)

Genre:

Research Report (0.64)
Workflow (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding

Li, Hongkang, Wang, Meng, Ma, Tengfei, Liu, Sijia, Zhang, Zaixi, Chen, Pin-Yu

Graph Transformers, which incorporate self-attention and positional encoding, have recently emerged as a powerful architecture for various graph learning tasks. Despite their impressive performance, the complex non-convex interactions across layers and the recursive graph structure have made it challenging to establish a theoretical foundation for learning and generalization. This study introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised node classification, comprising a self-attention layer with relative positional encoding and a two-layer perceptron. Focusing on a graph data model with discriminative nodes that determine node labels and non-discriminative nodes that are class-irrelevant, we characterize the sample complexity required to achieve a desirable generalization error by training with stochastic gradient descent (SGD). This paper provides the quantitative characterization of the sample complexity and number of iterations for convergence dependent on the fraction of discriminative nodes, the dominant patterns, and the initial model errors. Furthermore, we demonstrate that self-attention and positional encoding enhance generalization by making the attention map sparse and promoting the core neighborhood during training, which explains the superior feature representation of Graph Transformers. Our theoretical results are supported by empirical experiments on synthetic and real-world benchmarks.

graph transformer, node, theoretical dive, (12 more...)

2406.01977

Country:

Europe > Austria > Vienna (0.14)
North America > United States > New York > Suffolk County > Stony Brook (0.04)
North America > United States > Wisconsin (0.04)
(5 more...)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Choquette-Choo, Christopher A., Ganesh, Arun, Thakurta, Abhradeep

Optimal Rates for DP-SCO with a Single Epoch and Large Batches

The most common algorithms for differentially private (DP) machine learning (ML) are all based on stochastic gradient descent, for example, DP-SGD. These algorithms achieve DP by treating each gradient as an independent private query. However, this independence can cause us to overpay in privacy loss because we don't analyze the entire gradient trajectory. In this work, we propose a new DP algorithm, which we call Accelerated-DP-SRGD (DP stochastic recursive gradient descent), that enables us to break this independence and only pay for privacy in the gradient difference, i.e., in the new information at the current step. Our algorithm achieves the optimal DP-stochastic convex optimization (DP-SCO) error (up to polylog factors) using only a single epoch over the dataset, and converges at the Nesterov's accelerated rate. Our algorithm can be run in at most $\sqrt{n}$ batch gradient steps with batch size at least $\sqrt{n}$, unlike prior work which required $O(n)$ queries with mostly constant batch sizes. To achieve this, our algorithm combines three key ingredients, a variant of stochastic recursive gradients (SRG), accelerated gradient descent, and correlated noise generation from DP continual counting. Finally, we also show that our algorithm improves over existing SoTA on multi-class logistic regression on MNIST and CIFAR-10.

algorithm, gradient, theorem 4, (13 more...)

2406.02716

Country: Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

A KL-based Analysis Framework with Applications to Non-Descent Optimization Methods

Qiu, Junwen, Ma, Bohao, Li, Xiao, Milzarek, Andre

We propose a novel analysis framework for non-descent-type optimization methodologies in nonconvex scenarios based on the Kurdyka-Lojasiewicz property. Our framework allows covering a broad class of algorithms, including those commonly employed in stochastic and distributed optimization. Specifically, it enables the analysis of first-order methods that lack a sufficient descent property and do not require access to full (deterministic) gradient information. We leverage this framework to establish, for the first time, iterate convergence and the corresponding rates for the decentralized gradient method and federated averaging under mild assumptions. Furthermore, based on the new analysis techniques, we show the convergence of the random reshuffling and stochastic gradient descent method without necessitating typical a priori bounded iterates assumptions.

application, kl-based analysis framework, non-descent optimization method

2406.02273

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.53)

CADE: Cosine Annealing Differential Evolution for Spiking Neural Network

Jiang, Runhua, Du, Guodong, Yu, Shuyang, Guo, Yifei, Goh, Sim Kuan, Tang, Ho-Kin

Spiking neural networks (SNNs) have gained prominence for their potential in neuromorphic computing and energy-efficient artificial intelligence, yet optimizing them remains a formidable challenge for gradient-based methods due to their discrete, spike-based computation. This paper attempts to tackle the challenges by introducing Cosine Annealing Differential Evolution (CADE), designed to modulate the mutation factor (F) and crossover rate (CR) of differential evolution (DE) for the SNN model, i.e., Spiking Element Wise (SEW) ResNet. Extensive empirical evaluations were conducted to analyze CADE. CADE showed a balance in exploring and exploiting the search space, resulting in accelerated convergence and improved accuracy compared to existing gradient-based and DE-based methods. Moreover, an initialization method based on a transfer learning setting was developed, pretraining on a source dataset (i.e., CIFAR-10) and fine-tuning the target dataset (i.e., CIFAR-100), to improve population diversity. It was found to further enhance CADE for SNN. Remarkably, CADE elevates the performance of the highest accuracy SEW model by an additional 0.52 percentage points, underscoring its effectiveness in fine-tuning and enhancing SNNs. These findings emphasize the pivotal role of a scheduler for F and CR adjustment, especially for DE-based SNN. Source Code on Github: https://github.com/Tank-Jiang/CADE4SNN.

algorithm, neural network, snn, (15 more...)

2406.02349

Country:

Asia > Malaysia (0.05)
Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Fujian Province > Xiamen (0.04)
(4 more...)

Genre:

Research Report > New Finding (0.68)
Research Report > Promising Solution (0.46)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Warren, Houston, Oliveira, Rafael, Ramos, Fabio

Stein Random Feature Regression

arXiv.org Machine LearningJun-4-2024

In large-scale regression problems, random Fourier features (RFFs) have significantly enhanced the computational scalability and flexibility of Gaussian processes (GPs) by defining kernels through their spectral density, from which a finite set of Monte Carlo samples can be used to form an approximate low-rank GP. However, the efficacy of RFFs in kernel approximation and Bayesian kernel learning depends on the ability to tractably sample the kernel spectral measure and the quality of the generated samples. We introduce Stein random features (SRF), leveraging Stein variational gradient descent, which can be used to both generate high-quality RFF samples of known spectral densities as well as flexibly and efficiently approximate traditionally non-analytical spectral measure posteriors. SRFs require only the evaluation of log-probability gradients to perform both kernel approximation and Bayesian kernel learning that results in superior performance over traditional approaches. We empirically validate the effectiveness of SRFs by comparing them to baselines on kernel approximation and well-known GP regression problems.

inference, kernel, m-srfr, (16 more...)

arXiv.org Machine Learning

2406.00438

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(4 more...)

Genre: Research Report (0.64)

Industry: Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Vaillant, Maxime, Jeannerot, Alix, Gorce, Jean-Marie

Joint Constellation Shaping Using Gradient Descent Approach for MU-MIMO Broadcast Channel

arXiv.org Artificial IntelligenceJun-3-2024

We introduce a learning-based approach to optimize a joint constellation for a multi-user MIMO broadcast channel ($T$ Tx antennas, $K$ users, each with $R$ Rx antennas), with perfect channel knowledge. The aim of the optimizer (MAX-MIN) is to maximize the minimum mutual information between the transmitter and each receiver, under a sum-power constraint. The proposed optimization method do neither impose the transmitter to use superposition coding (SC) or any other linear precoding, nor to use successive interference cancellation (SIC) at the receiver. Instead, the approach designs a joint constellation, optimized such that its projection into the subspace of each receiver $k$, maximizes the minimum mutual information $I(W_k;Y_k)$ between each transmitted binary input $W_k$ and the output signal at the intended receiver $Y_k$. The rates obtained by our method are compared to those achieved with linear precoders.

gradient descent approach, joint constellation shaping, mu-mimo broadcast channel

2407.07708

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)