AITopics

2406.10455

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > California > Yolo County > Davis (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
(2 more...)

arXiv.org Artificial Intelligence Jun-14-2024

The duality structure gradient descent algorithm: analysis and applications to neural networks

Flynn, Thomas

The training of machine learning models is typically carried out using some form of gradient descent, often with great success. However, non-asymptotic analyses of first-order optimization algorithms typically employ a gradient smoothness assumption (formally, Lipschitz continuity of the gradient) that is too strong to be applicable in the case of deep neural networks. To address this, we propose an algorithm named duality structure gradient descent (DSGD) that is amenable to non-asymptotic performance analysis, under mild assumptions on the training set and network architecture. The algorithm can be viewed as a form of layer-wise coordinate descent, where at each iteration the algorithm chooses one layer of the network to update. The layer to update is chosen greedily, based on a rigorous lower bound on the improvement of the objective function for each choice of layer. In the analysis, we bound the time required to reach approximate stationary points, in both the deterministic and stochastic settings. The convergence is measured in terms of a parameter-dependent family of norms that is derived from the network architecture and designed to confirm a smoothness-like property on the gradient of the training loss function. We empirically demonstrate the behavior of DSGD in several neural network training scenarios.
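As a rough illustration of the greedy layer-wise selection described in this abstract, the Python sketch below updates, at each iteration, only the layer whose step has the largest guaranteed decrease. The bound used here, ||g_l||^2 / (2 L_l) with hypothetical per-layer smoothness constants L_l, is the textbook block coordinate descent bound; it stands in for the paper's architecture-derived bound, which the abstract does not spell out. The function names and the toy quadratic objective are assumptions made for illustration only.

```python
def greedy_layerwise_step(params, grads, smoothness):
    """Update only the layer with the largest guaranteed decrease.

    params, grads: lists of per-layer lists of floats.
    smoothness: hypothetical per-layer constants L_l > 0 (assumed known).
    Returns the index of the layer that was updated.
    """
    # Textbook lower bound on the decrease from a 1/L_l step on layer l.
    bounds = [sum(g * g for g in layer) / (2.0 * L)
              for layer, L in zip(grads, smoothness)]
    best = max(range(len(bounds)), key=bounds.__getitem__)
    step = 1.0 / smoothness[best]
    params[best] = [p - step * g for p, g in zip(params[best], grads[best])]
    return best


def toy_gradients(params, curvatures):
    # Gradient of f(w) = sum_l 0.5 * c_l * ||w_l||^2, a toy stand-in for a loss.
    return [[c * p for p in layer] for layer, c in zip(params, curvatures)]


curvatures = [1.0, 4.0]          # made-up per-layer curvature
params = [[1.0, -1.0], [2.0]]    # two toy "layers"
chosen = []
for _ in range(4):
    grads = toy_gradients(params, curvatures)
    chosen.append(greedy_layerwise_step(params, grads, curvatures))
```

On this toy problem the second layer has the larger bound and is updated first, after which the first layer is updated; both reach the minimizer at zero.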

algorithm, derivative, gradient descent, (15 more...)

1708.00523

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(3 more...)

Genre: Research Report (0.81)

Industry: Education > Educational Technology > Educational Software > Computer Based Training (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Machine Learning Jun-14-2024

Improved Particle Approximation Error for Mean Field Neural Networks

Nitanda, Atsushi

Mean-field Langevin dynamics (MFLD) minimizes an entropy-regularized nonlinear convex functional defined over the space of probability distributions. MFLD has gained attention due to its connection with noisy gradient descent for mean-field two-layer neural networks. Unlike standard Langevin dynamics, the nonlinearity of the objective functional induces particle interactions, necessitating multiple particles to approximate the dynamics in a finite-particle setting. Recent works (Chen et al., 2022; Suzuki et al., 2023b) have demonstrated the uniform-in-time propagation of chaos for MFLD, showing that the gap between the particle system and its mean-field limit uniformly shrinks over time as the number of particles increases. In this work, we improve the dependence on logarithmic Sobolev inequality (LSI) constants in their particle approximation errors, which can exponentially deteriorate with the regularization coefficient. Specifically, we establish an LSI-constant-free particle approximation error concerning the objective gap by leveraging the problem structure in risk minimization. As an application, we demonstrate improved convergence of MFLD, a sampling guarantee for the mean-field stationary distribution, and uniform-in-time Wasserstein propagation of chaos in terms of particle complexity.

approximation error, inequality, mfld, (16 more...)

2405.15767

Country: Asia > China (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Bonet, Clément, Uscidda, Théo, David, Adam, Aubin-Frankowski, Pierre-Cyril, Korba, Anna

Mirror and Preconditioned Gradient Descent in Wasserstein Space

arXiv.org Artificial Intelligence Jun-13-2024

As the problem of minimizing functionals on the Wasserstein space encompasses many applications in machine learning, different optimization algorithms on $\mathbb{R}^d$ have been given counterparts on the Wasserstein space. We focus here on lifting two explicit algorithms: mirror descent and preconditioned gradient descent. These algorithms have been introduced to better capture the geometry of the function to minimize and are provably convergent under appropriate (namely relative) smoothness and convexity conditions. Adapting these notions to the Wasserstein space, we prove guarantees of convergence of some Wasserstein-gradient-based discrete-time schemes for new pairings of objective functionals and regularizers. The difficulty here is to carefully select along which curves the functionals should be smooth and convex. We illustrate the advantages of adapting the geometry induced by the regularizer on ill-conditioned optimization tasks, and showcase the improvement from choosing different discrepancies and geometries in a computational biology task of aligning single cells.
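For orientation, here is a minimal Python sketch of classical mirror descent in its Euclidean-space form, i.e. the kind of algorithm on $\mathbb{R}^d$ that the abstract says is being lifted. It uses the negative-entropy mirror map on the probability simplex (the exponentiated-gradient update). It is not the Wasserstein scheme of the paper; the function name, the linear toy objective, and the step size are assumptions chosen for illustration.

```python
import math


def entropic_mirror_descent(grad, x0, step, iters):
    """Mirror descent on the probability simplex with the negative-entropy
    mirror map, i.e. the exponentiated-gradient update."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        # Gradient step taken in the dual (log) coordinates ...
        w = [xi * math.exp(-step * gi) for xi, gi in zip(x, g)]
        # ... followed by the KL (Bregman) projection, which is a renormalization.
        total = sum(w)
        x = [wi / total for wi in w]
    return x


# Toy linear objective f(x) = <costs, x>; on the simplex its minimizer puts
# all mass on the smallest cost (index 1). The cost values are made up.
costs = [3.0, 1.0, 2.0]
x = entropic_mirror_descent(lambda point: costs, [1 / 3] * 3, step=0.5, iters=200)
```

The iterates stay on the simplex without an explicit Euclidean projection, which is the usual motivation for matching the mirror map to the geometry of the constraint set.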

bregman divergence, convergence, convex, (12 more...)

2406.08938

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > Vietnam (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.63)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.65)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)

arXiv.org Artificial Intelligence Jun-13-2024

CircuitVAE: Efficient and Scalable Latent Circuit Optimization

Song, Jialin, Swope, Aidan, Kirby, Robert, Roy, Rajarshi, Godil, Saad, Raiman, Jonathan, Catanzaro, Bryan

Automatically designing fast and space-efficient digital circuits is challenging because circuits are discrete, must exactly implement the desired logic, and are costly to simulate. We address these challenges with CircuitVAE, a search algorithm that embeds computation graphs in a continuous space and optimizes a learned surrogate of physical simulation by gradient descent. By carefully controlling overfitting of the simulation surrogate and ensuring diverse exploration, our algorithm is highly sample-efficient, yet gracefully scales to large problem instances and high sample budgets. We test CircuitVAE by designing binary adders across a large range of sizes, IO timing constraints, and sample budgets. Our method excels at designing large circuits, where other algorithms struggle: compared to reinforcement learning and genetic algorithms, CircuitVAE typically finds 64-bit adders which are smaller and faster using less than half the sample budget. We also find CircuitVAE can design state-of-the-art adders in a real-world chip, demonstrating that our method can outperform commercial tools in a realistic setting.

adder, circuitvae, optimization, (13 more...)

doi: 10.1145/3649329.3656543

2406.09535

Country:

North America > United States > California > San Francisco County > San Francisco (0.16)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.05)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.82)

Industry: Semiconductors & Electronics (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

arXiv.org Artificial Intelligence Jun-13-2024

A More Practical Approach to Machine Unlearning

Zagardo, David

Machine learning models often incorporate vast amounts of data, raising significant privacy concerns. Machine unlearning, the ability to remove the influence of specific data points from a trained model, addresses these concerns. This paper explores practical methods for implementing machine unlearning, focusing on a first-epoch gradient-ascent approach. Key findings include:

1. Single vs. Multi-Epoch Unlearning: First-epoch gradient unlearning is more effective than multi-epoch gradients.
2. Layer-Based Unlearning: The embedding layer in GPT-2 is crucial for effective unlearning. Gradients from the output layers (11 and 12) have no impact. Efficient unlearning can be achieved using only the embedding layer, halving space complexity.
3. Influence Functions & Scoring: Techniques like Hessian Vector Product and the dot product of activations and tensors are used for quantifying unlearning.
4. Gradient Ascent Considerations: Calibration is necessary to avoid overexposing the model to specific data points during unlearning, which could prematurely terminate the process.
5. Fuzzy Matching vs. Iterative Unlearning: Fuzzy matching techniques shift the model to a new optimum, while iterative unlearning provides a more complete modality.

Our empirical evaluation confirms that first-epoch gradient ascent for machine unlearning is more effective than whole-model gradient ascent. These results highlight the potential of machine unlearning for enhancing data privacy and compliance with regulations such as GDPR and CCPA. The study underscores the importance of formal methods to comprehensively evaluate the unlearning process.
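To make the gradient-ascent idea concrete, the Python sketch below shows generic gradient ascent on a "forget" example for a one-parameter linear model. It is only a toy: it does not reproduce the paper's first-epoch procedure or its GPT-2 experiments, and the stopping threshold `max_loss` is a hypothetical stand-in for the calibration the abstract mentions. All names and numbers are assumptions.

```python
def squared_loss(w, x, y):
    return 0.5 * (w * x - y) ** 2


def loss_grad(w, x, y):
    # Derivative of 0.5 * (w*x - y)^2 with respect to w.
    return (w * x - y) * x


def unlearn_by_gradient_ascent(w, forget_set, lr, max_loss, max_steps=1000):
    """Ascend the loss on the forget set, stopping once the average loss
    reaches max_loss (a hypothetical calibration threshold)."""
    steps = 0
    while steps < max_steps:
        avg_loss = sum(squared_loss(w, x, y) for x, y in forget_set) / len(forget_set)
        if avg_loss >= max_loss:
            break
        g = sum(loss_grad(w, x, y) for x, y in forget_set) / len(forget_set)
        if g == 0.0:
            break  # stationary point: ascent has no direction to move in
        w += lr * g  # plus sign: ascent, the opposite of a training step
        steps += 1
    return w, steps


forget_set = [(1.0, 2.0)]   # made-up (x, y) pair to be forgotten
w_trained = 1.9             # pretend weight after ordinary training
w_unlearned, steps = unlearn_by_gradient_ascent(
    w_trained, forget_set, lr=0.5, max_loss=0.5)
```

The threshold bounds how far the model is pushed away from the forgotten example; without some such limit, the ascent would keep increasing the loss indefinitely.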

dataset, gradient, unlearning, (14 more...)

2406.09391

Country:

Europe > France (0.04)
Asia > China (0.04)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.69)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
(2 more...)

Ahmad, Tamim El, Yang, Junjie, Laforgue, Pierre, d'Alché-Buc, Florence

Deep Sketched Output Kernel Regression for Structured Prediction

arXiv.org Machine Learning Jun-13-2024

By leveraging the kernel trick in the output space, kernel-induced losses provide a principled way to define structured output prediction tasks for a wide variety of output modalities. In particular, they have been successfully used in the context of surrogate non-parametric regression, where the kernel trick is typically exploited in the input space as well. However, when inputs are images or texts, more expressive models such as deep neural networks seem more suited than non-parametric methods. In this work, we tackle the question of how to train neural networks to solve structured output prediction tasks, while still benefiting from the versatility and relevance of kernel-induced losses. We design a novel family of deep neural architectures, whose last layer predicts in a data-dependent finite-dimensional subspace of the infinite-dimensional output feature space deriving from the kernel-induced loss. This subspace is chosen as the span of the eigenfunctions of a randomly-approximated version of the empirical kernel covariance operator. Interestingly, this approach unlocks the use of gradient descent algorithms (and consequently of any neural architecture) for structured prediction. Experiments on synthetic tasks as well as real-world supervised graph prediction problems show the relevance of our method.

international conference, kernel, prediction, (9 more...)

2406.09253

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(6 more...)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.46)

Azizian, Waïss, Iutzeler, Franck, Malick, Jérôme, Mertikopoulos, Panayotis

What is the long-run distribution of stochastic gradient descent? A large deviations analysis

arXiv.org Machine Learning Jun-13-2024

In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem's state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem's critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with frequency that is exponentially proportional to their energy level; and, finally (d) any component of local maximizers or saddle points is "dominated" by a component of local minimizers which is visited exponentially more often.
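Read informally, the Boltzmann-Gibbs picture in this abstract can be written as follows. The notation is an assumption introduced here, not taken from the paper: $\eta$ is the step-size, $\mathcal{K}_i$ a connected component of critical points, $E_i$ its energy level, and $\mu_\eta$ the long-run distribution of the iterates.

```latex
\mu_\eta(\text{near } \mathcal{K}_i) \;\approx\; \exp\!\left(-\frac{E_i}{\eta}\right),
\qquad
\frac{\mu_\eta(\text{near } \mathcal{K}_i)}{\mu_\eta(\text{near } \mathcal{K}_j)}
\;\approx\; \exp\!\left(-\frac{E_i - E_j}{\eta}\right).
```

Here $\approx$ is meant only at the level of exponential rates as $\eta \to 0$. Under this reading, a smaller step-size plays the role of a lower temperature and concentrates the iterates more sharply on the minimum-energy component.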

critical point, long-run distribution, sgd, (16 more...)

2406.09241

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
(6 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Vural, Nuri Mert, Erdogdu, Murat A.

Pruning is Optimal for Learning Sparse Features in High-Dimensions

arXiv.org Machine Learning Jun-12-2024

While it is commonly observed in practice that pruning networks to a certain level of sparsity can improve the quality of the features, a theoretical explanation of this phenomenon remains elusive. In this work, we investigate this by demonstrating that a broad class of statistical models can be optimally learned using pruned neural networks trained with gradient descent, in high dimensions. We consider learning both single-index and multi-index models of the form $y = \sigma^*(\boldsymbol{V}^{\top} \boldsymbol{x}) + \epsilon$, where $\sigma^*$ is a degree-$p$ polynomial, and $\boldsymbol{V} \in \mathbb{R}^{d \times r}$, with $r \ll d$, is the matrix containing relevant model directions. We assume that $\boldsymbol{V}$ satisfies a certain $\ell_q$-sparsity condition for matrices and show that pruning neural networks proportional to the sparsity level of $\boldsymbol{V}$ improves their sample complexity compared to unpruned networks. Furthermore, we establish Correlational Statistical Query (CSQ) lower bounds in this setting, which take the sparsity level of $\boldsymbol{V}$ into account. We show that if the sparsity level of $\boldsymbol{V}$ exceeds a certain threshold, training pruned networks with a gradient descent algorithm achieves the sample complexity suggested by the CSQ lower bound. In the same scenario, however, our results imply that basis-independent methods such as models trained via standard gradient descent initialized with rotationally invariant random weights can provably achieve only suboptimal sample complexity.

neural network, nullv, statement follow, (15 more...)

2406.08658

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(6 more...)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)

Weigand, Lukas, Roith, Tim, Burger, Martin

Adversarial flows: A gradient flow characterization of adversarial attacks

arXiv.org Artificial Intelligence Jun-11-2024

A popular method to perform adversarial attacks on neural networks is the so-called fast gradient sign method and its iterative variant. In this paper, we interpret this method as an explicit Euler discretization of a differential inclusion, where we also show convergence of the discretization to the associated gradient flow. To do so, we consider the concept of p-curves of maximal slope in the case $p=\infty$. We prove existence of $\infty$-curves of maximal slope and derive an alternative characterization via differential inclusions. Furthermore, we also consider Wasserstein gradient flows for potential energies, where we show that curves in the Wasserstein space can be characterized by a representing measure on the space of curves in the underlying Banach space, which fulfill the differential inclusion. The application of our theory to the finite-dimensional setting is twofold: on the one hand, we show that a whole class of normalized gradient descent methods (in particular signed gradient descent) converge, up to subsequences, to the flow, when sending the step size to zero. On the other hand, in the distributional setting, we show that the inner optimization task of the adversarial training objective can be characterized via $\infty$-curves of maximal slope on an appropriate optimal transport space.
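The iterative fast gradient sign method mentioned in this abstract can be sketched in a few lines of Python. Each loop iteration below is one explicit Euler step of size `step` along the sign of the loss gradient, followed by a projection onto the l-infinity ball of radius `eps` around the clean input. The function names, the linear toy loss, and all numeric values are assumptions for illustration; a real attack would obtain the gradient from a trained network.

```python
def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)


def iterative_fgsm(x0, loss_grad, step, n_steps, eps):
    """Signed gradient ascent on the loss, kept inside the l-infinity ball
    of radius eps around the clean input x0."""
    x = list(x0)
    for _ in range(n_steps):
        g = loss_grad(x)
        # Explicit Euler step along sign(grad): the attacker increases the loss.
        x = [xi + step * sign(gi) for xi, gi in zip(x, g)]
        # Project back onto the eps-ball around the clean input.
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
    return x


# Toy loss L(x) = <w, x> with a made-up weight vector; its gradient is w.
w = [2.0, -3.0]
clean = [0.0, 0.0]
x_iter = iterative_fgsm(clean, lambda x: w, step=0.25, n_steps=10, eps=1.0)
# One step with step == eps recovers the single-step fast gradient sign method.
x_single = iterative_fgsm(clean, lambda x: w, step=1.0, n_steps=1, eps=1.0)
```

For this linear loss both variants end at the same corner of the eps-ball; for a nonlinear loss the smaller steps of the iterative variant let the sign pattern change along the way.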

cit, maximal slope, movement scheme, (16 more...)

2406.05376

Country:

Europe > Germany > Hamburg (0.04)
Europe > France (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
(5 more...)

Genre: Research Report > New Finding (0.45)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)