Mishkin, Aaron
Exploring the loss landscape of regularized neural networks via convex duality
Kim, Sungyoon, Mishkin, Aaron, Pilanci, Mert
We discuss several aspects of the loss landscape of regularized neural networks: the structure of stationary points, the connectivity of optimal solutions, the existence of paths with nonincreasing loss to arbitrary global optima, and the nonuniqueness of optimal solutions. We do so by casting the problem as an equivalent convex program and considering its dual. Starting from two-layer neural networks with scalar output, we first characterize the solution set of the convex problem using its dual and then characterize all of its stationary points. Using this characterization, we show that the topology of the set of global optima undergoes a phase transition as the width of the network changes, and we construct counterexamples in which the problem admits a continuum of optimal solutions. Finally, we show that the solution-set characterization and connectivity results extend to other architectures, including two-layer vector-valued neural networks and parallel three-layer neural networks.
Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation
Mishkin, Aaron, Pilanci, Mert, Schmidt, Mark
A continuing trend in machine learning is the adoption of powerful prediction models which can exactly fit, or interpolate, their training data (Zhang et al., 2017). Methods such as over-parameterized neural networks (Zhang and Yin, 2013; Belkin et al., 2019a), kernel machines (Belkin et al., 2019b), and boosting (Schapire et al., 1997) have all been shown to achieve zero training loss in practice. This phenomenon is particularly prevalent in modern deep learning, where interpolation is conjectured to be key to both optimization (Liu et al., 2022; Oymak and Soltanolkotabi, 2019) and generalization (Belkin, 2021). Recent experimental and theoretical evidence shows that stochastic gradient descent (SGD) matches the fast convergence rates of deterministic gradient methods up to problem-dependent constants when training interpolating models (Arora et al., 2018; Ma et al., 2018; Zou and Gu, 2019). With additional assumptions, interpolation also implies the strong (Polyak, 1987) and weak (Bassily et al., 2018; Vaswani et al., 2019) growth conditions, which bound the second moment of the stochastic gradients. Under strong/weak growth, variance-reduced algorithms typically exhibit slower convergence than stochastic gradient methods despite using more computation or memory (Defazio and Bottou, 2019; Ma et al., 2018), perhaps because these conditions already imply a form of "automatic variance reduction" (Liu et al., 2022). A combination of interpolation and growth conditions has been used to prove fast convergence rates for SGD with line-search (Vaswani et al., 2019), with the stochastic Polyak step-size (Loizou et al., 2020; Berrada et al., 2020), for mirror descent (D'Orazio et al., 2021), and for model-based methods (Asi and Duchi, 2019).
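For reference, the strong and weak growth conditions mentioned above are usually stated as second-moment bounds of the following form (a sketch in our notation, with constants $\rho$ and $\alpha$; the exact statements in the paper may differ):

```latex
% Growth conditions as second-moment bounds on the stochastic gradients
% (our notation; the exact constants/statements in the paper may differ).
% Here f(w) = E_i[f_i(w)] and f^* = min_w f(w).
\[
  \text{Strong growth:}\quad
    \mathbb{E}_i\!\left[\|\nabla f_i(w)\|^2\right] \le \rho\,\|\nabla f(w)\|^2,
  \qquad
  \text{Weak growth:}\quad
    \mathbb{E}_i\!\left[\|\nabla f_i(w)\|^2\right] \le 2\alpha L\left(f(w) - f^*\right).
\]
```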
Directional Smoothness and Gradient Methods: Convergence and Adaptivity
Mishkin, Aaron, Khaled, Ahmed, Wang, Yuanhao, Defazio, Aaron, Gower, Robert M.
We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization, rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper bounds on the objective. Minimizing these upper bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes.

One way to avoid global smoothness of f is to use local Lipschitz continuity of the gradient ("local smoothness"). Local smoothness uses different Lipschitz constants for different neighbourhoods, thus avoiding global assumptions and obtaining improved rates. However, such analyses typically require the iterates to be bounded, in which case local smoothness reduces to L-smoothness over a compact set (Malitsky & Mishchenko, 2020). Boundedness can be enforced in a variety of ways: Zhang & Hong (2020) break optimization into stages, Patel & Berahas (2022) develop a stopping-time framework, and Lu & Mei (2023) use line-search and a modified update. These approaches either modify the underlying optimization algorithm, require local smoothness oracles (Park et al., 2021), or rely on highly complex arguments.
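To make the key quantity concrete, here is a sketch of a point-wise directional smoothness bound and of the implicit equation defining a strongly adapted step-size (our notation; the precise definitions in the paper may be stated differently):

```latex
% A function D(x, y) is a directional smoothness for the pair (x, y) if
%   f(y) <= f(x) + <grad f(x), y - x> + (D(x, y)/2) ||y - x||^2.
% For convex f, the gradient inequality f(y) <= f(x) + <grad f(y), y - x>
% shows that one valid (point-wise) choice is
\[
  D(x, y) = \frac{2\,\|\nabla f(y) - \nabla f(x)\|}{\|y - x\|}.
\]
% A strongly adapted step-size eta_k is then defined implicitly through the
% iterate that the step itself produces:
\[
  \eta_k = \frac{1}{D\!\left(x_k,\; x_k - \eta_k \nabla f(x_k)\right)}.
\]
```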
Level Set Teleportation: An Optimization Perspective
Mishkin, Aaron, Bietti, Alberto, Gower, Robert M.
We study level set teleportation, an optimization sub-routine which seeks to accelerate gradient methods by maximizing the gradient norm on a level set of the objective function. Since the descent lemma implies that gradient descent (GD) decreases the objective in proportion to the squared norm of the gradient, level set teleportation maximizes this one-step progress guarantee. For convex functions satisfying Hessian stability, we prove that GD with level set teleportation obtains a combined sub-linear/linear convergence rate which is strictly faster than standard GD when the optimality gap is small. This is in sharp contrast to the standard (strongly) convex setting, where we show that level set teleportation neither improves nor worsens convergence rates. To evaluate teleportation in practice, we develop a projected-gradient-type method requiring only Hessian-vector products. We use this method to show that gradient methods with access to a teleportation oracle uniformly outperform their standard versions on a variety of learning problems.
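As an illustration of the sub-routine, the sketch below runs a few ascent steps on the squared gradient norm using only Hessian-vector products, with a first-order correction back to the level set. It is a minimal toy version under our own naming and constants, not the authors' implementation:

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-6):
    """Finite-difference Hessian-vector product: H(w) @ v ~ (g(w + eps*v) - g(w)) / eps."""
    return (grad_fn(w + eps * v) - grad_fn(w)) / eps

def teleport(f, grad_fn, w0, steps=50, lr=1e-2):
    """Approximately maximize ||grad f(w)||^2 on the level set {w : f(w) = f(w0)}.

    A projected-gradient-style sketch: ascend on the squared gradient norm
    (whose gradient is H(w) grad f(w)), then correct back towards the level
    set with a first-order step along grad f(w).
    """
    c, w = f(w0), w0.copy()
    for _ in range(steps):
        g = grad_fn(w)
        w = w + lr * hvp(grad_fn, w, g)      # ascent on 0.5 * ||grad f(w)||^2
        g = grad_fn(w)
        w = w - (f(w) - c) * g / (g @ g)     # first-order correction to f(w) = c
    return w

# Example: teleportation on a simple ill-conditioned quadratic.
A = np.diag([1.0, 10.0])
f = lambda w: 0.5 * w @ A @ w
grad_fn = lambda w: A @ w
w_tel = teleport(f, grad_fn, np.array([1.0, 1.0]))
print(f(w_tel), np.linalg.norm(grad_fn(w_tel)))
```

In this toy run the teleported point keeps (approximately) the same objective value as the starting point but has a larger gradient norm, which is what makes the subsequent GD step more productive.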
A Library of Mirrors: Deep Neural Nets in Low Dimensions are Convex Lasso Models with Reflection Features
Zeger, Emi, Wang, Yifei, Mishkin, Aaron, Ergen, Tolga, Candès, Emmanuel, Pilanci, Mert
We prove that training neural networks on 1-D data is equivalent to solving a convex Lasso problem with a fixed, explicitly defined dictionary matrix of features. The specific dictionary depends on the activation function and depth. We consider 2-layer networks with piecewise linear activations, deep narrow ReLU networks with up to 4 layers, and rectangular and tree networks with sign activation and arbitrary depth. Interestingly, in ReLU networks a fourth layer creates features that represent reflections of the training data about themselves.
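A toy instance of the reduction for a 2-layer ReLU network on 1-D data (a hypothetical sketch: we assume a dictionary of ramp features with breakpoints at the training points plus linear and constant atoms, which is not necessarily the paper's exact dictionary for every architecture it covers):

```python
import numpy as np
from sklearn.linear_model import Lasso

# 1-D training data.
x = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])
y = np.array([0.5, -0.2, 0.1, 0.9, -0.4])

# Hypothetical dictionary for a 2-layer ReLU network on 1-D data: ramp
# features max(+-(x - x_j), 0) with breakpoints at the training points,
# plus a linear and a constant atom (a sketch, not the paper's exact
# construction for every depth/activation it covers).
ramps_pos = np.maximum(x[:, None] - x[None, :], 0.0)
ramps_neg = np.maximum(-(x[:, None] - x[None, :]), 0.0)
D = np.hstack([ramps_pos, ramps_neg, x[:, None], np.ones((len(x), 1))])

# Solving the convex Lasso problem over this fixed dictionary then plays
# the role of (non-convex) network training with weight decay.
model = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100000).fit(D, y)
print("active dictionary atoms:", np.flatnonzero(model.coef_))
```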
Analyzing and Improving Greedy 2-Coordinate Updates for Equality-Constrained Optimization via Steepest Descent in the 1-Norm
Ramesh, Amrutha Varshini, Mishkin, Aaron, Schmidt, Mark, Zhou, Yihan, Lavington, Jonathan Wilder, She, Jennifer
We consider minimizing a smooth function subject to a summation constraint over its variables. By exploiting a connection between the greedy 2-coordinate update for this problem and equality-constrained steepest descent in the 1-norm, we give a convergence rate for greedy selection under a proximal Polyak-Łojasiewicz assumption that is faster than the rate for random selection and independent of the problem dimension $n$. We then consider minimizing with both a summation constraint and bound constraints, as arises in the support vector machine dual problem. Existing greedy rules for this setting either only guarantee trivial progress or require $O(n^2)$ time to compute. We show that bound- and summation-constrained steepest descent in the 1-norm guarantees more progress per iteration than previous rules and can be computed in only $O(n \log n)$ time.
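A minimal sketch of the greedy 2-coordinate update for the summation-constrained problem (the names, step-size rule, and toy objective are ours; the bound-constrained rule analyzed in the paper additionally requires an $O(n \log n)$ sorting step that is omitted here):

```python
import numpy as np

def greedy_two_coordinate(grad_fn, x0, L, iters=100):
    """Greedy 2-coordinate descent for min f(x) s.t. sum(x) = sum(x0).

    A sketch under our assumptions: f is L-smooth, and each update moves
    mass from the coordinate with the largest partial derivative to the
    coordinate with the smallest one, which preserves the summation
    constraint.
    """
    x = x0.copy()
    for _ in range(iters):
        g = grad_fn(x)
        i, j = np.argmax(g), np.argmin(g)    # pair selected by steepest descent in the 1-norm
        step = (g[i] - g[j]) / (2.0 * L)     # minimizes the quadratic upper bound along e_j - e_i
        x[i] -= step
        x[j] += step
    return x

# Example: separable quadratic with a linear term, sum(x) fixed at 0.
grad_fn = lambda x: 2.0 * x + np.array([1.0, -2.0, 0.5, 0.5])
x = greedy_two_coordinate(grad_fn, np.zeros(4), L=2.0)
print(x, x.sum())
```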
Optimal Sets and Solution Paths of ReLU Networks
Mishkin, Aaron, Pilanci, Mert
We develop an analytical framework to characterize the set of optimal ReLU neural networks by reformulating the non-convex training problem as a convex program. We show that the global optima of the convex parameterization are given by a polyhedral set and then extend this characterization to the optimal set of the non-convex training objective. Since all stationary points of the ReLU training problem can be represented as optima of sub-sampled convex programs, our work provides a general expression for all critical points of the non-convex objective. We then leverage our results to provide an optimal pruning algorithm for computing minimal networks, establish conditions for the regularization path of ReLU networks to be continuous, and develop sensitivity results for minimal ReLU networks.
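For reference, the convex program in question is a group-$\ell_1$ (group Lasso) problem over the activation patterns of the training data; the following is a sketch for squared loss in our notation, following the two-layer reformulation of Pilanci and Ergen (2020):

```latex
% Sketch of the convex reformulation of two-layer ReLU training with squared
% loss and weight decay (our notation, following Pilanci & Ergen, 2020).
% D_1, ..., D_P are the diagonal 0/1 matrices enumerating the ReLU activation
% patterns that hyperplanes can induce on the data matrix X.
\[
  \min_{\{v_i,\, w_i\}}\;
    \frac{1}{2}\Big\|\sum_{i=1}^{P} D_i X (v_i - w_i) - y\Big\|_2^2
    + \lambda \sum_{i=1}^{P}\big(\|v_i\|_2 + \|w_i\|_2\big)
  \quad\text{s.t.}\quad
    (2 D_i - I) X v_i \ge 0,\;\; (2 D_i - I) X w_i \ge 0.
\]
```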
To Each Optimizer a Norm, To Each Norm its Generalization
Vaswani, Sharan, Babanezhad, Reza, Gallego, Jose, Mishkin, Aaron, Lacoste-Julien, Simon, Roux, Nicolas Le
We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parameterized and over-parameterized regimes. Since it is difficult to determine whether an optimizer converges to solutions that minimize a known norm, we flip the problem and instead investigate which norm is minimized by an interpolating solution. Using this reasoning, we prove that for over-parameterized linear regression, projections onto linear spans can be used to move between different interpolating solutions. For under-parameterized linear classification, we prove that for any linear classifier separating the data, there exists a family of quadratic norms $\|\cdot\|_P$ such that the classifier's direction is the same as that of the maximum P-margin solution. For linear classification, we argue that analyzing convergence to the standard maximum $\ell_2$-margin is arbitrary and show that minimizing the norm induced by the data results in better generalization. Furthermore, for over-parameterized linear classification, projections onto the data-span enable us to use techniques from the under-parameterized setting. On the empirical side, we propose techniques to bias optimizers towards better-generalizing solutions, improving their test performance. We validate our theoretical results via synthetic experiments, and use the neural tangent kernel to handle non-linear models.
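To fix notation for the norm family (a sketch of the standard definitions in our notation, not a statement lifted from the paper): for a positive-definite matrix $P$ and linearly separable data $(x_i, y_i)$,

```latex
% Quadratic norm induced by a positive-definite matrix P, and the associated
% maximum P-margin classifier for separable data (our notation).
\[
  \|w\|_P = \sqrt{w^{\top} P\, w},
  \qquad
  w_P^{\star} \in \arg\min_{w}\; \|w\|_P^2
  \quad\text{s.t.}\quad y_i\, w^{\top} x_i \ge 1 \;\; \forall i.
\]
```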
Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
Vaswani, Sharan, Mishkin, Aaron, Laradji, Issam, Schmidt, Mark, Gidel, Gauthier, Lacoste-Julien, Simon
Recent works have shown that stochastic gradient descent (SGD) achieves the fast convergence rates of full-batch gradient descent for over-parameterized models satisfying certain interpolation conditions. However, the step-size used in these works depends on unknown quantities, and SGD's practical performance heavily relies on the choice of the step-size. We propose to use line-search methods to automatically set the step-size when training models that can interpolate the data. We prove that SGD with the classic Armijo line-search attains the fast convergence rates of full-batch gradient descent in convex and strongly-convex settings. We also show that, under additional assumptions, SGD with a modified line-search can attain a fast rate of convergence for non-convex functions. Furthermore, we show that a stochastic extra-gradient method with a Lipschitz line-search attains fast convergence rates for an important class of non-convex functions and saddle-point problems satisfying interpolation. We then give heuristics for using larger step-sizes and acceleration with our line-search techniques. We compare the proposed algorithms against numerous optimization methods on standard classification tasks using both kernel methods and deep networks. The proposed methods are robust and result in competitive performance across all models and datasets.
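A minimal sketch of the stochastic Armijo step described above, where the backtracking condition is checked on the same sampled mini-batch used to compute the gradient (constants and names are ours, not those of the released code):

```python
import numpy as np

def sgd_armijo_step(loss_fn, grad_fn, w, eta_max=1.0, c=0.5, beta=0.9):
    """One SGD step with a stochastic Armijo line-search.

    A sketch: loss_fn and grad_fn evaluate the *same* sampled mini-batch,
    and the step-size is backtracked until the Armijo condition
        f_i(w - eta * g) <= f_i(w) - c * eta * ||g||^2
    holds on that mini-batch (constants are ours, not the paper's defaults).
    """
    g = grad_fn(w)
    f0, g_norm_sq = loss_fn(w), g @ g
    eta = eta_max
    while loss_fn(w - eta * g) > f0 - c * eta * g_norm_sq:
        eta *= beta                      # backtrack
        if eta < 1e-8:                   # guard against pathological mini-batches
            break
    return w - eta * g

# Example: one step on a mini-batch least-squares loss.
X, y = np.random.randn(8, 3), np.random.randn(8)
loss_fn = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
grad_fn = lambda w: X.T @ (X @ w - y) / len(y)
w = sgd_armijo_step(loss_fn, grad_fn, np.zeros(3))
print(loss_fn(np.zeros(3)), loss_fn(w))
```

The reason a purely stochastic backtracking condition can work is that, under interpolation, every mini-batch loss shares its minimizers with the full objective, so progress on the sampled loss is a meaningful proxy for progress overall.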