
Collaborating Authors

 Telgarsky, Matus


Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

arXiv.org Machine Learning

In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution -- a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish nonasymptotic bounds on the norm and angular differences between early-stopped GD and the $\ell_2$-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit $\ell_2$-regularization.
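
As a rough, self-contained illustration of this phenomenon (not the paper's exact setting or constants), the NumPy sketch below runs GD on a small well-specified, overparameterized logistic regression problem and prints the iterate norm together with the training and held-out logistic risks; the norm keeps growing while the held-out risk is typically best at an intermediate iteration. The dimensions, ground-truth vector, and stepsize are assumptions chosen for the example.

```python
import numpy as np

# Illustrative sketch only: a well-specified overparameterized logistic model
# (d >> n), plain GD on the empirical logistic risk, no regularization.
rng = np.random.default_rng(0)
n, d = 50, 500
w_star = np.zeros(d)
w_star[:5] = 10.0                      # hypothetical ground truth; ||w_star||^2 / d = 1

def sample(m):
    X = rng.standard_normal((m, d)) / np.sqrt(d)
    p = 1.0 / (1.0 + np.exp(-X @ w_star))      # true conditional probabilities
    y = 2.0 * (rng.random(m) < p) - 1.0        # labels in {-1, +1}
    return X, y

def logistic_risk(w, X, y):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

Xtr, ytr = sample(n)
Xte, yte = sample(5000)

w, eta = np.zeros(d), 1.0
for t in range(1, 20001):
    grad = -(Xtr.T @ (ytr / (1.0 + np.exp(ytr * (Xtr @ w))))) / n
    w -= eta * grad
    if t in (10, 100, 1000, 20000):
        print(f"t={t:6d}  ||w||={np.linalg.norm(w):7.2f}  "
              f"train={logistic_risk(w, Xtr, ytr):.4f}  test={logistic_risk(w, Xte, yte):.4f}")
```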


Spectrum Extraction and Clipping for Implicitly Linear Layers

arXiv.org Artificial Intelligence

Implicitly linear layers are key components of deep learning models and include any layer whose output can be written as an affine function of its input. This affine function might be trivial, as in a dense layer, or non-trivial, as in convolutional layers. These layers inherit appealing properties of linear transformations: not only are they flexible and easy to train, but they also have the same Jacobian as the transformation they represent and a Lipschitz constant equal to its largest singular value. Therefore, controlling the largest singular value of these layers, which equals the largest singular value of their Jacobians, not only contributes to the generalization of the model (Bartlett et al., 2017), but also makes the model more robust to adversarial perturbations (Szegedy et al., 2013; Weng et al., 2018) and prevents the gradients from exploding or vanishing during backpropagation. Although efficient algorithms have been introduced to bound the spectral norm of dense layers (Miyato et al., 2018), computing and bounding the spectral norm efficiently and correctly has remained a challenge for the general family of implicitly linear layers, such as convolutional layers. Convolutional layers are a major class of implicitly linear layers used in many models across various domains. They are compressed forms of linear transformations with an effective rank that depends on the dimensions of the input rather than those of their filters.
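
As a small illustration of the underlying idea (not the paper's algorithm), the sketch below computes the largest singular value of a single-channel convolution under circular (periodic) padding, where the operator is diagonalized by the 2D DFT, and cross-checks it with power iteration using only the layer's matrix-vector products; the filter, sizes, and padding convention are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
k = rng.standard_normal((3, 3))        # hypothetical 3x3 filter
n = 16                                 # n x n input with circular padding

# Zero-pad the filter to the input size; for a circular convolution the
# operator is diagonalized by the 2D DFT, so its singular values are the
# magnitudes of the DFT of the padded filter.
K = np.zeros((n, n))
K[:3, :3] = k
sigma_fft = np.abs(np.fft.fft2(K)).max()

def conv(x):                           # the layer as a linear map (circular convolution)
    return np.real(np.fft.ifft2(np.fft.fft2(K) * np.fft.fft2(x)))

def conv_T(y):                         # its adjoint: multiply by the conjugate spectrum
    return np.real(np.fft.ifft2(np.conj(np.fft.fft2(K)) * np.fft.fft2(y)))

# Power iteration on conv_T(conv(.)) estimates the same largest singular value
# without ever forming the (n^2 x n^2) matrix of the layer.
v = rng.standard_normal((n, n))
for _ in range(100):
    v = conv_T(conv(v))
    v /= np.linalg.norm(v)
sigma_power = np.linalg.norm(conv(v))

print(f"FFT spectral norm:        {sigma_fft:.6f}")
print(f"power-iteration estimate: {sigma_power:.6f}")
```

Multi-channel filters and zero padding change the spectrum and require the more careful treatment that motivates the paper; the sketch only covers the simplest periodic, single-channel case.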


Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

arXiv.org Machine Learning

We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (\eta t) )$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta:= \Theta( T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.
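
The two-point toy problem below (an assumption made for illustration, not the paper's general setting) shows the qualitative behavior: with an aggressive constant stepsize the logistic loss first oscillates, then decays rapidly once the iterate aligns with a good direction.

```python
import numpy as np

# Toy separable dataset, both points labeled +1; large steps overshoot along
# the first coordinate before the second (margin) coordinate takes over.
X = np.array([[1.0, 0.3], [-0.8, 0.3]])
y = np.array([1.0, 1.0])

def loss(w):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def grad(w):
    return -(X.T @ (y / (1.0 + np.exp(y * (X @ w))))) / len(y)

eta, T = 50.0, 40                      # aggressive constant stepsize
w = np.zeros(2)
for t in range(T):
    print(f"t={t:2d}  loss={loss(w):9.4f}")   # non-monotone early on, then rapid decay
    w -= eta * grad(w)
```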


Transformers, parallel computation, and logarithmic depth

arXiv.org Artificial Intelligence

The transformer (Vaswani et al., 2017) has emerged as the dominant neural architecture for many sequential modeling tasks such as machine translation (Radford et al., 2019) and protein folding (Jumper et al., 2021). Reasons for the success of transformers include suitability to modern hardware and training stability: unlike in recurrent models, inference and training can be efficiently parallelized, and training is less vulnerable to vanishing and exploding gradients. However, the advantages of transformers over other neural architectures can be understood more fundamentally via the lens of representation, which regards neural nets as parameterized functions and asks what they can efficiently compute. Many previous theoretical studies of transformers establish (approximation-theoretic and computational) universality properties, but only at large model sizes (Yun et al., 2020; Pérez et al., 2021). These results are not unique to transformers and reveal little about which tasks can be solved in a size-efficient manner.


Representational Strengths and Limitations of Transformers

arXiv.org Machine Learning

Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both positive and negative results on the representation power of attention layers, with a focus on intrinsic complexity parameters such as width, depth, and embedding dimension. On the positive side, we present a sparse averaging task, where both recurrent networks and feedforward networks have complexity scaling polynomially in the input size, whereas transformers scale merely logarithmically in the input size; furthermore, we use the same construction to show the necessity and role of a large embedding dimension in a transformer. On the negative side, we present a triple detection task, where attention layers in turn have complexity scaling linearly in the input size; as this scenario seems rare in practice, we also present natural variants that can be efficiently solved by attention layers. The proof techniques emphasize the value of communication complexity in the analysis of transformers and related models, and the role of sparse averaging as a prototypical attention task, which even finds use in the analysis of triple detection.
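
To make the sparse averaging task concrete, here is one plausible instantiation as a data generator (the paper's exact input encoding and parameters differ; the name sparse_averaging_batch and the sizes n, d, q are assumptions for illustration): each position carries a value vector and a q-subset of positions, and its target is the average of the values over that subset.

```python
import numpy as np

def sparse_averaging_batch(rng, n=64, d=8, q=4):
    """One instance of a q-sparse averaging task (hypothetical formalization).

    Position i carries a value vector v_i in R^d and a q-subset S_i of
    positions; its target is the average of v_j over j in S_i.
    """
    values = rng.standard_normal((n, d))
    subsets = np.stack([rng.choice(n, size=q, replace=False) for _ in range(n)])
    targets = values[subsets].mean(axis=1)        # shape (n, d)
    return values, subsets, targets

values, subsets, targets = sparse_averaging_batch(np.random.default_rng(0))
```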


On Achieving Optimal Adversarial Test Error

arXiv.org Artificial Intelligence

We first elucidate various fundamental properties of optimal adversarial predictors: the structure of optimal adversarial convex predictors in terms of optimal adversarial zero-one predictors, bounds relating the adversarial convex loss to the adversarial zero-one loss, and the fact that continuous predictors can get arbitrarily close to the optimal adversarial error for both convex and zero-one losses. Applying these results along with new Rademacher complexity bounds for adversarial training near initialization, we prove that for general data distributions and perturbation sets, adversarial training on shallow networks with early stopping and an idealized optimal adversary is able to achieve optimal adversarial test error. By contrast, prior theoretical work either considered specialized data distributions or only provided training error guarantees.

Imperceptibly altering the input data in a malicious fashion can dramatically decrease the accuracy of neural networks (Szegedy et al., 2014). To defend against such adversarial attacks, maliciously altered training examples can be incorporated into the training process, encouraging robustness in the final neural network. Several types of attacks used during this adversarial training, such as FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2019), and the C&W attack (Carlini & Wagner, 2016), which are optimization-based procedures that search for damaging perturbations around the inputs, have been shown to help with robustness. While many other defenses have been proposed (Guo et al., 2017; Dhillon et al., 2018; Xie et al., 2017), adversarial training is the standard approach (Athalye et al., 2018).

Despite many advances, a large gap still persists between the accuracies we are able to achieve on non-adversarial and adversarial test sets. For instance, in Madry et al. (2019), a wide ResNet model was able to achieve 95% accuracy on CIFAR-10 with standard training, but only 46% accuracy on CIFAR-10 images with perturbations arising from PGD bounded by 8/255 in each coordinate, even with the benefit of adversarial training.

In this work we seek to better understand the optimal adversarial predictors we are trying to achieve, as well as how adversarial training can help us get there. While several recent works have analyzed properties of optimal adversarial zero-one classifiers (Bhagoji et al., 2019; Pydi & Jog, 2020; Awasthi et al., 2021b), in the present work we build on these analyses to characterize optimal adversarial convex surrogate loss classifiers.
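
For concreteness, the sketch below implements the core PGD loop mentioned above for a linear logistic classifier, with the input gradient written in closed form (a minimal illustration under assumed parameters eps, alpha, and steps; practical implementations attack a network via autodiff and also clip to the valid input range).

```python
import numpy as np

def pgd_linf(x0, y, w, eps=8 / 255, alpha=2 / 255, steps=10, rng=None):
    """L-infinity PGD against a linear logistic classifier f(x) = w @ x (a sketch)."""
    rng = rng or np.random.default_rng(0)
    x = x0 + rng.uniform(-eps, eps, size=x0.shape)     # random start inside the ball
    for _ in range(steps):
        margin = y * (w @ x)
        grad_x = -y * w / (1.0 + np.exp(margin))        # d/dx of log(1 + exp(-y w.x))
        x = x + alpha * np.sign(grad_x)                 # ascend the loss
        x = x0 + np.clip(x - x0, -eps, eps)             # project back onto the eps-ball
    return x

# Adversarial training would then minimize the loss on such perturbed inputs.
w = np.array([1.0, -2.0, 0.5])
x0, y = np.array([0.2, 0.1, -0.3]), 1.0
x_adv = pgd_linf(x0, y, w)
```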


Convex Analysis at Infinity: An Introduction to Astral Space

arXiv.org Artificial Intelligence

Not all convex functions on $\mathbb{R}^n$ have finite minimizers; some can only be minimized by a sequence as it heads to infinity. In this work, we aim to develop a theory for understanding such minimizers at infinity. We study astral space, a compact extension of $\mathbb{R}^n$ to which such points at infinity have been added. Astral space is constructed to be as small as possible while still ensuring that all linear functions can be continuously extended to the new space. Although astral space includes all of $\mathbb{R}^n$, it is not a vector space, nor even a metric space. However, it is sufficiently well-structured to allow useful and meaningful extensions of concepts of convexity, conjugacy, and subdifferentials. We develop these concepts and analyze various properties of convex functions on astral space, including the detailed structure of their minimizers, exact characterizations of continuity, and convergence of descent algorithms.
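
A standard one-dimensional example of the phenomenon described above (chosen here for illustration; it is not taken from the paper's text) is the convex function below, whose infimum is approached only along sequences escaping to infinity, so its minimizer is naturally viewed as a point at infinity in astral space.

```latex
\[
  f(x) = \log\bigl(1 + e^{-x}\bigr), \qquad
  \inf_{x \in \mathbb{R}} f(x) = 0, \qquad
  f(x) > 0 \ \text{for every finite } x,
\]
\[
  f(x_t) \to 0 \quad \text{only along sequences with } x_t \to +\infty.
\]
```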


Stochastic linear optimization never overfits with quadratically-bounded losses on general data

arXiv.org Machine Learning

This work shows that a diverse collection of linear optimization methods, when run on general data, fails to overfit, despite lacking any explicit constraints or regularization: with high probability, their trajectories stay near the curve of optimal constrained solutions over the population distribution. This analysis is powered by an elementary but flexible proof scheme which can handle many settings, summarized as follows. Firstly, the data can be general: unlike in other implicit bias works, it need not satisfy large margin or other structural conditions; moreover, it can arrive sequentially i.i.d., sequentially following a Markov chain, or as a batch, and it can have heavy tails. Secondly, while the main analysis is for mirror descent, rates are also provided for the Temporal-Difference fixed-point method from reinforcement learning; all prior high-probability analyses in these settings required bounded iterates, bounded updates, bounded noise, or some equivalent. Thirdly, the losses are general: for instance, the logistic and squared losses can be handled simultaneously, unlike in other implicit bias works. In all of these settings, not only is low population error guaranteed with high probability, but low sample complexity is also guaranteed so long as there exists any low-complexity near-optimal solution, even if the global problem structure, and in particular the global optima, have high complexity.
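
As a loose illustration of the no-overfitting claim (an assumption-laden sketch, not the paper's mirror descent or Temporal-Difference analysis), the snippet below runs plain online SGD on a noisy, well-specified logistic model with no constraints or regularization, tracking the iterate norm and a Monte Carlo estimate of the population risk.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200
w_star = rng.standard_normal(d)
w_star *= 2.0 / np.linalg.norm(w_star)    # hypothetical population optimum

def sample(m):
    X = rng.standard_normal((m, d))
    p = 1.0 / (1.0 + np.exp(-X @ w_star))
    y = 2.0 * (rng.random(m) < p) - 1.0
    return X, y

Xte, yte = sample(20000)                  # Monte Carlo proxy for the population risk
def risk(w):
    return np.mean(np.log1p(np.exp(-yte * (Xte @ w))))

w, eta = np.zeros(d), 0.01
for t in range(1, 50001):
    x, lab = sample(1)                    # data arrives sequentially, one sample at a time
    g = -lab[0] * x[0] / (1.0 + np.exp(lab[0] * (x[0] @ w)))
    w -= eta * g
    if t in (100, 1000, 10000, 50000):
        print(f"t={t:5d}  ||w_t||={np.linalg.norm(w):5.2f}  population risk ~ {risk(w):.4f}")
```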


Fast Margin Maximization via Dual Acceleration

arXiv.org Machine Learning

We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of $\mathcal{O}(1/\log(t))$ for standard gradient descent, and $\mathcal{O}(1/t)$ for normalized gradient descent. This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual, which yields a simple and intuitive method in the primal. This dual view can also be used to derive a stochastic variant, which performs adaptive non-uniform sampling via the dual variables.
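
For context, the sketch below measures the quantity being maximized, namely the $\ell_2$ margin of the normalized iterate, along plain GD on separable data; this baseline improves only at the slow $\mathcal{O}(1/\log t)$ rate that the paper's dual-accelerated momentum method speeds up to $\widetilde{\mathcal{O}}(1/t^2)$. The accelerated method itself is not reproduced here, and the data and stepsize are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 20
X = rng.standard_normal((n, d))
X[:, 0] = np.abs(X[:, 0]) + 0.5           # hypothetical separable data: e_1 separates it
y = np.ones(n)

def margin(w):
    u = w / np.linalg.norm(w)
    return np.min(y * (X @ u))            # l2 margin of the normalized predictor

w, eta = np.zeros(d), 0.5
for t in range(1, 100001):
    g = -(X.T @ (y / (1.0 + np.exp(y * (X @ w))))) / n
    w -= eta * g
    if t in (10, 100, 1000, 10000, 100000):
        print(f"t={t:6d}  margin(w_t / ||w_t||) = {margin(w):.4f}")
```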


Early-stopped neural networks are consistent

arXiv.org Machine Learning

This work studies the behavior of neural networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and the (optimal) Bayes risk is not necessarily zero. In this setting, it is shown that gradient descent with early stopping achieves population risk arbitrarily close to optimal in terms of not just logistic and misclassification losses, but also in terms of calibration, meaning the sigmoid mapping of its outputs approximates the true underlying conditional distribution arbitrarily finely. Moreover, the necessary iteration, sample, and architectural complexities of this analysis all scale naturally with a certain complexity measure of the true conditional model. Lastly, while it is not shown that early stopping is necessary, it is shown that any univariate classifier satisfying a local interpolation property is necessarily inconsistent.
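
The sketch below is a minimal, assumption-heavy illustration of the two ingredients in this abstract: early stopping chosen with a held-out set, and calibration measured as the gap between the sigmoid of the network output and a known synthetic conditional probability (here $\Pr(y=1\mid x)=\sigma(3x_1)$, a made-up model with nonzero Bayes risk). The paper's analysis uses a theoretical stopping rule and complexity measures rather than this recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_prob(X):                         # hypothetical ground-truth conditional model
    return 1.0 / (1.0 + np.exp(-3.0 * X[:, 0]))

def sample(m):
    X = rng.uniform(-1.0, 1.0, size=(m, 2))
    y = 2.0 * (rng.random(m) < true_prob(X)) - 1.0    # noisy labels: Bayes risk > 0
    return X, y

def forward(X, W, a):                     # one-hidden-layer ReLU network
    H = np.maximum(X @ W.T, 0.0)
    return H, H @ a

Xtr, ytr = sample(500); Xva, yva = sample(500); Xte, _ = sample(5000)
k = 64
W = rng.standard_normal((k, 2))
a = rng.standard_normal(k) / np.sqrt(k)

best = (np.inf, W.copy(), a.copy())
eta = 0.5
for t in range(2000):
    H, f = forward(Xtr, W, a)
    r = -ytr / (1.0 + np.exp(ytr * f))                # per-example d(logistic loss)/df
    grad_a = H.T @ r / len(ytr)
    grad_W = ((r[:, None] * (H > 0)) * a).T @ Xtr / len(ytr)
    a -= eta * grad_a
    W -= eta * grad_W
    val = np.mean(np.log1p(np.exp(-yva * forward(Xva, W, a)[1])))
    if val < best[0]:
        best = (val, W.copy(), a.copy())              # early stopping: keep the best iterate

_, W, a = best
f_te = forward(Xte, W, a)[1]
calib_gap = np.mean(np.abs(1.0 / (1.0 + np.exp(-f_te)) - true_prob(Xte)))
print(f"best validation logistic loss: {best[0]:.4f}")
print(f"mean |sigmoid(f(x)) - P(y=1|x)| on fresh data: {calib_gap:.4f}")
```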