AITopics

2305.17665

Country:

Asia > China > Shanghai > Shanghai (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)

arXiv.org Artificial IntelligenceMay-28-2023

Which Features are Learnt by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression

Xue, Yihao, Joshi, Siddharth, Gan, Eric, Chen, Pin-Yu, Mirzasoleiman, Baharan

Contrastive learning (CL) has emerged as a powerful technique for representation learning, with or without label supervision. However, supervised CL is prone to collapsing representations of subclasses within a class by not capturing all their features, and unsupervised CL may suppress harder class-relevant features by focusing on learning easy class-irrelevant features; both significantly compromise representation quality. Yet, there is no theoretical understanding of \textit{class collapse} or \textit{feature suppression} at \textit{test} time. We provide the first unified theoretically rigorous framework to determine \textit{which} features are learnt by CL. Our analysis indicate that, perhaps surprisingly, bias of (stochastic) gradient descent towards finding simpler solutions is a key factor in collapsing subclass representations and suppressing harder class-relevant features. Moreover, we present increasing embedding dimensionality and improving the quality of data augmentations as two theoretically motivated solutions to {feature suppression}. We also provide the first theoretical explanation for why employing supervised and unsupervised CL together yields higher-quality representations, even when using commonly-used stochastic gradient methods.

artificial intelligence, class collapse, machine learning, (15 more...)

2305.16536

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)

Genre: Research Report > New Finding (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)

Frangella, Zachary, Rathore, Pratik, Zhao, Shipu, Udell, Madeleine

SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates

arXiv.org Artificial IntelligenceMay-28-2023

SketchySGD improves upon existing stochastic gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian and by introducing an automated stepsize that works well across a wide range of convex machine learning problems. We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum. Further, in the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems. We validate this improvement empirically with ridge regression experiments on real data. Numerical experiments on both ridge and logistic regression problems with dense and sparse data, show that SketchySGD equipped with its default hyperparameters can achieve comparable or better results than popular stochastic gradient methods, even when they have been tuned to yield their best performance. In particular, SketchySGD is able to solve an ill-conditioned logistic regression problem with a data matrix that takes more than $840$GB RAM to store, while its competitors, even when tuned, are unable to make any progress. SketchySGD's ability to work out-of-the box with its default hyperparameters and excel on ill-conditioned problems is an advantage over other stochastic gradient methods, most of which require careful hyperparameter tuning (especially of the learning rate) to obtain good performance and degrade in the presence of ill-conditioning.

artificial intelligence, machine learning, sketchysgd, (19 more...)

2211.08597

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

Wang, Guanghui, Hu, Zihao, Muthukumar, Vidya, Abernethy, Jacob

Faster Margin Maximization Rates for Generic Optimization Methods

arXiv.org Artificial IntelligenceMay-27-2023

First-order optimization methods tend to inherently favor certain solutions over others when minimizing a given training objective with multiple local optima. This phenomenon, known as implicit bias, plays a critical role in understanding the generalization capabilities of optimization algorithms. Recent research has revealed that gradient-descent-based methods exhibit an implicit bias for the $\ell_2$-maximal margin classifier in the context of separable binary classification. In contrast, generic optimization methods, such as mirror descent and steepest descent, have been shown to converge to maximal margin classifiers defined by alternative geometries. However, while gradient-descent-based algorithms demonstrate fast implicit bias rates, the implicit bias rates of generic optimization methods have been relatively slow. To address this limitation, in this paper, we present a series of state-of-the-art implicit bias rates for mirror descent and steepest descent algorithms. Our primary technique involves transforming a generic optimization algorithm into an online learning dynamic that solves a regularized bilinear game, providing a unified framework for analyzing the implicit bias of various optimization methods. The accelerated rates are derived leveraging the regret bounds of online learning algorithms within this game framework.

algorithm, artificial intelligence, machine learning, (18 more...)

2305.17544

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Industry:

Education (0.56)
Leisure & Entertainment > Games (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Sakhi, Otmane, Rohde, David, Gilotte, Alexandre

Fast Offline Policy Optimization for Large Scale Recommendation

arXiv.org Artificial IntelligenceMay-27-2023

Personalised interactive systems such as recommender systems require selecting relevant items from massive catalogs dependent on context. Reward-driven offline optimisation of these systems can be achieved by a relaxation of the discrete problem resulting in policy learning or REINFORCE style learning algorithms. Unfortunately, this relaxation step requires computing a sum over the entire catalogue making the complexity of the evaluation of the gradient (and hence each stochastic gradient descent iterations) linear in the catalogue size. This calculation is untenable in many real world examples such as large catalogue recommender systems, severely limiting the usefulness of this method in practice. In this paper, we derive an approximation of these policy learning algorithms that scale logarithmically with the catalogue size. Our contribution is based upon combining three novel ideas: a new Monte Carlo estimate of the gradient of a policy, the self normalised importance sampling estimator and the use of fast maximum inner product search at training time. Extensive experiments show that our algorithm is an order of magnitude faster than naive approaches yet produces equally good policies.

algorithm, artificial intelligence, machine learning, (16 more...)

2208.05327

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

Zhang, Gavin, Chiu, Hong-Ming, Zhang, Richard Y.

Fast and Minimax Optimal Estimation of Low-Rank Matrices via Non-Convex Gradient Descent

We study the problem of estimating a low-rank matrix from noisy measurements, with the specific goal of achieving minimax optimal error. In practice, the problem is commonly solved using non-convex gradient descent, due to its ability to scale to large-scale real-world datasets. In theory, non-convex gradient descent is capable of achieving minimax error. But in practice, it often converges extremely slowly, such that it cannot even deliver estimations of modest accuracy within reasonable time. On the other hand, methods that improve the convergence of non-convex gradient descent, through rescaling or preconditioning, also greatly amplify the measurement noise, resulting in estimations that are orders of magnitude less accurate than what is theoretically achievable with minimax optimal error. In this paper, we propose a slight modification to the usual non-convex gradient descent method that remedies the issue of slow convergence, while provably preserving its minimax optimality. Our proposed algorithm has essentially the same per-iteration cost as non-convex gradient descent, but is guaranteed to converge to minimax error at a linear rate that is immune to ill-conditioning. Using our proposed algorithm, we reconstruct a 60 megapixel dataset for a medical imaging application, and observe significantly decreased reconstruction error compared to previous approaches.

algorithm, artificial intelligence, machine learning, (16 more...)

2305.17224

Country: North America > United States > Illinois > Champaign County > Urbana (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Cutler, Joshua, Drusvyatskiy, Dmitriy, Harchaoui, Zaid

Stochastic Optimization under Distributional Drift

We consider the problem of minimizing a convex function that is evolving according to unknown and possibly stochastic dynamics, which may depend jointly on time and on the decision variable itself. Such problems abound in the machine learning and signal processing literature, under the names of concept drift, stochastic tracking, and performative prediction. We provide novel non-asymptotic convergence guarantees for stochastic algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability. The efficiency estimates we obtain clearly decouple the contributions of optimization error, gradient noise, and time drift. Notably, we identify a low drift-to-noise regime in which the tracking efficiency of the proximal stochastic gradient method benefits significantly from a step decay schedule. Numerical experiments illustrate our results.

artificial intelligence, machine learning, probability, (16 more...)

2108.07356

Country:

North America > United States > Washington > King County > Seattle (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)
Africa > South Sudan > Equatoria > Central Equatoria > Juba (0.04)

Genre: Research Report > New Finding (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Refinetti, Maria, Ingrosso, Alessandro, Goldt, Sebastian

Neural networks trained with SGD learn distributions of increasing complexity

The ability of deep neural networks to generalise well even when they interpolate their training data has been explained using various "simplicity biases". These theories postulate that neural networks avoid overfitting by first learning simple functions, say a linear classifier, before learning more complex, non-linear functions. Meanwhile, data structure is also recognised as a key ingredient for good generalisation, yet its role in simplicity biases is not yet understood. Here, we show that neural networks trained using stochastic gradient descent initially classify their inputs using lower-order input statistics, like mean and covariance, and exploit higher-order statistics only later during training. We first demonstrate this distributional simplicity bias (DSB) in a solvable model of a neural network trained on synthetic data. We empirically demonstrate DSB in a range of deep convolutional networks and visual transformers trained on CIFAR10, and show that it even holds in networks pre-trained on ImageNet. We discuss the relation of DSB to other simplicity biases and consider its implications for the principle of Gaussian universality in learning.

artificial intelligence, machine learning, neural network, (18 more...)

2211.11567

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(14 more...)

Genre: Research Report (0.81)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Equalization in Dispersion-Managed Systems Using Learned Digital Back-Propagation

Abu-Romoh, Mohannad, Costa, Nelson, Jaouën, Yves, Napoli, Antonio, Pedro, João, Spinnler, Bernhard, Yousefi, Mansoor

In this paper, we investigate the use of the learned digital back-propagation (LDBP) for equalizing dual-polarization fiber-optic transmission in dispersion-managed (DM) links. LDBP is a deep neural network that optimizes the parameters of DBP using the stochastic gradient descent. We evaluate DBP and LDBP in a simulated WDM dual-polarization fiber transmission system operating at the bitrate of 256 Gbit/s per channel, with a dispersion map designed for a 2016 km link with 15% residual dispersion. Our results show that in single-channel transmission, LDBP achieves an effective signal-to-noise ratio improvement of 6.3 dB and 2.5 dB, respectively, over linear equalization and DBP. In WDM transmission, the corresponding $Q$-factor gains are 1.1 dB and 0.4 dB, respectively. Additionally, we conduct a complexity analysis, which reveals that a frequency-domain implementation of LDBP and DBP is more favorable in terms of complexity than the time-domain implementation. These findings demonstrate the effectiveness of LDBP in mitigating the nonlinear effects in DM fiber-optic transmission systems.

artificial intelligence, ldbp, machine learning, (20 more...)

2307.06821

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > France (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Industry: Energy (0.55)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Modulate Your Spectrum in Self-Supervised Learning

Weng, Xi, Ni, Yunhao, Song, Tengwei, Luo, Jie, Anwer, Rao Muhammad, Khan, Salman, Khan, Fahad Shahbaz, Huang, Lei

Whitening loss provides theoretical guarantee in avoiding feature collapse for self-supervised learning (SSL) using joint embedding architectures. One typical implementation of whitening loss is hard whitening that designs whitening transformation over embedding and imposes the loss on the whitened output. In this paper, we propose spectral transformation (ST) framework to map the spectrum of embedding to a desired distribution during forward pass, and to modulate the spectrum of embedding by implicit gradient update during backward pass. We show that whitening transformation is a special instance of ST by definition, and there exist other instances that can avoid collapse by our empirical investigation. Furthermore, we propose a new instance of ST, called IterNorm with trace loss (INTL). We theoretically prove that INTL can avoid collapse and modulate the spectrum of embedding towards an equal-eigenvalue distribution during the course of optimization. Moreover, INTL achieves 76.6% top-1 accuracy in linear evaluation on ImageNet using ResNet-50, which exceeds the performance of the supervised baseline, and this result is obtained by using a batch size of only 256. Comprehensive experiments show that INTL is a promising SSL method in practice. The code is available at https://github.com/winci-ai/intl.

artificial intelligence, covariance matrix, machine learning, (16 more...)

2305.16789

Country:

North America > United States (0.04)
Asia > Middle East > UAE (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)