Loshchilov, Ilya
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
Loshchilov, Ilya, Hsieh, Cheng-Ping, Sun, Simeng, Ginsburg, Boris
We propose a novel neural network architecture, the normalized Transformer (nGPT), with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices, and hidden states are normalized to unit norm. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to reach the same accuracy by a factor of 4 to 20, depending on the sequence length.
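A minimal sketch of the core mechanism described above, under stated assumptions rather than the paper's exact implementation: hidden states stay on the unit hypersphere, each sub-block proposes a point on the sphere, and the state moves a fraction of the way toward it before being re-projected. The helper names (`l2_normalize`, `ngpt_block_update`), the scalar step size `alpha`, and the random linear "block" are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere along the last axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def ngpt_block_update(h, block_fn, alpha):
    """One hypersphere update step (illustrative sketch, not the paper's exact rule).

    h        : (seq_len, d) hidden states, assumed unit-norm.
    block_fn : an attention or MLP sub-block mapping (seq_len, d) -> (seq_len, d).
    alpha    : step size controlling the displacement toward the block's suggestion.
    """
    h_block = l2_normalize(block_fn(h))   # block's suggestion, projected onto the sphere
    h = h + alpha * (h_block - h)         # displacement toward the suggestion
    return l2_normalize(h)                # re-project the result onto the unit hypersphere

# Toy usage with a random linear "block" (assumption for illustration only)
d = 64
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)) / np.sqrt(d)
h = l2_normalize(rng.standard_normal((8, d)))
h = ngpt_block_update(h, lambda x: x @ W, alpha=0.05)
assert np.allclose(np.linalg.norm(h, axis=-1), 1.0)
```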
Weight Norm Control
Loshchilov, Ilya
Loshchilov and Hutter introduced the AdamW algorithm, in which Adam's loss-based update is decoupled from weight decay. We note that decoupled weight decay regularization is a particular case of weight norm control in which the target norm of the weights is set to 0. Any optimization method (e.g., Adam) that uses decoupled weight decay regularization (respectively, AdamW) can be viewed as a particular case of a more general algorithm with weight norm control (respectively, AdamWN). We argue that setting the target norm of the weights to 0 can be suboptimal and that other target norm values are worth considering. For instance, any training run in which AdamW reaches a particular weight norm can be challenged by AdamWN scheduled to reach a comparable weight norm. We discuss various implications of introducing weight norm control instead of weight decay.
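A hedged sketch of the generalization described above, assuming a simple formulation rather than the paper's exact AdamWN update: decoupled weight decay, w <- w - lr*wd*w, pulls the weight norm toward 0; the norm-control variant below instead pulls the weights toward the nearest point whose norm equals a target value. The function name, `target_norm`, and the treatment of the loss-based update as a precomputed direction are assumptions for illustration.

```python
import numpy as np

def decoupled_norm_control_step(w, grad_update, lr, wd, target_norm=0.0, eps=1e-12):
    """One optimizer step with decoupled weight norm control (illustrative sketch).

    w           : flattened parameter vector.
    grad_update : the loss-based update direction (e.g., Adam's adaptive step).
    lr, wd      : learning rate and norm-control coefficient.
    target_norm : desired weight norm; 0 recovers ordinary decoupled weight decay (AdamW).
    """
    w = w - lr * grad_update                          # loss-based (Adam) step
    w_norm = np.linalg.norm(w) + eps
    # Decoupled norm-control step: reduces to lr*wd*w when target_norm == 0.
    w = w - lr * wd * (w - target_norm * w / w_norm)
    return w
```

With target_norm=0 the second step is exactly decoupled weight decay; a nonzero target instead nudges the weight norm toward that value on every step.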
KL-based Control of the Learning Schedule for Surrogate Black-Box Optimization
Loshchilov, Ilya, Schoenauer, Marc, Sebag, Michèle
This paper investigates the control of an ML component within the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) devoted to black-box optimization. The known weakness of CMA-ES is its sample complexity, i.e., the number of evaluations of the objective function needed to approximate the global optimum. This weakness is commonly addressed through surrogate optimization: learning an estimate of the objective function (a.k.a. a surrogate model) and replacing most evaluations of the true objective function with the (inexpensive) evaluation of the surrogate model. This paper presents a principled control of the learning schedule (when to relearn the surrogate model), based on the Kullback-Leibler divergence between the current search distribution and the training distribution of the former surrogate model. The experimental validation of the proposed approach shows significant performance gains on a comprehensive set of ill-conditioned benchmark problems, compared to the best state of the art, including the quasi-Newton high-precision BFGS method.
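Since the CMA-ES search distribution is a multivariate Gaussian, the KL divergence between the current distribution and the one under which the surrogate was trained has a closed form. The sketch below computes it and triggers relearning when it exceeds a threshold; the function names and the threshold value are illustrative assumptions, not the paper's exact control rule.

```python
import numpy as np

def kl_gaussian(mu0, cov0, mu1, cov1):
    """KL(N(mu0, cov0) || N(mu1, cov1)) for multivariate Gaussians (closed form)."""
    k = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def should_relearn_surrogate(mu_now, cov_now, mu_train, cov_train, threshold=1.0):
    """Relearn the surrogate once the current CMA-ES search distribution has
    drifted too far (in KL) from the distribution the surrogate was trained on.
    The threshold value here is an assumption for illustration."""
    return kl_gaussian(mu_now, cov_now, mu_train, cov_train) > threshold
```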