
Collaborating Authors

 bottou


These Are Not All the Features You Are Looking For: A Fundamental Bottleneck in Supervised Pretraining

Yang, Xingyu Alice, Zhang, Jianyu, Bottou, Léon

arXiv.org Machine Learning

Transfer learning is a cornerstone of modern machine learning, promising a way to adapt models pretrained on a broad mix of data to new tasks with minimal new data. However, a significant challenge remains in ensuring that transferred features are sufficient to handle unseen datasets, amplified by the difficulty of quantifying whether two tasks are "related". To address these challenges, we evaluate model transfer from a pretraining mixture to each of its component tasks, assessing whether pretrained features can match the performance of task-specific direct training. We identify a fundamental limitation in deep learning models -- an "information saturation bottleneck" -- where networks fail to learn new features once they encode similar competing features during training. When restricted to learning only a subset of key features during pretraining, models will permanently lose critical features for transfer and perform inconsistently on data distributions, even components of the training mixture. Empirical evidence from published studies suggests that this phenomenon is pervasive in deep learning architectures -- factors such as data distribution or ordering affect the features that current representation learning methods can learn over time. This study suggests that relying solely on large-scale networks may not be as effective as focusing on task-specific training, when available. We propose richer feature representations as a potential solution to better generalize across new datasets and, specifically, present existing methods alongside a novel approach, the initial steps towards addressing this challenge.


Fine-tuning with Very Large Dropout

Zhang, Jianyu, Bottou, Léon

arXiv.org Artificial Intelligence

It is impossible today to pretend that the practice of machine learning is compatible with the idea that training and testing data follow the same distribution. Several authors have recently used ensemble techniques to show how scenarios involving multiple data distributions are best served by representations that are both richer than those obtained by regularizing for the best in-distribution performance, and richer than those obtained under the influence of the implicit sparsity bias of common stochastic gradient procedures. This contribution investigates the use of very high dropout rates instead of ensembles to obtain such rich representations. Although training a deep network from scratch using such dropout rates is virtually impossible, fine-tuning a large pre-trained model under such conditions is not only possible but also achieves out-of-distribution performances that exceed those of both ensembles and weight averaging methods such as model soups. This result has practical significance because the importance of the fine-tuning scenario has considerably grown in recent years. This result also provides interesting insights on the nature of rich representations and on the intrinsically linear nature of fine-tuning a large network using a comparatively small dataset.


Perceptrons, Reissue of the 1988 Expanded Edition with a new foreword by Léon Bottou: An Introduction to Computational Geometry (The MIT Press)

Minsky, Marvin, Papert, Seymour A., Bottou, Léon

#artificialintelligence

Reissue of the 1988 expanded edition of Perceptrons: An Introduction to Computational Geometry, by Marvin Minsky and Seymour A. Papert, with a new foreword by Léon Bottou (The MIT Press, ISBN 9780262534772).


The Paradigm Shift of Self-Supervised Learning

#artificialintelligence

"If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don't know how to make the cake." By 2016, Yann LeCun began to hedge his use of the term "unsupervised learning". At NIPS 2016, he started to call it, in even more nebulous terms, "predictive learning": I have always had trouble with the use of the term "Unsupervised Learning". In 2017, I predicted that Unsupervised Learning would not progress much and said "there seems to be a massive conceptual disconnect as to how exactly it should work" and that it was the "dark matter" of machine learning.


From GAN to WGAN

@machinelearnbot

This post explains the maths behind a generative adversarial network (GAN) model and why it is hard to train. Wasserstein GAN is intended to improve GAN training by adopting a smooth metric for measuring the distance between two probability distributions. The generative adversarial network (GAN) has shown great results in many generative tasks that replicate rich real-world content such as images, human language, and music. It is inspired by game theory: two models, a generator and a critic, compete with each other while making each other stronger at the same time. However, training a GAN model is rather challenging, as people face issues such as training instability or failure to converge. Here I would like to explain the maths behind the generative adversarial network framework, why it is hard to train, and finally introduce a modified version of GAN intended to solve the training difficulties.
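As a rough illustration of the distance mentioned above, here is a minimal sketch of the Wasserstein-1 (earth mover's) distance between two one-dimensional empirical samples of equal size. The function name `wasserstein_1d` is hypothetical, and this closed form holds only in the 1-D, equal-size case: for such samples the optimal transport plan pairs the sorted values. A WGAN does not compute the distance this way; it estimates it with a trained critic, so this is only meant to show what the quantity measures.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    Assumption: both samples have the same number of points and each
    point carries equal mass, so matching sorted values is optimal.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal-size samples")
    xs_sorted = sorted(xs)
    ys_sorted = sorted(ys)
    # Average amount each unit of mass must be moved.
    return sum(abs(a - b) for a, b in zip(xs_sorted, ys_sorted)) / len(xs)
```

Shifting a sample by a constant changes this distance by exactly that constant, even when the two samples do not overlap, which is the kind of smooth behaviour the entry refers to.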


Facebook's Quest to Build an Artificial Brain Depends on This Guy

AITopics Original Links

Mark Zuckerberg recently handpicked the longtime NYU professor to run Facebook's new artificial intelligence lab. The IEEE Computational Intelligence Society just gave him its prestigious Neural Network Pioneer Award, in honor of his work on deep learning, a form of artificial intelligence meant to more closely mimic the human brain. And, perhaps most of all, deep learning has suddenly spread across the commercial tech world, from Google to Microsoft to Baidu to Twitter, just a few years after most AI researchers openly scoffed at it. All of these tech companies are now exploring a particular type of deep learning called convolutional neural networks, aiming to build web services that can do things like automatically understand natural language and recognize images. At China's Baidu, they drive a new visual search engine.


Counterfactual Reasoning and Learning Systems

Bottou, Léon, Peters, Jonas, Quiñonero-Candela, Joaquin, Charles, Denis X., Chickering, D. Max, Portugaly, Elon, Ray, Dipankar, Simard, Patrice, Snelson, Ed

arXiv.org Artificial Intelligence

This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve both the short-term and long-term performance of such systems. This work is illustrated by experiments carried out on the ad placement system associated with the Bing search engine.


Convergence Properties of the K-Means Algorithms

Bottou, Léon, Bengio, Yoshua

Neural Information Processing Systems

K-Means is a popular clustering algorithm used in many applications, including the initialization of more computationally expensive algorithms (Gaussian mixtures, Radial Basis Functions, Learning Vector Quantization and some Hidden Markov Models). The practice of this initialization procedure often gives the frustrating feeling that K-Means performs most of the task in a small fraction of the overall time. This motivated us to better understand this convergence speed. A second reason lies in the traditional debate between hard threshold (e.g.
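For readers unfamiliar with the algorithm, the following is a minimal sketch of the standard batch K-Means iteration (alternating assignment and centroid update). It is not necessarily the formulation analysed in the paper; the function name `kmeans`, its parameters, and the random initialization from data points are assumptions made for illustration.

```python
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Batch K-Means on a list of equal-length numeric tuples.

    Assumptions: squared Euclidean distance, centroids initialized by
    sampling k distinct data points, stop when centroids stop changing.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

Calling `kmeans(data, 2)` on a list of 2-D tuples returns two centroid tuples; on well-separated data the loop typically stops after only a few passes.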

