Redko, Ievgen
Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification
Feofanov, Vasilii, Wen, Songkang, Alonso, Marius, Ilbert, Romain, Guo, Hongbo, Tiomoko, Malik, Pan, Lujia, Zhang, Jianfeng, Redko, Ievgen
In recent years, there has been increasing interest in developing foundation models for time series data that can generalize across diverse downstream tasks. While numerous forecasting-oriented foundation models have been introduced, there is a notable scarcity of models tailored for time series classification. To address this gap, we present Mantis, a new open-source foundation model for time series classification, based on the Vision Transformer (ViT) architecture and pre-trained using a contrastive learning approach. Our experimental results show that Mantis outperforms existing foundation models both when the backbone is frozen and when fine-tuned, while achieving the lowest calibration error. In addition, we propose several adapters to handle the multivariate setting, reducing memory requirements and modeling channel interdependence.
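Since the abstract gives no implementation details, the following is a minimal, hypothetical sketch of contrastive pre-training for a time series encoder in the spirit described above; the augmentation, the stand-in MLP encoder, and the temperature are illustrative assumptions, not the actual Mantis recipe (which uses a ViT-style backbone).

```python
# Hypothetical sketch of contrastive pre-training on time series.
import torch
import torch.nn.functional as F

def augment(x: torch.Tensor) -> torch.Tensor:
    """Toy augmentation: random scaling plus Gaussian jitter."""
    scale = 1.0 + 0.1 * torch.randn(x.size(0), 1, 1, device=x.device)
    return x * scale + 0.05 * torch.randn_like(x)

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE loss: the two views of the same series are positives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                       # (B, B) similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

# Stand-in encoder; Mantis itself uses a ViT-style backbone.
encoder = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(512, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 128),
)
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

batch = torch.randn(32, 1, 512)                      # (batch, channel, length)
loss = info_nce(encoder(augment(batch)), encoder(augment(batch)))
loss.backward()
opt.step()
```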
Zero-shot Model-based Reinforcement Learning using Large Language Models
Benechehab, Abdelhakim, Hili, Youssef Attia El, Odonnat, Ambroise, Zekri, Oussama, Thomas, Albert, Paolo, Giuseppe, Filippone, Maurizio, Redko, Ievgen, Kégl, Balázs
The emerging zero-shot capabilities of Large Language Models (LLMs) have led to their applications in areas extending well beyond natural language processing tasks. In reinforcement learning, while LLMs have been extensively used in text-based environments, their integration with continuous state spaces remains understudied. In this paper, we investigate how pre-trained LLMs can be leveraged to predict in context the dynamics of continuous Markov decision processes. We identify handling multivariate data and incorporating the control signal as key challenges that limit the potential of LLMs' deployment in this setup and propose Disentangled In-Context Learning (DICL) to address them. We present proof-of-concept applications in two reinforcement learning settings: model-based policy evaluation and data-augmented off-policy reinforcement learning, supported by theoretical analysis of the proposed methods. Our experiments further demonstrate that our approach produces well-calibrated uncertainty estimates. We release the code at https://github.com/abenechehab/dicl.
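As a rough illustration of the disentangling idea, the sketch below decorrelates the state dimensions with PCA and forecasts each component independently; a linear extrapolation stands in for the LLM's in-context prediction, and the handling of the control signal is omitted. All names here are hypothetical.

```python
# Hedged sketch of the disentangling step: decorrelate state dimensions,
# forecast each component separately, then map back to the state space.
import numpy as np

def dicl_step(trajectory: np.ndarray) -> np.ndarray:
    """trajectory: (T, d) past states; returns a one-step-ahead prediction."""
    mean = trajectory.mean(axis=0)
    X = trajectory - mean
    # PCA via SVD; rows of Vt are the disentangled directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt.T                       # (T, d) decorrelated components
    # Stand-in univariate forecaster per component (the paper instead feeds
    # each component's history to a pre-trained LLM as an in-context prompt).
    z_next = 2 * Z[-1] - Z[-2]         # linear extrapolation, per dimension
    return z_next @ Vt + mean          # back to the original state space

traj = np.cumsum(np.random.randn(100, 4), axis=0)   # toy 4-d trajectory
print(dicl_step(traj))
```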
Large Language Models as Markov Chains
Zekri, Oussama, Odonnat, Ambroise, Benechehab, Abdelhakim, Bleistein, Linus, Boullé, Nicolas, Redko, Ievgen
Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with vocabulary of size $T$ and context window of size $K$ and Markov chains defined on a finite state space of size $\mathcal{O}(T^K)$. We derive several surprising findings related to the existence of a stationary distribution of Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
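A toy numerical check of the stated equivalence, offered as an illustration only: a random next-token table plays the role of the LLM, the $T^K$ contexts are the Markov states, and we verify that the induced chain admits a stationary distribution.

```python
# An autoregressive model with vocabulary size T and context window K
# induces a Markov chain on the T**K possible contexts. We build its
# transition matrix for tiny T, K and check the stationary distribution.
import itertools
import numpy as np

T, K = 3, 2
states = list(itertools.product(range(T), repeat=K))      # T**K contexts
rng = np.random.default_rng(0)
next_token = rng.dirichlet(np.ones(T), size=len(states))  # p(token | context)

P = np.zeros((len(states), len(states)))
index = {s: i for i, s in enumerate(states)}
for i, s in enumerate(states):
    for t in range(T):
        s_next = s[1:] + (t,)          # slide the context window by one token
        P[i, index[s_next]] += next_token[i, t]

# Stationary distribution = left eigenvector of P with eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()
print(np.allclose(pi @ P, pi))         # True: pi is stationary
```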
Analysing Multi-Task Regression via Random Matrix Theory with Application to Time Series Forecasting
Ilbert, Romain, Tiomoko, Malik, Louart, Cosme, Odonnat, Ambroise, Feofanov, Vasilii, Palpanas, Themis, Redko, Ievgen
In this paper, we introduce a novel theoretical framework for multi-task regression, applying random matrix theory to provide precise performance estimates under high-dimensional, non-Gaussian data distributions. We formulate a multi-task optimization problem as a regularization technique that enables single-task models to leverage multi-task learning information. We derive a closed-form solution for multi-task optimization in the context of linear models. Our analysis provides valuable insights by linking the multi-task learning performance to various model statistics such as raw data covariances, signal-generating hyperplanes, noise levels, and the size and number of datasets. Finally, we propose a consistent estimation of training and testing errors, thereby offering a robust foundation for hyperparameter optimization in multi-task regression scenarios. Experimental validation on both synthetic and real-world datasets, in regression and multivariate time series forecasting, demonstrates improvements over univariate models obtained by incorporating our method into the training loss, thus leveraging multivariate information.
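For concreteness, here is a hedged sketch of one common closed-form multi-task regularization scheme (a shared weight vector plus task-specific offsets); it illustrates how a single-task linear model can leverage multi-task information, but it is not necessarily the exact objective analysed in the paper.

```python
# Closed-form multi-task ridge with w_t = w0 + v_t:
#   min sum_t ||y_t - X_t (w0 + v_t)||^2
#       + lam_shared ||w0||^2 + lam_task sum_t ||v_t||^2
import numpy as np

def multitask_ridge(Xs, ys, lam_shared=1.0, lam_task=10.0):
    T_tasks, d = len(Xs), Xs[0].shape[1]
    # Augmented design; parameter layout: [w0 | v_1 | ... | v_T].
    rows = []
    for t, X in enumerate(Xs):
        blocks = [X] + [X if s == t else np.zeros_like(X) for s in range(T_tasks)]
        rows.append(np.hstack(blocks))
    A = np.vstack(rows)
    y = np.concatenate(ys)
    reg = np.concatenate([lam_shared * np.ones(d),
                          lam_task * np.ones(T_tasks * d)])
    theta = np.linalg.solve(A.T @ A + np.diag(reg), A.T @ y)
    return theta[:d], theta[d:].reshape(T_tasks, d)   # w0, task offsets

rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
Xs = [rng.normal(size=(50, 5)) for _ in range(3)]
ys = [X @ (w_true + 0.1 * rng.normal(size=5)) for X in Xs]
w0, vs = multitask_ridge(Xs, ys)
print(np.linalg.norm(w0 - w_true))     # w0 recovers the shared signal
```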
Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention
Ilbert, Romain, Odonnat, Ambroise, Feofanov, Vasilii, Virmaux, Aladin, Paolo, Giuseppe, Palpanas, Themis, Redko, Ievgen
Transformer-based architectures have achieved breakthrough performance in natural language processing and computer vision, yet they remain inferior to simpler linear baselines in multivariate long-term forecasting. To better understand this phenomenon, we start by studying a toy linear forecasting problem for which we show that transformers are incapable of converging to their true solution despite their high expressive power. We further identify the attention mechanism of transformers as responsible for this low generalization capacity. Building upon this insight, we propose SAMformer, a shallow lightweight transformer model that successfully escapes bad local minima when optimized with sharpness-aware minimization. We empirically demonstrate that this result extends to all commonly used real-world multivariate time series datasets. In particular, SAMformer surpasses the current state-of-the-art model TSMixer by 14.33% on average, while having ~4 times fewer parameters. The code is available at https://github.com/romilbert/samformer.
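The sharpness-aware minimization (SAM) step at the core of this approach can be sketched as follows; the model, loss, and batch below are placeholders, and SAMformer's channel-wise attention architecture is not reproduced here.

```python
# SAM step: ascend to the worst-case point within an L2 ball of radius rho,
# then apply the base optimizer using the gradient at that perturbed point.
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    x, y = batch
    # 1) Gradient at the current weights.
    loss_fn(model(x), y).backward()
    grads = [p.grad.clone() for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    # 2) Perturb weights in the ascent direction.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / norm)
    # 3) Gradient at the perturbed point.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    # 4) Undo the perturbation, then step with the perturbed-point gradient.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / norm)
    base_opt.step()
    model.zero_grad()

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
batch = (torch.randn(8, 16), torch.randn(8, 1))
sam_step(model, torch.nn.functional.mse_loss, batch, opt)
```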
Characterising Gradients for Unsupervised Accuracy Estimation under Distribution Shift
Xie, Renchunzi, Odonnat, Ambroise, Feofanov, Vasilii, Redko, Ievgen, Zhang, Jianfeng, An, Bo
Estimating test accuracy without access to ground-truth test labels under varying test environments is a challenging, yet extremely important, problem for the safe deployment of machine learning algorithms. Existing works rely on information from either the outputs or the extracted features of neural networks to formulate an estimation score correlated with the ground-truth test accuracy. In this paper, we investigate, both empirically and theoretically, how the information provided by the gradients can be predictive of the ground-truth test accuracy even under a distribution shift. Specifically, we use the norm of classification-layer gradients, backpropagated from the cross-entropy loss after only one gradient step over test data. Our key idea is that the model should be adjusted with a higher magnitude of gradients when it does not generalize to a test dataset with a distribution shift. We provide theoretical insights highlighting the main ingredients of such an approach that ensure its empirical success. Extensive experiments conducted on diverse distribution shifts and model structures demonstrate that our method significantly outperforms state-of-the-art algorithms.
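A minimal sketch of such a gradient-norm score, assuming the model's own argmax predictions stand in for the unavailable test labels (the paper details its exact instantiation); feature_extractor and head are hypothetical placeholders.

```python
# Norm of the classification layer's gradient from one backward pass of the
# cross-entropy on unlabeled test data: a larger norm signals a larger needed
# adjustment and, empirically, lower test accuracy.
import torch
import torch.nn.functional as F

def gradient_norm_score(feature_extractor, classifier_head, x_test):
    logits = classifier_head(feature_extractor(x_test))
    pseudo = logits.argmax(dim=1).detach()      # model's own hard predictions
    loss = F.cross_entropy(logits, pseudo)
    grads = torch.autograd.grad(loss, classifier_head.parameters())
    return torch.sqrt(sum((g ** 2).sum() for g in grads)).item()

# Toy usage with placeholder components.
feature_extractor = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU())
head = torch.nn.Linear(64, 10)
score = gradient_norm_score(feature_extractor, head, torch.randn(128, 32))
print(score)   # compare across test sets: higher norm ~ lower accuracy
```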
Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias
Odonnat, Ambroise, Feofanov, Vasilii, Redko, Ievgen
Self-training is a well-known approach for semi-supervised learning. It consists of iteratively assigning pseudo-labels to unlabeled data on which the model is confident and treating them as labeled examples. For neural networks, softmax prediction probabilities are often used as a confidence measure, despite the fact that they are known to be overconfident, even for wrong predictions. This phenomenon is particularly intensified in the presence of sample selection bias, i.e., when data labeling is subject to some constraints. To address this issue, we propose a novel confidence measure, called $\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of linear classifiers. We provide a theoretical analysis of our approach by studying stationary points and describing the relationship between the diversity of the individual members and their performance. We empirically demonstrate the benefit of our confidence measure for three different pseudo-labeling policies on classification datasets of various data modalities.
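One plausible reading of an agreement-based confidence in this spirit is sketched below: the average pairwise dot product between the softmax outputs of an ensemble of linear heads. The diversity-promoting training of the heads is omitted, so this is an illustration rather than the paper's exact $\mathcal{T}$-similarity.

```python
# Agreement of an ensemble of linear heads as a confidence measure:
# high average pairwise agreement -> confident, usable for pseudo-labeling.
import torch
import torch.nn.functional as F

def t_similarity(heads, features):
    """heads: list of nn.Linear; features: (B, d). Returns (B,) confidences."""
    probs = [F.softmax(h(features), dim=1) for h in heads]   # M x (B, C)
    M = len(probs)
    sim = torch.zeros(features.size(0), device=features.device)
    for i in range(M):
        for j in range(M):
            if i != j:
                sim += (probs[i] * probs[j]).sum(dim=1)      # pairwise dot
    return sim / (M * (M - 1))

heads = [torch.nn.Linear(64, 10) for _ in range(5)]
feats = torch.randn(32, 64)
conf = t_similarity(heads, feats)      # threshold this to select pseudo-labels
print(conf.shape)                      # torch.Size([32])
```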
Understanding deep neural networks through the lens of their non-linearity
Bouniot, Quentin, Redko, Ievgen, Mallasto, Anton, Laclau, Charlotte, Arndt, Karol, Struckmeier, Oliver, Heinonen, Markus, Kyrki, Ville, Kaski, Samuel
The remarkable success of deep neural networks (DNNs) is often attributed to their high expressive power and their ability to approximate functions of arbitrary complexity. Indeed, DNNs are highly non-linear models, and the activation functions introduced into them are largely responsible for this. While many works have studied the expressive power of DNNs through the lens of their approximation capabilities, quantifying the non-linearity of DNNs or of individual activation functions remains an open problem. In this paper, we propose the first theoretically sound solution to track non-linearity propagation in deep neural networks, with a specific focus on computer vision applications. Our proposed affinity score allows us to gain insights into the inner workings of a wide range of different architectures and learning paradigms. We provide extensive experimental results that highlight the practical utility of the proposed affinity score and its potential for far-reaching applications.
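The affinity score itself is defined in the paper; as a loudly hypothetical stand-in that conveys the idea of scoring per-layer non-linearity, the sketch below measures how much of a layer's output is explained by the best affine map of its input (higher means more linear).

```python
# Hypothetical proxy for per-layer non-linearity, not the paper's metric:
# fraction of output variance captured by the best affine map of the input.
import numpy as np

def affine_fit_score(X_in: np.ndarray, X_out: np.ndarray) -> float:
    A = np.hstack([X_in, np.ones((X_in.shape[0], 1))])   # affine design matrix
    coef, *_ = np.linalg.lstsq(A, X_out, rcond=None)
    residual = X_out - A @ coef
    total = X_out - X_out.mean(axis=0)
    return 1.0 - (residual ** 2).sum() / (total ** 2).sum()

X = np.random.randn(1000, 8)
print(affine_fit_score(X, X @ np.random.randn(8, 8)))   # ~1.0: linear layer
print(affine_fit_score(X, np.maximum(X, 0)))            # < 1.0: ReLU
```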
Learning representations that are closed-form Monge mapping optimal with application to domain adaptation
Struckmeier, Oliver, Redko, Ievgen, Mallasto, Anton, Arndt, Karol, Heinonen, Markus, Kyrki, Ville
Optimal transport (OT) is a powerful geometric tool used to compare and align probability measures following the least-effort principle. Despite its widespread use in machine learning (ML), the OT problem still bears a significant computational burden, while at the same time suffering from the curse of dimensionality for measures supported on general high-dimensional spaces. In this paper, we propose to tackle these challenges using representation learning. In particular, we seek to learn an embedding space in which the samples of the two input measures become alignable with a simple affine mapping that can be calculated efficiently in closed form. We then show that such an approach leads to results comparable to solving the original OT problem when applied to the transfer learning task on which many OT baselines were previously evaluated, in both homogeneous and heterogeneous domain adaptation (DA) settings. The code for our contribution is available at \url{https://github.com/Oleffa/LaOT}.
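The closed-form affine map referred to above is, under Gaussian assumptions, the classical Monge map $T(x) = m_2 + A(x - m_1)$ with $A = \Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}$; the sketch below fits it from samples, and in the paper's setting it would be applied to the learned embeddings.

```python
# Closed-form Monge map between Gaussian approximations of two sample sets.
import numpy as np
from scipy.linalg import inv, sqrtm

def gaussian_monge_map(Xs: np.ndarray, Xt: np.ndarray):
    """Fit the affine Monge map from source samples Xs to target samples Xt."""
    m1, m2 = Xs.mean(axis=0), Xt.mean(axis=0)
    S1, S2 = np.cov(Xs, rowvar=False), np.cov(Xt, rowvar=False)
    S1_half = np.real(sqrtm(S1))
    A = inv(S1_half) @ np.real(sqrtm(S1_half @ S2 @ S1_half)) @ inv(S1_half)
    return lambda X: m2 + (X - m1) @ A.T

rng = np.random.default_rng(0)
Xs = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))
Xt = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3)) + 5.0
T = gaussian_monge_map(Xs, Xt)
print(np.allclose(T(Xs).mean(axis=0), Xt.mean(axis=0), atol=0.2))  # means align
```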
Revisiting invariances and introducing priors in Gromov-Wasserstein distances
Demetci, Pinar, Tran, Quang Huy, Redko, Ievgen, Singh, Ritambhara
The Gromov-Wasserstein distance has found many applications in machine learning due to its ability to compare measures across metric spaces and its invariance to isometric transformations. However, in certain applications, this invariance property can be too flexible and thus undesirable. Moreover, the Gromov-Wasserstein distance solely considers pairwise sample similarities in the input datasets, disregarding the raw feature representations. We propose a new optimal transport-based distance, called Augmented Gromov-Wasserstein, that allows for some control over the level of rigidity to transformations. It also incorporates feature alignments, enabling us to better leverage prior knowledge about the input data for improved performance. We present theoretical insights into the proposed metric. We then demonstrate its usefulness for single-cell multi-omic alignment tasks and a transfer learning scenario in machine learning.
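As a hedged, related construction (not the paper's exact Augmented Gromov-Wasserstein objective), the sketch below uses POT's fused Gromov-Wasserstein, whose alpha parameter plays a comparable role in trading off pairwise structure against raw feature alignment.

```python
# Structure-vs-features trade-off with POT's fused Gromov-Wasserstein.
import numpy as np
import ot

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(40, 5)), rng.normal(size=(50, 5))

C1, C2 = ot.dist(X, X), ot.dist(Y, Y)        # intra-domain structure
C1, C2 = C1 / C1.max(), C2 / C2.max()
M = ot.dist(X, Y)                            # cross-domain feature cost
p, q = ot.unif(len(X)), ot.unif(len(Y))

# alpha -> 1: pure Gromov-Wasserstein; alpha -> 0: pure feature-based OT.
plan = ot.gromov.fused_gromov_wasserstein(M, C1, C2, p, q,
                                          loss_fun='square_loss', alpha=0.5)
print(plan.shape)                            # (40, 50) coupling matrix
```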