Goto

Collaborating Authors

 Regression


Transformers learn in-context by gradient descent

arXiv.org Artificial Intelligence

At present, the mechanisms of in-context learning in Transformers are not well understood and remain mostly an intuition. In this paper, we suggest that training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations. We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient-descent (GD) on a regression loss. Motivated by that construction, we show empirically that when training self-attention-only Transformers on simple regression tasks either the models learned by GD and Transformers show great similarity or, remarkably, the weights found by optimization match the construction. Thus we show how trained Transformers become mesa-optimizers i.e. learn models by gradient descent in their forward pass. This allows us, at least in the domain of regression problems, to mechanistically understand the inner workings of in-context learning in optimized Transformers. Building on this insight, we furthermore identify how Transformers surpass the performance of plain gradient descent by learning an iterative curvature correction and learn linear models on deep data representations to solve non-linear regression tasks. Finally, we discuss intriguing parallels to a mechanism identified to be crucial for in-context learning termed induction-head (Olsson et al., 2022) and show how it could be understood as a specific case of in-context learning by gradient descent learning within Transformers. Code to reproduce the experiments can be found at https://github.com/google-research/self-organising-systems/tree/master/transformers_learn_icl_by_gd .


EAMDrift: An interpretable self retrain model for time series

arXiv.org Artificial Intelligence

The use of machine learning for time series prediction has become increasingly popular across various industries thanks to the availability of time series data and advancements in machine learning algorithms. However, traditional methods for time series forecasting rely on pre-optimized models that are ill-equipped to handle unpredictable patterns in data. In this paper, we present EAMDrift, a novel method that combines forecasts from multiple individual predictors by weighting each prediction according to a performance metric. EAMDrift is designed to automatically adapt to out-of-distribution patterns in data and identify the most appropriate models to use at each moment through interpretable mechanisms, which include an automatic retraining process. Specifically, we encode different concepts with different models, each functioning as an observer of specific behaviors. The activation of the overall model then identifies which subset of the concept observers is identifying concepts in the data. This activation is interpretable and based on learned rules, allowing to study of input variables relations. Our study on real-world datasets shows that EAMDrift outperforms individual baseline models by 20% and achieves comparable accuracy results to non-interpretable ensemble models. These findings demonstrate the efficacy of EAMDrift for time-series prediction and highlight the importance of interpretability in machine learning models.


Online Label Shift: Optimal Dynamic Regret meets Practical Algorithms

arXiv.org Artificial Intelligence

This paper focuses on supervised and unsupervised online label shift, where the class marginals $Q(y)$ varies but the class-conditionals $Q(x|y)$ remain invariant. In the unsupervised setting, our goal is to adapt a learner, trained on some offline labeled data, to changing label distributions given unlabeled online data. In the supervised setting, we must both learn a classifier and adapt to the dynamically evolving class marginals given only labeled online data. We develop novel algorithms that reduce the adaptation problem to online regression and guarantee optimal dynamic regret without any prior knowledge of the extent of drift in the label distribution. Our solution is based on bootstrapping the estimates of \emph{online regression oracles} that track the drifting proportions. Experiments across numerous simulated and real-world online label shift scenarios demonstrate the superior performance of our proposed approaches, often achieving 1-3\% improvement in accuracy while being sample and computationally efficient. Code is publicly available at https://github.com/acmi-lab/OnlineLabelShift.


Zero-Shot Automatic Pronunciation Assessment

arXiv.org Artificial Intelligence

Automatic Pronunciation Assessment (APA) is vital for computer-assisted language learning. Prior methods rely on annotated speech-text data to train Automatic Speech Recognition (ASR) models or speech-score data to train regression models. In this work, we propose a novel zero-shot APA method based on the pre-trained acoustic model, HuBERT. Our method involves encoding speech input and corrupting them via a masking module. We then employ the Transformer encoder and apply k-means clustering to obtain token sequences. Finally, a scoring module is designed to measure the number of wrongly recovered tokens. Experimental results on speechocean762 demonstrate that the proposed method achieves comparable performance to supervised regression baselines and outperforms non-regression baselines in terms of Pearson Correlation Coefficient (PCC). Additionally, we analyze how masking strategies affect the performance of APA.


Regression with Sensor Data Containing Incomplete Observations

arXiv.org Artificial Intelligence

This paper addresses a regression problem in which output label values are the results of sensing the magnitude of a phenomenon. A low value of such labels can mean either that the actual magnitude of the phenomenon was low or that the sensor made an incomplete observation. This leads to a bias toward lower values in labels and the resultant learning because labels may have lower values due to incomplete observations, even if the actual magnitude of the phenomenon was high. Moreover, because an incomplete observation does not provide any tags indicating incompleteness, we cannot eliminate or impute them. To address this issue, we propose a learning algorithm that explicitly models incomplete observations corrupted with an asymmetric noise that always has a negative value. We show that our algorithm is unbiased as if it were learned from uncorrupted data that does not involve incomplete observations. We demonstrate the advantages of our algorithm through numerical experiments.


Simple Disentanglement of Style and Content in Visual Representations

arXiv.org Artificial Intelligence

Learning visual representations with interpretable features, i.e., disentangled representations, remains a challenging problem. Existing methods demonstrate some success but are hard to apply to large-scale vision datasets like ImageNet. In this work, we propose a simple post-processing framework to disentangle content and style in learned representations from pre-trained vision models. We model the pre-trained features probabilistically as linearly entangled combinations of the latent content and style factors and develop a simple disentanglement algorithm based on the probabilistic model. We show that the method provably disentangles content and style features and verify its efficacy empirically. Our post-processed features yield significant domain generalization performance improvements when the distribution shift occurs due to style changes or style-related spurious correlations.


Mind the Gap: Modelling Difference Between Censored and Uncensored Electric Vehicle Charging Demand

arXiv.org Artificial Intelligence

Electric vehicle charging demand models, with charging records as input, will inherently be biased toward the supply of available chargers. These models often fail to account for demand lost from occupied charging stations and competitors. The lost demand suggests that the actual demand is likely higher than the charging records reflect, i.e., the true demand is latent (unobserved), and the observations are censored. As a result, machine learning models that rely on these observed records for forecasting charging demand may be limited in their application in future infrastructure expansion and supply management, as they do not estimate the true demand for charging. We propose using censorship-aware models to model charging demand to address this limitation. These models incorporate censorship in their loss functions and learn the true latent demand distribution from observed charging records. We study how occupied charging stations and competing services censor demand using GPS trajectories from cars in Copenhagen, Denmark. We find that censorship occurs up to $61\%$ of the time in some areas of the city. We use the observed charging demand from our study to estimate the true demand and find that censorship-aware models provide better prediction and uncertainty estimation of actual demand than censorship-unaware models. We suggest that future charging models based on charging records should account for censoring to expand the application areas of machine learning models in supply management and infrastructure expansion.


Machine learning based iterative learning control for non-repetitive time-varying systems

arXiv.org Artificial Intelligence

The repetitive tracking task for time-varying systems (TVSs) with non-repetitive time-varying parameters, which is also called non-repetitive TVSs, is realized in this paper using iterative learning control (ILC). A machine learning (ML) based nominal model update mechanism, which utilizes the linear regression technique to update the nominal model at each ILC trial only using the current trial information, is proposed for non-repetitive TVSs in order to enhance the ILC performance. Given that the ML mechanism forces the model uncertainties to remain within the ILC robust tolerance, an ILC update law is proposed to deal with non-repetitive TVSs. How to tune parameters inside ML and ILC algorithms to achieve the desired aggregate performance is also provided. The robustness and reliability of the proposed method are verified by simulations. Comparison with current state-of-the-art demonstrates its superior control performance in terms of controlling precision. This paper broadens ILC applications from time-invariant systems to non-repetitive TVSs, adopts ML regression technique to estimate non-repetitive time-varying parameters between two ILC trials and proposes a detailed parameter tuning mechanism to achieve desired performance, which are the main contributions.


Predicting Rare Events by Shrinking Towards Proportional Odds

arXiv.org Machine Learning

Training classifiers is difficult with severe class imbalance, but many rare events are the culmination of a sequence with much more common intermediate outcomes. For example, in online marketing a user first sees an ad, then may click on it, and finally may make a purchase; estimating the probability of purchases is difficult because of their rarity. We show both theoretically and through data experiments that the more abundant data in earlier steps may be leveraged to improve estimation of probabilities of rare events. We present PRESTO, a relaxation of the proportional odds model for ordinal regression. Instead of estimating weights for one separating hyperplane that is shifted by separate intercepts for each of the estimated Bayes decision boundaries between adjacent pairs of categorical responses, we estimate separate weights for each of these transitions. We impose an L1 penalty on the differences between weights for the same feature in adjacent weight vectors in order to shrink towards the proportional odds model. We prove that PRESTO consistently estimates the decision boundary weights under a sparsity assumption. Synthetic and real data experiments show that our method can estimate rare probabilities in this setting better than both logistic regression on the rare category, which fails to borrow strength from more abundant categories, and the proportional odds model, which is too inflexible.


A machine learning approach to the prediction of heat-transfer coefficients in micro-channels

arXiv.org Artificial Intelligence

The accurate prediction of the two-phase heat transfer coefficient (HTC) as a function of working fluids, channel geometries and process conditions is key to the optimal design and operation of compact heat exchangers. Advances in artificial intelligence research have recently boosted the application of machine learning (ML) algorithms to obtain data-driven surrogate models for the HTC. For most supervised learning algorithms, the task is that of a nonlinear regression problem. Despite the fact that these models have been proven capable of outperforming traditional empirical correlations, they have key limitations such as overfitting the data, the lack of uncertainty estimation, and interpretability of the results. To address these limitations, in this paper, we use a multi-output Gaussian process regression (GPR) to estimate the HTC in microchannels as a function of the mass flow rate, heat flux, system pressure and channel diameter and length. The model is trained using the Brunel Two-Phase Flow database of high-fidelity experimental data. The advantages of GPR are data efficiency, the small number of hyperparameters to be trained (typically of the same order of the number of input dimensions), and the automatic trade-off between data fit and model complexity guaranteed by the maximization of the marginal likelihood (Bayesian approach). Our paper proposes research directions to improve the performance of the GPR-based model in extrapolation.