Goto

Collaborating Authors

 surrogate data



Scaling laws for learning with real and surrogate data

Neural Information Processing Systems

Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g.



Scaling laws for learning with real and surrogate data

Neural Information Processing Systems

Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. One may instead augment a small set of n data points from the target distribution with data from more accessible sources, e.g. We refer to such data as surrogate data'. We study a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze mathematically this method under several classical statistical models, and validate our findings empirically on datasets from different domains.


Learning from the past: predicting critical transitions with machine learning trained on surrogates of historical data

Ma, Zhiqin, Zeng, Chunhua, Zhang, Yi-Cheng, Bury, Thomas M.

arXiv.org Artificial Intelligence

Complex systems can undergo critical transitions, where slowly changing environmental conditions trigger a sudden shift to a new, potentially catastrophic state. Early warning signals for these events are crucial for decision-making in fields such as ecology, biology and climate science. Generic early warning signals motivated by dynamical systems theory have had mixed success on real noisy data. More recent studies found that deep learning classifiers trained on synthetic data could improve performance. However, neither of these methods take advantage of historical, system-specific data. Here, we introduce an approach that trains machine learning classifiers directly on surrogate data of past transitions, namely surrogate data-based machine learning (SDML). The approach provides early warning signals in empirical and experimental data from geology, climatology, sociology, and cardiology with higher sensitivity and specificity than two widely used generic early warning signals -- variance and lag-1 autocorrelation. Since the approach is trained directly on surrogates of historical data, it is not bound by the restricting assumption of a local bifurcation like previous methods. This system-specific approach can contribute to improved early warning signals to help humans better prepare for or avoid undesirable critical transitions.


Scaling laws for learning with real and surrogate data

Jain, Ayush, Montanari, Andrea, Sasoglu, Eren

arXiv.org Artificial Intelligence

Collecting large quantities of high-quality data is often prohibitively expensive or impractical, and a crucial bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources like public datasets, data collected under different circumstances, or synthesized by generative models. Blurring distinctions, we refer to such data as `surrogate data'. We define a simple scheme for integrating surrogate data into training and use both theoretical models and empirical studies to explore its behavior. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution; $(ii)$ In order to reap this benefit, it is crucial to use optimally weighted empirical risk minimization; $(iii)$ The test error of models trained on mixtures of real and surrogate data is well described by a scaling law. This can be used to predict the optimal weighting and the gain from surrogate data.


Correlated Components Analysis --- Extracting Reliable Dimensions in Multivariate Data

Parra, Lucas C., Haufe, Stefan, Dmochowski, Jacek P.

arXiv.org Machine Learning

How does one find data dimensions that are reliably expressed across repetitions? For example, in neuroscience one may want to identify combinations of brain signals that are reliably activated across multiple trials or subjects. For a clinical assessment with multiple ratings, one may want to identify an aggregate score that is reliably reproduced across raters. The approach proposed here --- "correlated components analysis" --- is to identify components that maximally correlate between repetitions (e.g. trials, subjects, raters). This can be expressed as the maximization of the ratio of between-repetition to within-repetition covariance, resulting in a generalized eigenvalue problem. We show that covariances can be computed efficiently without explicitly considering all pairs of repetitions, that the result is equivalent to multi-class linear discriminant analysis for unbiased signals, and that the approach also maximize reliability, defined as the mean divided by the deviation across repetitions. We also extend the method to non-linear components using kernels, discuss regularization to improve numerical stability, present parametric and non-parametric tests to establish statistical significance, and provide code.


Slice sampling covariance hyperparameters of latent Gaussian models

Murray, Iain, Adams, Ryan P.

Neural Information Processing Systems

The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes.


Slice sampling covariance hyperparameters of latent Gaussian models

Murray, Iain, Adams, Ryan Prescott

arXiv.org Machine Learning

Computer Science University of Toronto The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong-and weak-data regimes.


Estimating the Reliability of ICA Projections

Meinecke, Frank C., Ziehe, Andreas, Kawanabe, Motoaki, Müller, Klaus-Robert

Neural Information Processing Systems

When applying unsupervised learning techniques like ICA or temporal decorrelation, a key question is whether the discovered projections are reliable. In other words: can we give error bars or can we assess the quality of our separation? We use resampling methods to tackle these questions and show experimentally that our proposed variance estimations are strongly correlated to the separation error. We demonstrate that this reliability estimation can be used to choose the appropriate ICA-model, to enhance significantly the separation performance, and, most important, to mark the components that have a actual physical meaning.