 Regression


How To Choose The Best Machine Learning Algorithm For A Particular Problem?

#artificialintelligence

How do you know which machine learning algorithm to choose for your problem? Why not try every machine learning algorithm, or at least the ones we expect to give good accuracy? Applying each and every algorithm would take a lot of time, so it is better to use a technique for identifying which algorithm to apply. Choosing the right algorithm is tied to the problem statement.


Estimating Stochastic Linear Combination of Non-linear Regressions Efficiently and Scalably

arXiv.org Machine Learning

Many machine learning and statistical models, such as non-linear regressions, the Single Index, Multi-index, and Varying Coefficient Index Models, and Two-layer Neural Networks, can be reduced to, or seen as special cases of, a new model called the Stochastic Linear Combination of Non-linear Regressions model. However, due to the high non-convexity of the problem, no previous work has studied how to estimate the model. In this paper, we provide the first study on how to estimate the model efficiently and scalably. Specifically, we first show that, under some mild assumptions, if the variate vector $x$ is multivariate Gaussian, then there is an algorithm whose output vectors have $\ell_2$-norm estimation errors of $O(\sqrt{\frac{p}{n}})$ with high probability, where $p$ is the dimension of $x$ and $n$ is the number of samples. The key idea of the proof is based on an observation motivated by Stein's lemma. We then extend our result to the case where $x$ is bounded and sub-Gaussian using the zero-bias transformation, which can be seen as a generalization of the classic Stein's lemma. We also show that, with some additional assumptions, there is an algorithm whose output vectors have $\ell_\infty$-norm estimation errors of $O(\frac{1}{\sqrt{p}}+\sqrt{\frac{p}{n}})$ with high probability, and we provide a concrete example showing that there exists a link function satisfying these assumptions. Finally, for both the Gaussian and sub-Gaussian cases, we propose a faster sub-sampling based algorithm and show that when the sub-sample sizes are large enough, the estimation errors are not sacrificed by much. Experiments for both cases support our theoretical results. To the best of our knowledge, this is the first work that studies and provides theoretical guarantees for the stochastic linear combination of non-linear regressions model.
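For context, the classic form of Stein's lemma that the abstract alludes to can be written as follows. This is the textbook statement for a standard Gaussian vector; the symbols $g$, $f$ and $\beta$ are generic choices made here for illustration and are not taken from the paper.

$$
\mathbb{E}[g(x)\,x] = \mathbb{E}[\nabla g(x)], \qquad x \sim \mathcal{N}(0, I_p),
$$

so for a single-index function $g(x) = f(\langle \beta, x \rangle)$,

$$
\mathbb{E}[f(\langle \beta, x \rangle)\,x] = \mathbb{E}[f'(\langle \beta, x \rangle)]\,\beta .
$$

In other words, the cross-moment between the response and $x$ points in the direction of $\beta$, which is the kind of observation that can make moment-based estimation feasible despite non-convexity.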


Learning Optimal Conditional Priors For Disentangled Representations

arXiv.org Machine Learning

A large part of the literature on learning disentangled representations focuses on variational autoencoders (VAEs). Recent developments demonstrate that disentanglement cannot be obtained in a fully unsupervised setting without inductive biases on models and data. As such, Khemakhem et al., AISTATS 2020, suggest employing a factorized prior distribution over the latent variables that is conditionally dependent on auxiliary observed variables complementing input observations. While this is a remarkable advancement toward model identifiability, the learned conditional prior only focuses on sufficiency, giving no guarantees on a minimal representation. Motivated by information theoretic principles, we propose a novel VAE-based generative model with theoretical guarantees on disentanglement. Our proposed model learns a sufficient and compact - thus optimal - conditional prior, which serves as regularization for the latent space. Experimental results indicate superior performance with respect to state-of-the-art methods, according to several established metrics proposed in the literature on disentanglement.


Interpretable Machine Learning -- A Brief History, State-of-the-Art and Challenges

arXiv.org Machine Learning

We present a brief history of the field of interpretable machine learning (IML), give an overview of state-of-the-art interpretation methods, and discuss challenges. Research in IML has boomed in recent years. As young as the field is, its roots reach back over 200 years in regression modeling, and to the 1960s in rule-based machine learning. Recently, many new IML methods have been proposed, many of them model-agnostic, but also interpretation techniques specific to deep learning and tree-based ensembles. IML methods either directly analyze model components, study sensitivity to input perturbations, or analyze local or global surrogate approximations of the ML model. The field is approaching a state of readiness and stability, with many methods not only proposed in research but also implemented in open-source software. But many important challenges remain for IML, such as dealing with dependent features, causal interpretation, and uncertainty estimation, which need to be resolved for its successful application to scientific problems. A further challenge is the lack of a rigorous definition of interpretability that is accepted by the community. To address these challenges and advance the field, we urge the community to recall its roots of interpretable, data-driven modeling in statistics and (rule-based) ML, but also to consider other areas such as sensitivity analysis, causal inference, and the social sciences.


Monash University, UEA, UCR Time Series Extrinsic Regression Archive

arXiv.org Machine Learning

Time series research has attracted a lot of interest in the last decade, especially for Time Series Classification (TSC) and Time Series Forecasting (TSF). Research in TSC has greatly benefited from the University of California Riverside and University of East Anglia (UCR/UEA) Time Series Archives. On the other hand, the advancement in Time Series Forecasting relies on time series forecasting competitions such as the Makridakis competitions, the NN3 and NN5 Neural Network competitions, and a few Kaggle competitions. Each year, thousands of papers proposing new algorithms for TSC and TSF have utilized these benchmarking archives. These algorithms are designed for these specific problems, but may not be useful for tasks such as predicting the heart rate of a person using photoplethysmogram (PPG) and accelerometer data. We refer to this problem as Time Series Extrinsic Regression (TSER), where we are interested in a more general methodology for predicting a single continuous value from univariate or multivariate time series. The predicted value can come from the same time series or be only indirectly related to the predictor time series, and it does not necessarily need to be a future value or depend heavily on recent values. To the best of our knowledge, research into TSER has received much less attention in the time series research community, and there are no models developed for general time series extrinsic regression problems. Most models are developed for a specific problem. Therefore, we aim to motivate and support research into TSER by introducing the first TSER benchmarking archive. This archive contains 19 datasets from different domains, with varying numbers of dimensions, unequal-length dimensions, and missing values. In this paper, we introduce the datasets in this archive and present an initial benchmark of existing models.


Model-sharing Games: Analyzing Federated Learning Under Voluntary Participation

arXiv.org Machine Learning

Federated learning is a setting where agents, each with access to their own data source, combine models learned from local data to create a global model. If agents are drawing their data from different distributions, though, federated learning might produce a biased global model that is not optimal for each agent. This means that agents face a fundamental question: should they join the global model or stay with their local model? In this work, we show how this situation can be naturally analyzed through the framework of coalitional game theory. Motivated by these considerations, we propose the following game: there are heterogeneous players with different model parameters governing their data distribution and different amounts of data they have noisily drawn from their own distribution. Each player's goal is to obtain a model with minimal expected mean squared error (MSE) on their own distribution. They have a choice of fitting a model based solely on their own data, or combining their learned parameters with those of some subset of the other players. Combining models reduces the variance component of their error through access to more data, but increases the bias because of the heterogeneity of distributions. In this work, we derive exact expected MSE values for problems in linear regression and mean estimation. We use these values to analyze the resulting game in the framework of hedonic game theory; we study how players might divide into coalitions, where each set of players within a coalition jointly constructs a single model. In a case with arbitrarily many players that each have either a "small" or "large" amount of data, we constructively show that there always exists a stable partition of players into coalitions.
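The variance-versus-bias trade-off described above can be illustrated with a minimal sketch for mean estimation. It makes simplifying assumptions that are not the paper's exact model: each player's true mean is treated as fixed and known, all players share the same noise variance, and the coalition simply averages all pooled samples. The function names are hypothetical.

```python
def local_mse(noise_var, n_i):
    """Expected MSE of a player's own sample mean: pure variance, no bias."""
    return noise_var / n_i


def pooled_mse(i, means, counts, noise_var):
    """Expected MSE, for player i, of the average over all pooled samples.

    Assumption: true means are fixed, so the error splits into
    variance (noise_var / total samples) plus squared bias.
    """
    total = sum(counts)
    # Bias: the pooled mean targets the sample-weighted average of true means.
    bias = sum(n_j * (m_j - means[i]) for m_j, n_j in zip(means, counts)) / total
    return noise_var / total + bias ** 2


# Two players: a "small" one (4 samples) and a "large" one (16 samples).
counts = [4, 16]
noise_var = 4.0

similar = [0.0, 0.5]    # true means close together
different = [0.0, 2.0]  # true means far apart

for label, means in [("similar", similar), ("different", different)]:
    for i in range(len(counts)):
        print(label, i, local_mse(noise_var, counts[i]),
              pooled_mse(i, means, counts, noise_var))
```

With similar means, pooling lowers the expected error for both players; with very different means, the added bias outweighs the variance reduction for the small player, who would then prefer the local model.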


[D] Simple Questions Thread October 11, 2020

#artificialintelligence

The predict function in Python is Y = X * Beta, where Y is a column vector, X is the design matrix, and Beta is the column vector of parameters. You could definitely create the equation programmatically in the form that you want, though. I don't know what function/module you are using for your regression. Are you processing the data into polynomial features, then feeding that to a linear regression model? In this PolynomialFeatures preprocessing class, you can use the .get_feature_names()
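As a rough illustration of the reply above, here is a library-free sketch of the Y = X * Beta prediction and of building the equation as a string. The feature names and coefficients are hand-written, hypothetical stand-ins for whatever a polynomial-feature step and a fitted linear model would report; the helper names are also made up for this example.

```python
def predict(X, beta):
    """Return Y = X * Beta, with X a list of rows and beta a list of parameters."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]


def equation_string(feature_names, beta, precision=3):
    """Pair each coefficient with its feature name and join the terms."""
    terms = []
    for name, b in zip(feature_names, beta):
        coef = f"{b:.{precision}f}"
        # A feature named "1" is treated as the intercept (bias) column.
        terms.append(coef if name == "1" else f"{coef}*{name}")
    return "y = " + " + ".join(terms)


# Hypothetical degree-2 polynomial in a single variable x.
feature_names = ["1", "x", "x^2"]
beta = [1.0, 2.0, 0.5]

# Design matrix rows for x = 0, 1, 2: [1, x, x^2].
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 2.0, 4.0]]

print(equation_string(feature_names, beta))
print(predict(X, beta))
```

Negative coefficients are rendered as `+ -0.500*x` in this sketch; a fuller version would fold the sign into the joining operator.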


Multiple Linear Regression model using Python: Machine Learning

#artificialintelligence

If we look at the p-values of some of the variables, they are fairly high, which means those variables are not statistically significant and are candidates for dropping from the model. Before dropping any variables, as discussed above, we have to check for multicollinearity between them. We do that by calculating the Variance Inflation Factor (VIF), a quantitative measure of how strongly each feature variable is correlated with the others. It is an extremely important diagnostic for testing our linear model.
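The VIF calculation can be sketched without any libraries using the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. This is a minimal illustration, not the article's code: the helper names and the toy data are assumptions, and a perfectly collinear feature would cause a division by zero here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(y, columns):
    """R^2 of an ordinary least squares fit of y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(features):
    """Map each feature name to its VIF; features is a dict of name -> values."""
    result = {}
    for name, target in features.items():
        others = [vals for other, vals in features.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(target, others))
    return result


# Toy data: "b" is almost a copy of "a", so both get a very large VIF.
collinear = {"a": [1.0, 2.0, 3.0, 4.0, 5.0], "b": [1.1, 1.9, 3.2, 3.9, 5.1]}
# Toy data: orthogonal, centered features, so the VIF is 1.
independent = {"a": [-1.0, 1.0, -1.0, 1.0], "b": [-1.0, -1.0, 1.0, 1.0]}

print(vif(collinear))
print(vif(independent))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, though the appropriate threshold depends on the application.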


Improve Linear Regression Using Statistics

#artificialintelligence

As a newcomer to the field of machine learning, the first thing you learn is likely to be simple univariate linear regression. However, for the past decade or so, tree-based algorithms and neural networks have overshadowed the significance of linear regression on a commercial scale. The purpose of this blog post is to highlight why linear regression and other linear algorithms are still very relevant, and how you can improve the performance of such rudimentary models to compete with large and sophisticated algorithms like XGBoost and Random Forests. Many self-taught data scientists start code-first, learning how to implement various machine learning algorithms without actually understanding the mathematics behind them. By understanding the math behind these algorithms, we can get an idea of how to improve their performance.


Multi-fidelity data fusion for the approximation of scalar functions with low intrinsic dimensionality through active subspaces

arXiv.org Machine Learning

Gaussian processes are employed for non-parametric regression in a Bayesian setting. They generalize linear regression, embedding the inputs in a latent manifold inside an infinite-dimensional reproducing kernel Hilbert space. We can augment the inputs with the observations of low-fidelity models in order to learn a more expressive latent manifold and thus increase the model's accuracy. This can be realized recursively with a chain of Gaussian processes with incrementally higher fidelity. We would like to extend these multi-fidelity model realizations to case studies affected by a high-dimensional input space but with low intrinsic dimensionality. In these cases, physically supported or purely numerical low-order models are still affected by the curse of dimensionality when queried for responses. When the model's gradient information is provided, the presence of an active subspace can be exploited to design low-fidelity response surfaces and thus enable Gaussian process multi-fidelity regression, without the need to perform new simulations. This is particularly useful in the case of data scarcity. In this work we present a multi-fidelity approach involving active subspaces and we test it on two different high-dimensional benchmarks.