 Regression


Obliviousness Makes Poisoning Adversaries Weaker

arXiv.org Machine Learning

Poisoning attacks have emerged as a significant security threat to machine learning (ML) algorithms. It has been demonstrated that adversaries who make small changes to the training set, such as adding specially crafted data points, can hurt the performance of the output model. Most of these attacks require full knowledge of the training data or the underlying data distribution. In this paper we study the power of oblivious adversaries who do not have any information about the training set. We show a separation between oblivious and full-information poisoning adversaries. Specifically, we construct a sparse linear regression problem for which the LASSO estimator is robust against oblivious adversaries whose goal is to add non-relevant features to the model with a certain poisoning budget. On the other hand, non-oblivious adversaries, with the same budget, can craft poisoning examples based on the rest of the training data and successfully add non-relevant features to the model.
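
To make "adding a non-relevant feature to the model" concrete, here is a minimal sketch in plain Python. It is not the paper's construction and it does not demonstrate the oblivious versus full-information separation. The toy data, the penalty `lam = 0.5`, and the helper names `soft_threshold` and `lasso_cd` are all assumptions chosen for illustration. The sketch fits LASSO by coordinate descent on a clean data set where the second feature is irrelevant, then refits after one crafted point is appended.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding operator used by LASSO coordinate descent."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, sweeps=50):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for row, target in zip(X, y):
                pred_without_j = sum(w[k] * row[k] for k in range(d) if k != j)
                rho += row[j] * (target - pred_without_j)
                z += row[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return w


# Hypothetical clean data: y depends only on feature 0; feature 1 is irrelevant.
X_clean = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y_clean = [2, -2, 2, -2]
w_clean = lasso_cd(X_clean, y_clean, lam=0.5)

# One crafted point that ties the response to the irrelevant feature.
X_poisoned = X_clean + [[0, 3]]
y_poisoned = y_clean + [3]
w_poisoned = lasso_cd(X_poisoned, y_poisoned, lam=0.5)

print("clean:   ", w_clean)
print("poisoned:", w_poisoned)
```

On the clean data the coefficient of the irrelevant feature is exactly zero. After the single crafted point is appended, that coefficient becomes nonzero, which is the kind of feature-addition outcome the abstract describes.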


Uncertainty Estimation in Cancer Survival Prediction

arXiv.org Machine Learning

Survival models are used in various fields, such as the development of cancer treatment protocols. Although many statistical and machine learning models have been proposed to achieve accurate survival predictions, little attention has been paid to obtaining well-calibrated uncertainty estimates for each prediction. The currently popular models are opaque and untrustworthy in that they often express high confidence even on test cases that are not similar to the training samples, and even when their predictions are wrong. We propose a Bayesian framework for survival models that not only gives more accurate survival predictions but also quantifies the survival uncertainty better. Our approach is a novel combination of variational inference for uncertainty estimation, neural multi-task logistic regression for estimating nonlinear and time-varying risk models, and an additional sparsity-inducing prior to work with high-dimensional data.


Logistic Regression examples in python & R

#artificialintelligence

Every machine learning algorithm has its own approach, but not every approach is easy to interpret. Logistic regression is one algorithm whose approach is both distinctive and easy to follow. It is very often used in the credit and risk industry because it gives an intuitive way to predict the chances of default and other risk cases. Most algorithms are a challenge to break down because of their black-box nature and their hard-to-find parameters; on interpretability, logistic regression stands out. So it is time to break down the entire algorithm and draw some inferences.
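
As a rough illustration of why the model is easy to interpret, the sketch below fits a one-feature logistic regression with plain batch gradient descent. It is not the code from the linked post. The toy data (a debt-to-income ratio against a default flag), the learning rate, and the function names are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical data: debt-to-income ratio and whether the borrower defaulted.
ratios = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
defaulted = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(ratios, defaulted)
p_low = sigmoid(w * 0.15 + b)   # predicted default probability at a low ratio
p_high = sigmoid(w * 0.85 + b)  # predicted default probability at a high ratio
print(f"w={w:.2f}, b={b:.2f}, P(default | 0.15)={p_low:.3f}, P(default | 0.85)={p_high:.3f}")
```

The fitted weight has a direct reading: each unit increase in the feature multiplies the odds of default by `exp(w)`, which is the property that makes the model attractive in credit and risk settings.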


Probabilistic forecasting approaches for extreme NO$_2$ episodes: a comparison of models

arXiv.org Machine Learning

To take preventive steps to maintain air quality, decision makers need forecasts of how pollution levels will evolve: detecting pollution peaks beforehand could give cities enough time to take and communicate effective measures. Multiple research papers have focused on this issue and have dealt with the prediction of air quality. Bai et al. [1] describe the state of the art in this exercise and collect a range of diverse solutions applied to this problem. However, predicting the expected value of pollution concentrations through point forecasting does not provide enough information about the likelihood of the pollutant levels reaching a certain threshold. Indeed, we have an estimate, but we usually have neither a description of the confidence of the model nor the uncertainty in the predictions. Therefore, it is difficult to estimate the probability of the pollutant exceeding a certain threshold. This probability estimate matters because the measures taken by cities to limit pollution (for example, limiting traffic) impact the daily routines of citizens and prove to be quite unpopular. Local governments therefore need an estimate of the confidence in the prediction before they can safely engage in those preventive measures.
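
The gap between a point forecast and a probabilistic one can be sketched in a few lines. The example below assumes a Gaussian predictive distribution, which is a simplification made here and not necessarily any of the models compared in the paper. The function name and the concentration values are illustrative only.

```python
import math


def exceedance_probability(mean, std, threshold):
    """P(Y > threshold) for Y ~ Normal(mean, std**2)."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (threshold - mean) / std
    # Standard normal survival function, written with the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Two hypothetical forecasts with the same expected concentration but different spread.
threshold = 200.0  # illustrative alert level
confident = exceedance_probability(mean=180.0, std=10.0, threshold=threshold)
uncertain = exceedance_probability(mean=180.0, std=40.0, threshold=threshold)
print(f"narrow forecast: {confident:.3f}, wide forecast: {uncertain:.3f}")
```

Both forecasts report the same expected value, yet the wider one assigns a much larger probability to crossing the threshold, which is exactly the information a point forecast cannot convey.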


Dimension Independent Generalization Error with Regularized Online Optimization

arXiv.org Machine Learning

One classical canon of statistics is that large models are prone to overfitting and model selection procedures are necessary for high-dimensional data. However, many overparameterized models such as neural networks, which are often trained with simple online methods and regularization, perform very well in practice. The empirical success of overparameterized models, which is often known as benign overfitting, motivates us to take a new look at the statistical generalization theory for online optimization. In particular, we present a general theory on the generalization error of stochastic gradient descent (SGD) for both convex and non-convex loss functions. We further provide the definition of "low effective dimension" so that the generalization error either does not depend on the ambient dimension $p$ or depends on $p$ via a poly-logarithmic factor. We also demonstrate on several widely used statistical models that the "low effective dimension" arises naturally in overparameterized settings. The studied statistical applications include both convex models such as linear regression and logistic regression, and non-convex models such as $M$-estimator and two-layer neural networks.


Efficient Algorithms for Multidimensional Segmented Regression

arXiv.org Machine Learning

We study the fundamental problem of fixed design multidimensional segmented regression: Given noisy samples from a function $f$, promised to be piecewise linear on an unknown set of $k$ rectangles, we want to recover $f$ up to a desired accuracy in mean-squared error. We provide the first sample- and computationally efficient algorithm for this problem in any fixed dimension. Our algorithm relies on a simple iterative merging approach, which is novel in the multidimensional setting. Our experimental evaluation on both synthetic and real datasets shows that our algorithm is competitive and in some cases outperforms state-of-the-art heuristics. Code of our implementation is available at https://github.com/avoloshinov/multidimensional-segmented-regression.


How machine learning can power your business

#artificialintelligence

An unprecedented volume of data is currently being generated across the globe with no less than an estimated 2.5 quintillion (10^18) bytes of data each day at our current pace. The variety of formats in which this data is being produced, and its structural complexity, are also on the rise. Collectively, these factors are driving demand among institutions for advanced analytics to generate actionable insights. At the most basic level, machine learning encompasses the use of computational algorithms more advanced than the analytics methods (data mining approaches, for example) traditionally employed to deliver insights into large datasets. Machine learning techniques are firmly rooted in the science of statistics and have valuable applications not least in financial services.


Neural Networks and Polynomial Regression. Demystifying the Overparametrization Phenomena

arXiv.org Machine Learning

In the context of neural network models, overparametrization refers to the phenomenon whereby these models appear to generalize well on unseen data, even though the number of parameters significantly exceeds the sample size and the model perfectly fits the training data. A conventional explanation of this phenomenon is based on self-regularization properties of the algorithms used to train the models. In this paper we prove a series of results which provide a somewhat diverging explanation. Adopting a teacher/student model where the teacher network is used to generate the predictions and the student network is trained on the observed labeled data, and then tested on out-of-sample data, we show that any student network interpolating the data generated by a teacher network generalizes well, provided that the sample size is at least an explicit quantity controlled by data dimension and approximation guarantee alone, regardless of the number of internal nodes of either teacher or student network. Our claim is based on approximating both teacher and student networks by polynomial (tensor) regression models with degree depending on the desired accuracy and network depth only. Such a parametrization notably does not depend on the number of internal nodes. Thus a message implied by our results is that parametrizing wide neural networks by the number of hidden nodes is misleading, and a more fitting measure of parametrization complexity is the number of regression coefficients associated with tensorized data. In particular, this somewhat reconciles the generalization ability of neural networks with more classical statistical notions of data complexity and generalization bounds. Our empirical results on MNIST and Fashion-MNIST datasets indeed confirm that tensorized regression achieves a good out-of-sample performance, even when the degree of the tensor is at most two.
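
To get a feel for a complexity measure that depends on data dimension and degree but not on network width, the sketch below counts the coefficients of a polynomial regression model. Counting distinct monomials of total degree at most `degree` is an assumption made here; the paper's exact notion of "regression coefficients associated with tensorized data" may differ. The function name is hypothetical, and 784 is the pixel count of a 28 x 28 MNIST image.

```python
import math


def num_polynomial_coefficients(dim, degree):
    """Number of monomials of total degree <= degree in dim variables (constant term included)."""
    return math.comb(dim + degree, degree)


# The count depends only on the input dimension and the polynomial degree.
for degree in (1, 2, 3):
    print(degree, num_polynomial_coefficients(784, degree))
```

Under this counting, a degree-two model on 784 inputs has 308,505 coefficients, a figure that stays the same however many hidden nodes the approximated network has.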


Efficient Tensor Kernel methods for sparse regression

arXiv.org Machine Learning

Recently, classical kernel methods have been extended by the introduction of suitable tensor kernels so as to promote sparsity in the solution of the underlying regression problem. Indeed, they solve an $\ell_p$-norm regularization problem, with $p=m/(m-1)$ and $m$ an even integer, which happens to be close to a lasso problem. However, a major drawback of the method is that storing tensors requires a considerable amount of memory, ultimately limiting its applicability. In this work we address this problem by proposing two advances. First, we directly reduce the memory requirement by introducing a new and more efficient layout for storing the data. Second, we use a Nystrom-type subsampling approach, which allows for a training phase with a smaller number of data points, so as to reduce the computational cost. Experiments, both on synthetic and real datasets, show the effectiveness of the proposed improvements. Finally, we take care of implementing the code in C++ so as to further speed up the computation.
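
The remark that the problem is "close to a lasso problem" follows from the exponent alone. The short worked calculation below uses only the relation between the exponent and the even integer stated in the abstract; the particular values of $m$ are picked here for illustration.

```latex
p = \frac{m}{m-1} = 1 + \frac{1}{m-1},
\qquad
m = 2 \Rightarrow p = 2,
\quad
m = 4 \Rightarrow p = \tfrac{4}{3},
\quad
m = 6 \Rightarrow p = \tfrac{6}{5},
\qquad
\lim_{m \to \infty} p = 1 .
```

So $m = 2$ recovers the ordinary squared-norm (ridge-type) penalty, while larger even $m$ pushes $p$ toward 1, the exponent of the lasso penalty, which is what promotes sparsity.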


Julia Language in Machine Learning: Algorithms, Applications, and Open Issues

arXiv.org Machine Learning

Machine learning is driving development across many fields in science and engineering. A simple and efficient programming language could accelerate applications of machine learning in various fields. Currently, the programming languages most commonly used to develop machine learning algorithms include Python, MATLAB, and C/C++. However, none of these languages balances efficiency and simplicity well. The Julia language is a fast, easy-to-use, and open-source programming language that was originally designed for high-performance computing and that balances efficiency and simplicity well. This paper summarizes the related research work and developments in the application of the Julia language in machine learning. It first surveys the popular machine learning algorithms that are developed in the Julia language. Then, it investigates applications of the machine learning algorithms implemented with the Julia language. Finally, it discusses the open issues and the potential future directions that arise in the use of the Julia language in machine learning.