 Regression


Deploying Machine Learning Models with Heroku

#artificialintelligence

Deployment is the process of integrating a trained machine learning model into a production environment, usually to serve an end user. It is typically the last stage in the development lifecycle of a machine learning product. The "Model Deployment" stage above consists of a series of steps, which are shown in the image below. For the purpose of this tutorial, I will use Flask to build the web application. In this section, let's train the machine learning model we intend to deploy. For simplicity, and so as not to divert from the primary objective of this post, I will deploy a linear regression model.
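The train-then-serve core of such a deployment can be sketched without any web framework. The block below is a minimal illustration under stated assumptions, not the post's own code: the training data is synthetic, the model is a one-feature least-squares fit done with NumPy, and `predict_from_json` is a hypothetical handler name. In a Flask application, a route would pass the request body to a function like this one.

```python
import json
import pickle

import numpy as np

# Synthetic training data (an assumption for illustration): y = 3x + 2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 0.1, size=200)

# Fit ordinary least squares; the column of ones models the intercept.
design = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Serialize the fitted parameters, as one would before shipping them to a server.
model_bytes = pickle.dumps({"slope": float(coef[0]), "intercept": float(coef[1])})


def predict_from_json(body: str, blob: bytes = model_bytes) -> str:
    """Hypothetical request handler: takes a JSON string, returns a JSON string."""
    model = pickle.loads(blob)
    x = float(json.loads(body)["x"])
    return json.dumps({"prediction": model["slope"] * x + model["intercept"]})


print(predict_from_json('{"x": 4.0}'))
```

Keeping the prediction logic in a plain function, separate from the web layer, makes it easy to test before any server is involved.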


Support Recovery in Mixture Models with Sparse Parameters

arXiv.org Artificial Intelligence

Mixture models are widely used to fit complex and multimodal datasets. In this paper we study mixtures with high-dimensional sparse latent parameter vectors and consider the problem of support recovery of those vectors. While parameter learning in mixture models is well studied, the sparsity constraint remains relatively unexplored. Sparsity of parameter vectors is a natural constraint in a variety of settings, and support recovery is a major step towards parameter estimation. We provide efficient algorithms for support recovery that have a logarithmic sample complexity dependence on the dimensionality of the latent space. Our algorithms are quite general: they are applicable to 1) mixtures of many different canonical distributions, including Uniform, Poisson, Laplace, and Gaussian, and 2) mixtures of linear regressions and linear classifiers with Gaussian covariates under different assumptions on the unknown parameters. In most of these settings, our results are the first guarantees on the problem, while in the rest, our results improve on existing work.


Understanding Functions in AI

#artificialintelligence

Every data transformation we do in artificial intelligence seeks to convert input data to the most representative format required for the task we aim to solve. This conversion is done through functions. A machine-learning model transforms its input data into meaningful outputs, a process that is "learned" from exposure to known examples of inputs and outputs. Thus, the ML model "learns a function" that maps its input data to the expected output. We have a table of a few data points; some belong to a "white" class and others to a "black" class.


Quantum Sparse Coding

arXiv.org Machine Learning

A ubiquitous problem in machine learning, statistics, and signal processing is to accurately estimate an unknown sparse vector from a few noisy linear measurements. This estimation problem, which we refer to as sparse coding, is at the heart of the field of compressed sensing, revealing that under sparsity assumptions it is possible to successfully recover a signal that is sampled significantly below the Nyquist rate [1, 2]. This, in turn, led to a dramatic increase in magnetic resonance imaging (MRI) scanning session speed [3]. Another exciting application that also builds on the sparsity assumption is unsupervised representation learning, i.e., given high-dimensional input data, such as an image, finding a low-dimensional representation that captures the intrinsic underlying structure in the input [4, 5, 6]. These representations are often used in image restoration tasks to effectively remove noise (denoising) [7, 8], fill in missing pixels (inpainting) [9, 10, 11], and achieve high-quality digital zoom (super-resolution) [10, 12, 13, 14]. Sparsity also plays a key role in linear regression when, given a large pool of features, the goal is to form a predictive rule that estimates an unknown response using a smaller, interpretable subset of features that manifests the strongest effects [15, 16, 17, 18]. To formalize the sparse coding problem, which is central for tackling the aforementioned applications, we consider the following linear model: b = Ax + v, where A is a matrix of size M × N, the vector x is of length N, and v is a noise vector of length M. In this paper, we focus on a challenging setting in which M ≪ N, where a crucial assumption we make is that the vector x is k-sparse, i.e., it contains only k non-zero elements with k ≪ N [2, 1, 19].
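To make the linear model b = Ax + v concrete, the following is a minimal classical sketch, not the quantum method this paper proposes. It recovers a k-sparse x from M < N noisy measurements using iterative soft thresholding (ISTA) on an l1-regularized least-squares objective. The problem sizes, noise level, regularization weight `lam`, and the 0.5 detection threshold are all assumptions chosen for illustration.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t (proximal map of the l1 norm)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def ista(A, b, lam, n_iter=1000):
    """Approximately minimize 0.5 * ||Ax - b||^2 + lam * ||x||_1."""
    # Step size 1/L, where L is the squared spectral norm of A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        gradient = A.T @ (A @ x - b)
        x = soft_threshold(x - step * gradient, step * lam)
    return x


# Synthetic instance (assumed sizes): fewer measurements than unknowns.
rng = np.random.default_rng(0)
M, N, k = 80, 200, 5
A = rng.normal(size=(M, N)) / np.sqrt(M)
x_true = np.zeros(N)
support = rng.choice(N, size=k, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=k) * rng.uniform(1.0, 2.0, size=k)
b = A @ x_true + 0.01 * rng.normal(size=M)

x_hat = ista(A, b, lam=0.05)
recovered = np.flatnonzero(np.abs(x_hat) > 0.5)
print("true support:     ", sorted(support.tolist()))
print("recovered support:", sorted(recovered.tolist()))
```

The l1 penalty biases the recovered magnitudes slightly toward zero, which is why the sketch compares supports rather than exact values.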


Majority Vote for Distributed Differentially Private Sign Selection

arXiv.org Artificial Intelligence

Privacy-preserving data analysis has become prevalent in recent years. In this paper, we propose a distributed group differentially private majority vote mechanism for the sign selection problem in a distributed setup. To achieve this, we apply iterative peeling to the stability function and use the exponential mechanism to recover the signs. As applications, we study private sign selection for mean estimation and linear regression problems in distributed systems. Our method recovers the support and signs with the optimal signal-to-noise ratio, as in the non-private scenario, which is better than contemporary work on private variable selection. Moreover, the sign selection consistency is justified with theoretical guarantees. Simulation studies are conducted to demonstrate the effectiveness of our proposed method.


Model-free Subsampling Method Based on Uniform Designs

arXiv.org Machine Learning

Subsampling or subdata selection is a useful approach in large-scale statistical learning. Most existing studies focus on model-based subsampling methods, which depend significantly on the model assumption. In this paper, we consider a model-free subsampling strategy for generating subdata from the original full data. In order to measure how well the subdata represent the original data, we propose a criterion, the generalized empirical F-discrepancy (GEFD), and study its theoretical properties in connection with the classical generalized L2-discrepancy in the theory of uniform designs. These properties allow us to develop a low-GEFD data-driven subsampling method based on existing uniform designs. Through simulation examples and a real case study, we show that the proposed subsampling method is superior to the random sampling method. Moreover, our method remains robust under diverse model specifications, while other popular subsampling methods underperform. In practice, such a model-free property is more appealing than model-based subsampling methods, which may perform poorly when the model is misspecified, as demonstrated in our simulation studies.


Using Multivariate Linear Regression for Biochemical Oxygen Demand Prediction in Waste Water

arXiv.org Artificial Intelligence

There exist opportunities for Multivariate Linear Regression (MLR) in the prediction of Biochemical Oxygen Demand (BOD) in waste water, using diverse water quality parameters as the input variables. The goal of this work is to examine the capability of MLR in predicting BOD in waste water through four input variables: Dissolved Oxygen (DO), Nitrogen, Fecal Coliform and Total Coliform. Of the seven parameters examined, these four input variables have the strongest correlation with BOD. Machine Learning (ML) was done with both 80% and 90% of the data as the training set, and 20% and 10% as the test set, respectively. MLR performance was evaluated through the coefficient of correlation (r), Root Mean Square Error (RMSE) and the percentage accuracy in prediction of BOD. The performance indices for the input variables of Dissolved Oxygen, Nitrogen, Fecal Coliform and Total Coliform in prediction of BOD are: RMSE = 6.77 mg/L, r = 0.60 and accuracy of 70.3% for a training set of 80% of the dataset, and RMSE = 6.74 mg/L, r = 0.60 and accuracy of 87.5% for a training set of 90% of the dataset. It was found that increasing the percentage of the training set above 80% of the dataset improved only the accuracy of the model and did not have a significant impact on its prediction capacity. The results showed that the MLR model could be successfully employed in the estimation of BOD in waste water using appropriately selected input parameters.
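The evaluation protocol described here (fit MLR on 80% or 90% of the data, then report RMSE and r on the held-out remainder) can be sketched as follows. This is an illustration on synthetic data, not the study's dataset or code: the four predictors only stand in for DO, Nitrogen, Fecal Coliform and Total Coliform, the coefficients are invented, and `fit_and_score` is a hypothetical helper name. The paper's "percentage accuracy" is not defined in the abstract, so it is left out.

```python
import numpy as np

# Synthetic stand-in data (assumption): four predictors and a linear response.
rng = np.random.default_rng(42)
n_samples = 500
X = rng.normal(size=(n_samples, 4))
assumed_coef = np.array([-2.0, 1.5, 0.8, 0.5])
y = 10.0 + X @ assumed_coef + rng.normal(0.0, 1.0, size=n_samples)


def fit_and_score(X, y, train_fraction, seed=0):
    """Fit least-squares MLR on a random training split; return (RMSE, r) on the rest."""
    order = np.random.default_rng(seed).permutation(len(y))
    n_train = int(round(train_fraction * len(y)))
    train, test = order[:n_train], order[n_train:]

    # A leading column of ones models the intercept.
    design_train = np.column_stack([np.ones(len(train)), X[train]])
    beta, *_ = np.linalg.lstsq(design_train, y[train], rcond=None)

    design_test = np.column_stack([np.ones(len(test)), X[test]])
    predictions = design_test @ beta

    rmse = float(np.sqrt(np.mean((predictions - y[test]) ** 2)))
    r = float(np.corrcoef(predictions, y[test])[0, 1])
    return rmse, r


for fraction in (0.8, 0.9):
    rmse, r = fit_and_score(X, y, fraction)
    print(f"train={fraction:.0%}  RMSE={rmse:.2f}  r={r:.2f}")
```

Comparing the two splits on the same shuffled order keeps the comparison focused on training set size rather than on which rows happen to land in the test set.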


SurvSHAP(t): Time-dependent explanations of machine learning survival models

arXiv.org Artificial Intelligence

Machine and deep learning survival models demonstrate similar or even improved time-to-event prediction capabilities compared to classical statistical learning methods yet are too complex to be interpreted by humans. Several model-agnostic explanations are available to overcome this issue; however, none directly explain the survival function prediction. In this paper, we introduce SurvSHAP(t), the first time-dependent explanation that allows for interpreting survival black-box models. It is based on SHapley Additive exPlanations with solid theoretical foundations and a broad adoption among machine learning practitioners. The proposed methods aim to enhance precision diagnostics and support domain experts in making decisions. Experiments on synthetic and medical data confirm that SurvSHAP(t) can detect variables with a time-dependent effect, and its aggregation is a better determinant of the importance of variables for a prediction than SurvLIME. SurvSHAP(t) is model-agnostic and can be applied to all models with functional output. We provide an accessible implementation of time-dependent explanations in Python at http://github.com/MI2DataLab/survshap.


Privacy Against Inference Attacks in Vertical Federated Learning

arXiv.org Artificial Intelligence

Vertical federated learning is considered, where an active party, having access to true class labels, wishes to build a classification model by utilizing more features from a passive party, which has no access to the labels, to improve the model accuracy. In the prediction phase, with logistic regression as the classification model, several inference attack techniques are proposed that the adversary, i.e., the active party, can employ to reconstruct the passive party's features, regarded as sensitive information. These attacks, which are mainly based on a classical notion of the center of a set, i.e., the Chebyshev center, are shown to be superior to those proposed in the literature. Moreover, several theoretical performance guarantees are provided for the aforementioned attacks. Subsequently, we consider the minimum amount of information that the adversary needs to fully reconstruct the passive party's features. In particular, it is shown that when the passive party holds one feature, and the adversary is only aware of the signs of the parameters involved, it can perfectly reconstruct that feature when the number of predictions is large enough. Next, as a defense mechanism, a privacy-preserving scheme is proposed that worsens the adversary's reconstruction attacks, while preserving the full benefits that VFL brings to the active party. Finally, experimental results demonstrate the effectiveness of the proposed attacks and the privacy-preserving scheme.


Weak Collocation Regression method: fast reveal hidden stochastic dynamics from high-dimensional aggregate data

arXiv.org Artificial Intelligence

Revealing hidden dynamics from stochastic data is a challenging problem, as randomness takes part in the evolution of the data. The problem becomes exceedingly complex when the trajectories of the stochastic data are absent, as they are in many scenarios. Here we present an approach to effectively model the dynamics of stochastic data without trajectories, based on the weak form of the Fokker-Planck (FP) equation, which governs the evolution of the density function in the Brownian process. Taking collocations of Gaussian functions as the test functions in the weak form of the FP equation, we transfer the derivatives to the Gaussian functions and thus approximate the weak form by the expectational sum of the data. With a dictionary representation of the unknown terms, a linear system is built and then solved by regression, revealing the unknown dynamics of the data. Hence, we name the method Weak Collocation Regression (WCR) for its three key components: weak form, collocation of Gaussian kernels, and regression. The numerical experiments show that our method is flexible and fast: it reveals the dynamics within seconds in multi-dimensional problems and can be easily extended to high-dimensional data, such as 20 dimensions. WCR can also correctly identify the hidden dynamics of complex tasks with variable-dependent diffusion and coupled drift, and its performance is robust, achieving high accuracy in the case with added noise.