Regression
Conformalized Quantile Regression
Romano, Yaniv, Patterson, Evan, Candès, Emmanuel J.
Conformal prediction is a technique for constructing prediction intervals that attain valid coverage in finite samples, without making distributional assumptions. Despite this appeal, existing conformal methods can be unnecessarily conservative because they form intervals of constant or weakly varying length across the input space. In this paper we propose a new method that is fully adaptive to heteroscedasticity. It combines conformal prediction with classical quantile regression, inheriting the advantages of both. We establish a theoretical guarantee of valid coverage, supplemented by extensive experiments on popular regression datasets. We compare the efficiency of conformalized quantile regression to other conformal methods, showing that our method tends to produce shorter intervals.
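The abstract does not spell out how quantile regression and conformal prediction are combined. The sketch below follows the commonly described split-conformal recipe: fit lower and upper quantile estimates on one part of the data, then correct them with scores computed on a held-out calibration set. The function name and arguments are hypothetical, and the quantile predictions are assumed to come from any quantile regressor fitted elsewhere.

```python
import math
import numpy as np

def conformalize_quantile_band(lo_cal, hi_cal, y_cal, lo_test, hi_test, alpha):
    """Adjust a quantile-regression band using held-out calibration data.

    lo_*/hi_* are predicted lower/upper quantiles (assumed inputs);
    alpha is the target miscoverage rate, e.g. 0.1 for 90% intervals.
    """
    lo_cal, hi_cal, y_cal = (np.asarray(a, dtype=float) for a in (lo_cal, hi_cal, y_cal))
    # Conformity score: positive when y falls outside the band, negative when inside.
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(scores)
    # Finite-sample correction: use the ceil((1 - alpha) * (n + 1))-th smallest score.
    k = math.ceil((1 - alpha) * (n + 1))
    # With too few calibration points the only valid interval is the whole real line.
    q = np.inf if k > n else np.sort(scores)[k - 1]
    # The same correction is applied to both ends, so the interval length
    # still varies with x through the fitted quantiles.
    return np.asarray(lo_test, dtype=float) - q, np.asarray(hi_test, dtype=float) + q
```

Because the correction is a single scalar added to an input-dependent band, the resulting intervals keep whatever adaptivity to heteroscedasticity the underlying quantile regressor provides.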
Robust Federated Training via Collaborative Machine Teaching using Trusted Instances
Federated learning performs distributed model training using local data hosted by agents. It shares only model parameter updates for iterative aggregation at the server. Although it is privacy-preserving by design, federated learning is vulnerable to noise corruption of local agents, as demonstrated in a previous study on the adversarial data poisoning threat against federated learning systems. Even a single noise-corrupted agent can bias the model training. In our work, we propose a collaborative and privacy-preserving machine teaching paradigm with multiple distributed teachers to improve the robustness of the federated training process against local data corruption. We assume that each local agent (teacher) has the resources to verify a small portion of trusted instances, which may not by itself be adequate for learning. In the proposed collaborative machine teaching method, these trusted instances guide the distributed agents to jointly select a compact yet informative training subset from the data they host. Simultaneously, the agents learn to add changes of limited magnitude to the selected data instances, in order to improve the test performance of the federally trained model despite the training data corruption. Experiments on toy and real data demonstrate that our approach can identify training set bugs effectively and suggest appropriate changes to the labels. Our algorithm is a step toward trustworthy machine learning.
A deep learning approach for analyzing the composition of chemometric data
... which applies statistical and mathematical methods to process the data obtained through spectroscopic techniques, in order to derive information of interest. The need for chemometric analysis comes from the development of analytical instruments and techniques that are capable of producing large amounts of complex data. Data collection through spectroscopic techniques is based on the interaction of light energy of variable wavelength with samples under test [1]. The ability of a sample to absorb or transmit light energy is recorded in terms of values throughout a selected bandwidth of the electromagnetic spectrum. Whether it be the food, pharmaceutical or textile industry, concentrations of chemical components of interest in samples are estimated through chemometric analysis.
While PLSR focuses on calculating the linear projections that show maximum correlation with the output or target variable, thus estimating a linear regression model determined by the projected coordinates. Benoudjit et al. [10] proposed linear and nonlinear regression methodologies which are based upon an incremental routine for feature selection and using a validation set. In [11,12] different techniques have been introduced to improve the results of the previous method by choosing the best feature set for initializing the routine and finding a feature selection strategy that depends entirely on the shared information between spectral data and target variable. An interesting approach to the chemometrics problems has been ...
F-measure Maximizing Logistic Regression
Okabe, Masaaki, Tsuchida, Jun, Yadohisa, Hiroshi
Logistic regression is a widely used method in several fields. When applying logistic regression to imbalanced data, for which majority classes dominate over minority classes, all class labels are estimated as `majority class.' In this article, we use an F-measure optimization method to improve the performance of logistic regression applied to imbalanced data. While many F-measure optimization methods adopt a ratio of the estimators to approximate the F-measure, the ratio of the estimators tends to have more bias than when the ratio is directly approximated. Therefore, we employ an approximate F-measure for estimating the relative density ratio. In addition, we define a relative F-measure and approximate the relative F-measure. We present an algorithm for logistic regression weighted by the approximated relative F-measure. The experimental results using real-world data demonstrate that our proposed method is an efficient algorithm to improve the performance of logistic regression applied to imbalanced data.
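To make the motivation concrete, the sketch below computes the standard F-measure from confusion counts and applies it to a classifier that always predicts the majority class. It does not implement the paper's relative F-measure or its weighting scheme; the function name and the toy labels are hypothetical.

```python
import numpy as np

def f_measure(y_true, y_pred, beta=1.0):
    """F_beta for binary labels, where 1 marks the minority (positive) class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    # Convention assumed here: return 0 when there are no positives at all.
    return 0.0 if denom == 0 else float((1 + beta**2) * tp / denom)

# Toy imbalanced data: 5 positives, 95 negatives (illustrative only).
y_true = np.array([1] * 5 + [0] * 95)
all_majority = np.zeros(100, dtype=int)  # classifier that always says "majority"

accuracy = float(np.mean(y_true == all_majority))
f1 = f_measure(y_true, all_majority)
```

On this toy data the always-majority classifier is correct on 95 of 100 examples yet has no true positives, so its F-measure is zero. That gap between accuracy and F-measure is what motivates optimizing the F-measure directly.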
CrossTrainer: Practical Domain Adaptation with Loss Reweighting
Chen, Justin, Gan, Edward, Rong, Kexin, Suri, Sahaana, Bailis, Peter
Domain adaptation provides a powerful set of model training techniques given domain-specific training data and supplemental data with unknown relevance. The techniques are useful when users need to develop models with data from varying sources, of varying quality, or from different time ranges. We build CrossTrainer, a system for practical domain adaptation. CrossTrainer utilizes loss reweighting, which provides consistently high model accuracy across a variety of datasets in our empirical analysis. However, loss reweighting is sensitive to the choice of a weight hyperparameter that is expensive to tune. We develop optimizations leveraging unique properties of loss reweighting that allow CrossTrainer to output accurate models while improving training time compared to naive hyperparameter search.
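As a rough illustration of loss reweighting, the sketch below blends the average loss on target-domain examples with the average loss on supplemental source examples using a single weight. The convex-combination form, the per-domain averaging, and all names are assumptions made for illustration; the abstract does not state the exact objective CrossTrainer uses.

```python
import numpy as np

def reweighted_loss(target_losses, source_losses, alpha):
    """Blend per-example losses from two data sources.

    alpha = 1 trains on target data only; alpha = 0 on source data only.
    (Assumed form: a convex combination of the two mean losses.)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    target_term = float(np.mean(target_losses))
    source_term = float(np.mean(source_losses))
    return alpha * target_term + (1.0 - alpha) * source_term
```

A naive search would retrain the model once per candidate weight, which is the tuning cost the abstract refers to.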
Inside the Machine Learning Powering LinkedIn Recruiter Recommendation Systems
LinkedIn is one of the favorite recruiting platforms in the market. Every day, recruiters from all over the world rely on LinkedIn to source and filter candidates for specific career opportunities. Specifically, LinkedIn Recruiter is the product that helps recruiters build and manage a talent pool that optimizes the chances of a successful hire. The effectiveness of LinkedIn Recruiter is powered by a sophisticated series of search and recommendation algorithms that combine state-of-the-art machine learning architectures with the pragmatism of real-world systems. It's no secret that LinkedIn has been one of the software giants pushing the boundaries of machine learning research and development.
Similarity of Neural Network Representations Revisited
Kornblith, Simon, Norouzi, Mohammad, Lee, Honglak, Hinton, Geoffrey
Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network representations based on canonical correlation analysis (CCA). We show that CCA belongs to a family of statistics for measuring multivariate similarity, but that neither CCA nor any other statistic that is invariant to invertible linear transformation can measure meaningful similarities between representations of higher dimension than the number of data points. We introduce a similarity index that measures the relationship between representational similarity matrices and does not suffer from this limitation. This similarity index is equivalent to centered kernel alignment (CKA) and is also closely connected to CCA. Unlike CCA, CKA can reliably identify correspondences between representations in networks trained from different initializations.
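A minimal sketch of CKA with a linear kernel is given below, assuming two activation matrices with one row per example. The choice of the linear kernel and the function name are assumptions; the abstract itself only names CKA.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear-kernel CKA between representations X (n x p1) and Y (n x p2)."""
    # Center each feature so the similarity ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two within-representation covariances.
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

This index is unchanged by orthogonal transformations and isotropic scaling of either representation, but not by arbitrary invertible linear maps, which is the property the abstract contrasts with CCA.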
Exploring Urban Air Quality with MAPS: Mobile Air Pollution Sensing
Mobile and ubiquitous sensing of urban air quality (AQ) has received increased attention as an economically and operationally viable means to survey the atmospheric environment with high spatial-temporal resolution. A necessary and value-added step towards data-driven sustainable urban management is fine-granular AQ inference, which estimates grid-level pollutant concentrations at every instant of time using AQ data collected from fixed-location and mobile sensors. We present the Mobile Air Pollution Sensing (MAPS) framework, which consists of data preprocessing, urban feature extraction, and AQ inference. This is applied to a case study in Beijing (3,025 square km, 19 June - 16 July 2018), where PM2.5 concentrations measured by 28 fixed monitoring stations and 15 vehicles are fused to infer hourly PM2.5 concentrations in 3,025 1km-by-1km grids. Two machine learning structures, namely Deep Feature Spatial-Temporal Tree (DFeaST-Tree) and Deep Feature Spatial-Temporal Network (DFeaST-Net), are proposed to infer PM2.5 concentrations supported by 62 types of urban data that encompass geography, land use, traffic, public, and meteorology. This allows us to infer fine-granular PM2.5 concentrations based on sparse AQ measurements (less than 5% coverage) with good accuracy (SMAPE<15%, R-square>0.9), while accounting for the regional transport of air pollutants outside the study area. In-depth discussions are provided on the heterogeneity of fixed and mobile data sources, spatial coverage of mobile sensing, and importance of urban features for inferring PM2.5 concentrations.
A Distributed Method for Fitting Laplacian Regularized Stratified Models
Tuck, Jonathan, Barratt, Shane, Boyd, Stephen
Stratified models are models that depend in an arbitrary way on a set of selected categorical features, and depend linearly on the other features. In a basic and traditional formulation a separate model is fit for each value of the categorical feature, using only the data that has the specific categorical value. To this formulation we add Laplacian regularization, which encourages the model parameters for neighboring categorical values to be similar. Laplacian regularization allows us to specify one or more weighted graphs on the stratification feature values. For example, stratifying over the days of the week, we can specify that the Sunday model parameter should be close to the Saturday and Monday model parameters. The regularization improves the performance of the model over the traditional stratified model, since the model for each value of the categorical feature `borrows strength' from its neighbors. In particular, it produces a model even for categorical values that did not appear in the training data set. We propose an efficient distributed method for fitting stratified models, based on the alternating direction method of multipliers (ADMM). When the fitting loss functions are convex, the stratified model fitting problem is convex, and our method computes the global minimizer of the loss plus regularization; in other cases it computes a local minimizer. The method is very efficient, and naturally scales to large data sets or numbers of stratified feature values. We illustrate our method with a variety of examples.
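The days-of-the-week example can be sketched as a cycle graph whose Laplacian penalizes differences between the parameters of neighboring days. The code below shows only the penalty term, not the ADMM fitting method. The helper names, unit edge weights, and the omission of any constant factor in front of the penalty are assumptions.

```python
import numpy as np

def cycle_laplacian(k, weight=1.0):
    """Graph Laplacian of a k-node cycle (e.g. k = 7 for days of the week)."""
    W = np.zeros((k, k))
    for i in range(k):
        j = (i + 1) % k  # the last node wraps around to the first
        W[i, j] = W[j, i] = weight
    return np.diag(W.sum(axis=1)) - W

def laplacian_penalty(theta, L):
    """Sum over edges of W_ij * ||theta_i - theta_j||^2, computed as tr(theta^T L theta).

    theta has one row of model parameters per categorical value.
    """
    return float(np.trace(theta.T @ L @ theta))

L_week = cycle_laplacian(7)
rng = np.random.default_rng(0)
theta = rng.normal(size=(7, 3))  # hypothetical per-day parameter vectors
penalty = laplacian_penalty(theta, L_week)
```

The penalty is zero exactly when all connected days share the same parameters. A day with no training data contributes no loss, so its parameters are determined by the penalty alone and are pulled toward those of its neighbors.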
Structural modeling using overlapped group penalties for discovering predictive biomarkers for subgroup analysis
Ma, Chong, Deng, Wenxuan, Ma, Shuangge, Liu, Ray, Galinsky, Kevin
The identification of predictive biomarkers from a large number of covariates for subgroup analysis has attracted considerable attention in medical research. In this article, we propose a generalized penalized regression method with a novel penalty function that enforces the hierarchical structure between the prognostic and predictive effects, such that a nonzero predictive effect forces its ancestor prognostic effects to be nonzero in the model. Our method is able to select useful predictive biomarkers by yielding a sparse, interpretable, and predictable model for subgroup analysis, and can deal with different types of response variables such as continuous, categorical, and time-to-event data. We show that our method is asymptotically consistent under some regularity conditions. To fit the generalized penalized regression model, we propose a novel optimization algorithm that integrates majorization-minimization with the alternating direction method of multipliers, named \texttt{smog}. The simulation study and real case study demonstrate that our method is very powerful for discovering the true predictive biomarkers and identifying subgroups of patients.