Regression
Predicting Depressive Symptom Severity through Individuals' Nearby Bluetooth Devices Count Data Collected by Mobile Phones: A Preliminary Longitudinal Study
Zhang, Yuezhou, Folarin, Amos A, Sun, Shaoxiong, Cummins, Nicholas, Ranjan, Yatharth, Rashid, Zulqarnain, Conde, Pauline, Stewart, Callum, Laiou, Petroula, Matcham, Faith, Oetzmann, Carolin, Lamers, Femke, Siddi, Sara, Simblett, Sara, Rintala, Aki, Mohr, David C, Myin-Germeys, Inez, Wykes, Til, Haro, Josep Maria, Pennix, Brenda WJH, Narayan, Vaibhav A, Annas, Peter, Hotopf, Matthew, Dobson, Richard JB
The Bluetooth sensor embedded in mobile phones provides an unobtrusive, continuous, and cost-efficient means to capture individuals' proximity information, such as the nearby Bluetooth devices count (NBDC). The continuous NBDC data can partially reflect individuals' behaviors and status, such as social connections and interactions, working status, mobility, and social isolation and loneliness, which were found to be significantly associated with depression by previous survey-based studies. This paper aims to explore the NBDC data's value in predicting depressive symptom severity as measured via the 8-item Patient Health Questionnaire (PHQ-8). The data used in this paper included 2,886 bi-weekly PHQ-8 records collected from 316 participants recruited from three study sites in the Netherlands, Spain, and the UK as part of the EU RADAR-CNS study. From the NBDC data two weeks prior to each PHQ-8 score, we extracted 49 Bluetooth features, including statistical features and nonlinear features for measuring periodicity and regularity of individuals' life rhythms. Linear mixed-effect models were used to explore associations between Bluetooth features and the PHQ-8 score. We then applied hierarchical Bayesian linear regression models to predict the PHQ-8 score from the extracted Bluetooth features. A number of significant associations were found between Bluetooth features and depressive symptom severity. Compared with commonly used machine learning models, the proposed hierarchical Bayesian linear regression model achieved the best prediction metrics, with R² = 0.526 and a root mean squared error (RMSE) of 3.891. Bluetooth features can explain an extra 18.8% of the variance in the PHQ-8 score relative to the baseline model without Bluetooth features (R² = 0.338, RMSE = 4.547).
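For readers unfamiliar with the two metrics quoted in this abstract, the following is a minimal sketch of how R² and RMSE are conventionally computed, together with the arithmetic behind the reported 18.8% gain. The function names and the small score lists are illustrative assumptions and are not taken from the study.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Made-up scores for illustration only (not data from the study).
y_true = [4, 10, 15, 7, 12]
y_pred = [5, 9, 13, 8, 12]
print(r_squared(y_true, y_pred), rmse(y_true, y_pred))

# Extra variance explained, using the two R² values quoted in the abstract.
r2_gain = 0.526 - 0.338
print(round(r2_gain, 3))
```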
Establishing phone-pair co-usage by comparing mobility patterns
Bosma, Wauter, Dalm, Sander, van Eijk, Erwin, Harchaoui, Rachid el, Rijgersberg, Edwin, Tops, Hannah Tereza, Veenstra, Alle, Ypma, Rolf
In forensic investigations it is often of value to establish whether two phones were used by the same person during a given time period. We present a method that uses time and location of cell tower registrations of mobile phones to assess the strength of evidence that any pair of phones were used by the same person. The method is transparent as it uses logistic regression to discriminate between the hypotheses of same and different user, and a standard kernel density estimation to quantify the weight of evidence in terms of a likelihood ratio. We further add to previous theoretical work by training and validating our method on real world data, paving the way for application in practice. The method shows good performance under different modeling choices and robustness under lower quantity or quality of data. We discuss practical usage in court.
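The two-stage idea described in this abstract (a logistic regression score, followed by kernel density estimates that turn the score into a likelihood ratio) can be sketched roughly as follows. This is not the authors' implementation: the feature values, weights, bandwidth, and calibration scores are all hypothetical, and the Gaussian kernel is an assumption.

```python
import math

def logistic_score(features, weights, bias):
    """Logistic regression score in (0, 1) for one phone pair."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def gaussian_kde(samples, bandwidth):
    """Return a density function: an average of Gaussian bumps centred on the samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

# Hypothetical calibration scores from pairs with known ground truth.
same_user_scores = [0.81, 0.88, 0.92, 0.75, 0.95]
diff_user_scores = [0.10, 0.22, 0.05, 0.30, 0.15]

density_same = gaussian_kde(same_user_scores, bandwidth=0.1)
density_diff = gaussian_kde(diff_user_scores, bandwidth=0.1)

def likelihood_ratio(score):
    """Density of the score under 'same user' divided by its density under 'different user'."""
    return density_same(score) / density_diff(score)

# Hypothetical mobility-similarity features and weights for one phone pair.
score = logistic_score([0.5, 1.2], weights=[1.0, 1.0], bias=-0.5)
print(score, likelihood_ratio(score))
```

A ratio above 1 supports the same-user hypothesis and a ratio below 1 supports the different-user hypothesis. With so few calibration points the densities are crude, so a real application would need far more calibration data.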
Create A Machine Learning Model using GridDB
In this tutorial, we will build a trivial linear regression model with the data stored in GridDB. We will begin with GridDB's python-connector to insert and access the data. Afterwards, we will see how to retrieve and convert the data using pandas and numpy. Finally, we will train and visualize our regression model using scikit-learn and matplotlib. The following tutorial is carried out on the Ubuntu operating system (v.
Variational Inference in high-dimensional linear regression
Mukherjee, Sumit, Sen, Subhabrata
We study high-dimensional Bayesian linear regression with product priors. Using the nascent theory of non-linear large deviations (Chatterjee and Dembo, 2016), we derive sufficient conditions for the leading-order correctness of the naive mean-field approximation to the log-normalizing constant of the posterior distribution. Subsequently, assuming a true linear model for the observed data, we derive a limiting infinite dimensional variational formula for the log normalizing constant of the posterior. Furthermore, we establish that under an additional "separation" condition, the variational problem has a unique optimizer, and this optimizer governs the probabilistic properties of the posterior distribution. We provide intuitive sufficient conditions for the validity of this "separation" condition. Finally, we illustrate our results on concrete examples with specific design matrices.
Influence Based Defense Against Data Poisoning Attacks in Online Learning
Seetharaman, Sanjay, Malaviya, Shubham, KV, Rosni, Shukla, Manish, Lodha, Sachin
Data poisoning is a type of adversarial attack on training data in which an attacker manipulates a fraction of the data to degrade the performance of a machine learning model. Therefore, applications that rely on external data sources for training data are at a significantly higher risk. There are several known defensive mechanisms that can help in mitigating the threat from such attacks. For example, data sanitization is a popular defensive mechanism wherein the learner rejects those data points that are sufficiently far from the set of training instances. Prior work on data poisoning defense has primarily focused on the offline setting, wherein all the data is assumed to be available for analysis. Defensive measures for online learning, where data points arrive sequentially, have not garnered similar interest. In this work, we propose a defense mechanism to minimize the degradation caused by the poisoned training data on a learner's model in an online setup. Our proposed method utilizes an influence function, which is a classic technique in robust statistics. Further, we supplement it with the existing data sanitization methods for filtering out some of the poisoned data points. We study the effectiveness of our defense mechanism on multiple datasets and across multiple attack strategies against an online learner.
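The data sanitization step mentioned in this abstract can be illustrated with a simple distance-based filter. The sketch below assumes one common variant, in which incoming points are rejected when they lie farther than a fixed radius from the centroid of trusted data. The function names, the radius, and the sample points are hypothetical, and the influence-function component of the proposed defense is not shown.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def sanitize(trusted_points, incoming_points, radius):
    """Keep only the incoming points within `radius` of the trusted centroid."""
    c = centroid(trusted_points)
    return [p for p in incoming_points if math.dist(p, c) <= radius]

# Hypothetical trusted data and a small incoming batch with one outlier.
trusted = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
incoming = [(0.5, 0.4), (0.9, 1.1), (8.0, -7.0)]

kept = sanitize(trusted, incoming, radius=2.0)
print(kept)
```

In an online setting the centroid and radius would have to be maintained as new points arrive. This sketch keeps them fixed for brevity.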
Optimal Dynamic Regret in Exp-Concave Online Learning
We consider the problem of the Zinkevich (2003)-style dynamic regret minimization in online learning with exp-concave losses. We show that whenever improper learning is allowed, a Strongly Adaptive online learner achieves the dynamic regret of $\tilde O(d^{3.5}n^{1/3}C_n^{2/3} \vee d\log n)$ where $C_n$ is the total variation (a.k.a. path length) of an arbitrary sequence of comparators that may not be known to the learner ahead of time. Achieving this rate was highly nontrivial even for squared losses in 1D where the best known upper bound was $O(\sqrt{nC_n} \vee \log n)$ (Yuan and Lamperski, 2019). Our new proof techniques make elegant use of the intricate structures of the primal and dual variables imposed by the KKT conditions and could be of independent interest. Finally, we apply our results to the classical statistical problem of locally adaptive non-parametric regression (Mammen, 1991; Donoho and Johnstone, 1998) and obtain a stronger and more flexible algorithm that does not require any statistical assumptions or any hyperparameter tuning.
Grouped Feature Importance and Combined Features Effect Plot
Au, Quay, Herbinger, Julia, Stachl, Clemens, Bischl, Bernd, Casalicchio, Giuseppe
Interpretable machine learning has become a very active area of research due to the rising popularity of machine learning algorithms and their inherently challenging interpretability. Most work in this area has been focused on the interpretation of single features in a model. However, for researchers and practitioners, it is often equally important to quantify the importance or visualize the effect of feature groups. To address this research gap, we provide a comprehensive overview of how existing model-agnostic techniques can be defined for feature groups to assess the grouped feature importance, focusing on permutation-based, refitting, and Shapley-based methods. We also introduce an importance-based sequential procedure that identifies a stable and well-performing combination of features in the grouped feature space. Furthermore, we introduce the combined features effect plot, which is a technique to visualize the effect of a group of features based on a sparse, interpretable linear combination of features. We used simulation studies and a real data example from computational psychology to analyze, compare, and discuss these methods.
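One of the permutation-based ideas surveyed in this abstract can be sketched as follows: the columns of a feature group are permuted jointly across rows, and the resulting increase in prediction error is taken as the group's importance. This is a minimal illustration and not the authors' code. The fixed linear predictor, the synthetic data, and the choice of mean squared error as the loss are all assumptions.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def grouped_permutation_importance(predict, X, y, group, n_repeats=20, seed=0):
    """Mean increase in MSE when the columns in `group` are permuted jointly across rows."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    increases = []
    for _ in range(n_repeats):
        order = list(range(len(X)))
        rng.shuffle(order)
        X_perm = []
        for i, row in enumerate(X):
            new_row = list(row)
            for j in group:
                # Every column in the group is taken from the same source row,
                # so the dependence structure within the group is preserved.
                new_row[j] = X[order[i]][j]
            X_perm.append(new_row)
        increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
    return sum(increases) / n_repeats

def predict(row):
    """Hypothetical fitted model that uses only the first two features."""
    return 2.0 * row[0] + 1.0 * row[1]

# Synthetic data: four standard-normal features, target generated by `predict`.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [predict(row) for row in X]

imp_signal = grouped_permutation_importance(predict, X, y, group=[0, 1])
imp_noise = grouped_permutation_importance(predict, X, y, group=[2, 3])
print(imp_signal, imp_noise)
```

Because the hypothetical model ignores features 2 and 3, permuting that group leaves the error unchanged, while permuting the group of features 0 and 1 increases it.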
Feature Inference Attack on Model Predictions in Vertical Federated Learning
Luo, Xinjian, Wu, Yuncheng, Xiao, Xiaokui, Ooi, Beng Chin
Federated learning (FL) is an emerging paradigm for facilitating multiple organizations' data collaboration without revealing their private data to each other. Recently, vertical FL, where the participating organizations hold the same set of samples but with disjoint features and only one organization owns the labels, has received increased attention. This paper presents several feature inference attack methods to investigate the potential privacy leakages in the model prediction stage of vertical FL. The attack methods consider the most stringent setting, in which the adversary controls only the trained vertical FL model and the model predictions, relying on no background information. We first propose two specific attacks on the logistic regression (LR) and decision tree (DT) models, according to individual prediction output. We further design a general attack method based on multiple prediction outputs accumulated by the adversary to handle complex models, such as neural networks (NN) and random forest (RF) models. Experimental evaluations demonstrate the effectiveness of the proposed attacks and highlight the need for designing private mechanisms to protect the prediction outputs in vertical FL.
Certifiably Polynomial Algorithm for Best Group Subset Selection
Zhang, Yanhang, Zhu, Junxian, Zhu, Jin, Wang, Xueqin
Best group subset selection aims to choose a small number of non-overlapping groups that achieve the best interpretability on the response variable. It is practically attractive for group variable selection; however, due to its computational intractability in high-dimensional settings, it has not received enough attention. To fill the gap in efficient algorithms for best group subset selection, in this paper, we propose a group-splicing algorithm that iteratively detects effective groups and excludes the unhelpful ones. Moreover, coupled with a novel Bayesian group information criterion, an adaptive algorithm is developed to determine the true group subset size. It is certifiable that our algorithms identify the optimal group subset in polynomial time under mild conditions. We demonstrate the efficiency and accuracy of our proposal by comparing it with state-of-the-art algorithms on both synthetic and real-world datasets.
Robust Kernel-based Distribution Regression
Yu, Zhan, Ho, Daniel W. C., Zhou, Ding-Xuan
Regularization schemes for regression have been widely studied in learning theory and inverse problems. In this paper, we study distribution regression (DR), which involves two stages of sampling and aims at regressing from probability measures to real-valued responses over a reproducing kernel Hilbert space (RKHS). Recently, theoretical analysis on DR has been carried out via kernel ridge regression and several learning behaviors have been observed. However, the topic has not been explored and understood beyond least-squares-based DR. By introducing a robust loss function $l_{\sigma}$ for two-stage sampling problems, we present a novel robust distribution regression (RDR) scheme. With a windowing function $V$ and a scaling parameter $\sigma$ which can be appropriately chosen, $l_{\sigma}$ can include a wide range of commonly used loss functions that enrich the theme of DR. Moreover, the loss $l_{\sigma}$ is not necessarily convex, hence largely improving the former regression class (least squares) in the literature of DR. The learning rates under different regularity ranges of the regression function $f_{\rho}$ are comprehensively studied and derived via integral operator techniques. The scaling parameter $\sigma$ is shown to be crucial in providing robustness and satisfactory learning rates of RDR.