Regression
Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms
Jia, Ruoxi, Dao, David, Wang, Boxin, Hubis, Frances Ann, Gurel, Nezihe Merve, Li, Bo, Zhang, Ce, Spanos, Costas J., Song, Dawn
Given a data set $\mathcal{D}$ containing millions of data points and a data consumer who is willing to pay \$$X$ to train a machine learning (ML) model over $\mathcal{D}$, how should we distribute this \$$X$ to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all $N$ data points, it requires $O(2^N)$ model evaluations for exact computation and $O(N\log N)$ for $(\epsilon, \delta)$-approximation. In this paper, we focus on one popular family of ML models relying on $K$-nearest neighbors ($K$NN). The most surprising result is that for unweighted $K$NN classifiers and regressors, the Shapley value of all $N$ data points can be computed, exactly, in $O(N\log N)$ time -- an exponential improvement in computational complexity! Moreover, for $(\epsilon, \delta)$-approximation, we are able to develop an algorithm based on Locality Sensitive Hashing (LSH) with only sublinear complexity $O(N^{h(\epsilon,K)}\log N)$ when $\epsilon$ is not too small and $K$ is not too large. We empirically evaluate our algorithms on up to $10$ million data points, and even our exact algorithm is up to three orders of magnitude faster than the baseline approximation algorithm. The LSH-based approximation algorithm can accelerate the value calculation process even further. We then extend our algorithms to other scenarios such as (1) weighted $K$NN classifiers, (2) settings where different data points are clustered by different data curators, and (3) settings where data analysts providing computation also require proper valuation.
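The exact $O(N\log N)$ computation for an unweighted $K$NN classifier can be illustrated for a single test point. The sketch below assumes the recursion usually attributed to this paper: sort the training points by distance to the test point, assign the farthest point $\mathbb{1}[y_{\alpha_N} = y_{\text{test}}]/N$, then move inward with $s_{\alpha_i} = s_{\alpha_{i+1}} + \frac{\mathbb{1}[y_{\alpha_i} = y_{\text{test}}] - \mathbb{1}[y_{\alpha_{i+1}} = y_{\text{test}}]}{K}\cdot\frac{\min(K, i)}{i}$. The abstract does not state this formula, so treat it as an assumption. The function name, the 1-D features and the absolute-difference distance are illustrative choices only.

```python
def knn_shapley_single_test(train_x, train_y, test_x, test_y, k):
    """Shapley value of each training point for one test point.

    Assumes an unweighted KNN classifier, 1-D features and absolute
    distance (illustrative assumptions, not taken from the abstract).
    """
    n = len(train_x)
    if n == 0:
        return []
    # Sort training indices from nearest to farthest: the O(N log N) step.
    order = sorted(range(n), key=lambda i: abs(train_x[i] - test_x))
    # match[r] is 1.0 when the r-th nearest point carries the test label.
    match = [1.0 if train_y[i] == test_y else 0.0 for i in order]

    values = [0.0] * n
    # Base case: the farthest point.
    values[order[-1]] = match[-1] / n
    # Walk inward from the second-farthest point to the nearest one.
    for r in range(n - 2, -1, -1):
        rank = r + 1  # 1-based rank of the current point
        step = (match[r] - match[r + 1]) / k * min(k, rank) / rank
        values[order[r]] = values[order[r + 1]] + step
    return values


values = knn_shapley_single_test(
    train_x=[0.1, 0.2, 0.3, 0.4],
    train_y=[1, 0, 1, 1],
    test_x=0.0,
    test_y=1,
    k=2,
)
print(values)
```

In this toy example the values sum to 0.5, which is the utility of the full training set (one of the two nearest neighbors carries the test label). For several test points, the per-test-point values would be averaged.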
Maize Yield and Nitrate Loss Prediction with Machine Learning Algorithms
Shahhosseini, Mohsen, Martinez-Feria, Rafael A., Hu, Guiping, Archontoulis, Sotirios V.
Pre-season prediction of crop production outcomes such as grain yields and N losses can provide insights to stakeholders when making decisions. Simulation models can assist in scenario planning, but their use is limited because of data requirements and long run times. Thus, there is a need for more computationally expedient approaches to scale up predictions. We evaluated the potential of five machine learning (ML) algorithms as meta-models for a cropping systems simulator (APSIM) to inform future decision-support tool development. We asked: 1) How well do ML meta-models predict maize yield and N losses using pre-season information? 2) How much data is needed to train ML algorithms to achieve acceptable predictions? 3) Which input data variables are most important for accurate prediction? and 4) Do ensembles of ML meta-models improve prediction? The simulated dataset included more than 3 million genotype, environment and management scenarios. Random forests most accurately predicted maize yield and N loss at planting time, with an RRMSE of 14% and 55%, respectively. ML meta-models reasonably reproduced simulated maize yields but not N loss. They also differed in their sensitivities to the size of the training dataset. Across all ML models, yield prediction error decreased by 10-40% as the training dataset increased from 0.5 to 1.8 million data points, whereas N loss prediction error showed no consistent pattern. ML models also differed in their sensitivities to input variables. Averaged across all ML models, weather conditions, soil properties, management information and initial conditions were roughly equally important when predicting yields. Modest prediction improvements resulted from ML ensembles. These results can help accelerate progress in coupling simulation models and ML toward developing dynamic decision support tools for pre-season management.
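The RRMSE figures quoted above can be made concrete with a small calculation. The sketch below assumes the common definition of relative RMSE, namely RMSE divided by the mean of the observed values and expressed as a percentage. The abstract does not spell out its definition, and the function name and numbers are hypothetical.

```python
import math


def rrmse_percent(observed, predicted):
    """Relative RMSE in percent: 100 * RMSE / mean(observed) (assumed definition)."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equally long")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    mean_observed = sum(observed) / len(observed)
    return 100.0 * math.sqrt(mse) / mean_observed


# Hypothetical yields, for illustration only.
print(round(rrmse_percent([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]), 2))
```

Because the error is scaled by the mean of the target, RRMSE lets quantities on different scales, such as yield and N loss, be compared on a common footing.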
Compliance Change Tracking in Business Process Services
Tamilselvam, Srikanth G, Gupta, Ankush, Agarwal, Arvind
Regulatory compliance is an organization's adherence to laws, regulations, guidelines and specifications relevant to its business. Compliance officers responsible for maintaining adherence constantly struggle to keep up with the large number of changes in regulatory requirements. Keeping up with the changes entails two main tasks: fetching the regulatory announcements that actually contain changes of interest, and incorporating those changes in the business process. In this paper we focus on the first task and present a Compliance Change Tracking System that gathers regulatory announcements from government sites, news sites, and email subscriptions, and classifies their importance (i.e., Actionability) through a hierarchical classifier and their business process applicability through a multi-class classifier. We compare standard classifiers (Naïve Bayes, logistic regression, etc.), a hierarchical classification method, a rule-based approach, and a hybrid approach with various preprocessing and feature selection methods, and show that despite the richness of other models, a simple hierarchical classification with bag-of-words features works best for the Actionability classifier and multi-class logistic regression works best for the Applicability classifier. The system has been deployed in global delivery centers and has received positive feedback from payroll compliance officers. Organizations are faced with rapidly changing regulatory policies and an ever-growing number of regulations.
Alliances and Conflict, or Conflict and Alliances? Appraising the Causal Effect of Alliances on Conflict
The deterrent effect of military alliances is well documented and widely accepted. However, such work has typically assumed that alliances are exogenous. This is problematic as alliances may simultaneously influence the probability of conflict and be influenced by the probability of conflict. Failing to account for such endogeneity produces overly simplistic theories of alliance politics and creates barriers to identifying the causal effect of alliances on conflict. In this manuscript, I propose a solution to this theoretical and empirical modeling challenge. Synthesizing theories of alliance formation and the alliance-conflict relationship, I develop an endogenous theory of alliances and conflict. I then test this theory using generalized joint regression models that allow me to endogenize alliance formation on the causal path to conflict. Having done so, I find that alliances neither deter nor provoke aggression. This has significant implications for our understanding of interstate conflict and alliance politics.
Variable selection using LASSO
This is a Lasso; it is used to pick and capture animals. As a non-native English speaker, my first exposure to this word was in supervised learning. In this LASSO data science tutorial, we discuss the strengths of Lasso logistic regression by stepping through how to apply this useful statistical method to classification problems in R, and how the Lasso can "similarly" be used to pick and select the input variables that are relevant to the classification problem at hand. Data analysts and data scientists use different regression methods for different kinds of analytics problems. One of the most talked-about methods is the Lasso.
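To show how the Lasso "picks" variables, here is a minimal sketch of L1-penalized logistic regression fitted by proximal gradient descent. The tutorial itself works in R; this Python version is only an illustration. The function names, the toy data, the fixed step size and the absence of an intercept are all assumptions made for brevity.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lasso_logistic(rows, labels, lam, step=0.1, iterations=500):
    """Minimize mean logistic loss + lam * sum(|w_j|) with proximal gradient steps."""
    n, d = len(rows), len(rows[0])
    weights = [0.0] * d
    for _ in range(iterations):
        probs = [sigmoid(sum(w * x for w, x in zip(weights, row))) for row in rows]
        grads = [
            sum((p - y) * row[j] for p, y, row in zip(probs, labels, rows)) / n
            for j in range(d)
        ]
        # Gradient step on the smooth loss, then soft-thresholding for the L1 term.
        weights = [soft_threshold(w - step * g, step * lam) for w, g in zip(weights, grads)]
    return weights


# Toy data: the first feature tracks the label, the second is small noise.
rows = [[-2.0, 0.3], [-1.0, -0.2], [1.0, 0.1], [2.0, -0.3]]
labels = [0, 0, 1, 1]
print(lasso_logistic(rows, labels, lam=0.05))
```

The soft-thresholding step is what sets coefficients to exactly zero, and a zero coefficient means the variable is dropped. With a large enough penalty every coefficient is zero, and lowering the penalty lets variables enter the model one by one.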
Modeling Time to Open of Emails with a Latent State for User Engagement Level
Sinha, Moumita, Vinay, Vishwa, Singh, Harvineet
Email messages have been an important mode of communication, not only for work, but also for social interactions and marketing. When messages have time sensitive information, it becomes relevant for the sender to know what is the expected time within which the email will be read by the recipient. In this paper we use a survival analysis framework to predict the time to open an email once it has been received. We use the Cox Proportional Hazards (CoxPH) model that offers a way to combine various features that might affect the event of opening an email. As an extension, we also apply a mixture model (MM) approach to CoxPH that distinguishes between recipients, based on a latent state of how prone to opening the messages each individual is. We compare our approach with standard classification and regression models. While the classification model provides predictions on the likelihood of an email being opened, the regression model provides prediction of the real-valued time to open. The use of survival analysis based methods allows us to jointly model both the open event as well as the time-to-open. We experimented on a large real-world dataset of marketing emails sent in a 3-month time duration. The mixture model achieves the best accuracy on our data where a high proportion of email messages go unopened.
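To make the survival-analysis framing concrete, the sketch below evaluates the standard Cox partial log-likelihood (no tied event times assumed), treating emails that were never opened as censored observations. It is not the authors' implementation and does not cover the mixture-model extension. The function name, argument names and toy data are hypothetical.

```python
import math


def cox_partial_log_likelihood(times, opened, features, beta):
    """Cox partial log-likelihood, assuming no tied event times.

    times[i]    : time to open, or the observation cutoff if never opened
    opened[i]   : 1 if the email was opened (event), 0 if censored
    features[i] : list of covariates for email i
    beta        : list of coefficients, one per covariate
    """
    scores = [sum(b * x for b, x in zip(beta, row)) for row in features]
    total = 0.0
    for i, (t_i, event) in enumerate(zip(times, opened)):
        if not event:
            continue  # censored emails only appear in risk sets
        # Risk set: emails still unopened just before time t_i.
        risk = sum(math.exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total += scores[i] - math.log(risk)
    return total


# Toy data: three emails, the second one never opened within the window.
times = [1.0, 2.0, 3.0]
opened = [1, 0, 1]
features = [[0.5], [1.0], [-1.0]]
print(cox_partial_log_likelihood(times, opened, features, beta=[0.0]))
```

Unopened emails still contribute through the risk sets, which is how a single model can account for both whether an email is opened and when.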
Towards automated symptoms assessment in mental health
Activity and motion analysis has the potential to be used as a diagnostic tool for mental disorders. However, to date, little work has been performed in turning stratification measures of activity into useful symptom markers. The research presented in this thesis has focused on the identification of objective activity and behaviour metrics that could be useful for the analysis of mental health symptoms in the above-mentioned dimensions. Particular attention is given to the analysis of objective differences between disorders, as well as identification of clinical episodes of mania and depression in bipolar patients, and deterioration in borderline personality disorder patients. A principled framework is proposed for mHealth monitoring of psychiatric patients, based on measurable changes in behaviour, represented in physical activity time series, collected via mobile and wearable devices. The framework defines methods for direct computational analysis of symptoms in disorganisation and psychomotor dimensions, as well as measures for indirect assessment of mood, using patterns of physical activity, sleep and circadian rhythms. The approach of computational behaviour analysis, proposed in this thesis, has the potential for early identification of clinical deterioration in ambulatory patients, and allows for the specification of distinct and measurable behavioural phenotypes, thus enabling better understanding and treatment of mental disorders.
Uplift Modeling for Multiple Treatments with Cost Optimization
Uplift modeling is an emerging machine learning approach for estimating the treatment effect at an individual or subgroup level. It can be used for optimizing the performance of interventions such as marketing campaigns and product designs. Uplift modeling can be used to estimate which users are likely to benefit from a treatment and then prioritize delivering or promoting the preferred experience to those users. An important but so far neglected use case for uplift modeling is an experiment with multiple treatment groups that have different costs, for example when different communication channels and promotion types are tested simultaneously. In this paper, we extend standard uplift models to support multiple treatment groups with different costs. We evaluate the performance of the proposed models using both synthetic and real data. We also describe a production implementation of the approach. Uplift modeling [1]-[8] is a technique to estimate and predict the individual-level or subgroup-level causal effects of different treatments in an experiment. This type of information is useful for designing and offering a personalized experience to improve user experience, satisfaction, and engagement. Uplift modeling is therefore commonly used in areas such as marketing, customer service, and product offering. It is helpful to think about uplift modeling in the context of randomized experiments (also known as A/B testing [9]-[11]). In a typical experiment, users are randomly assigned to each treatment group and causal effects are then estimated for the population.
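The idea of comparing several treatments with different costs can be sketched with a population-level calculation. The example below is not the paper's model: it assumes a randomized experiment, estimates each treatment's uplift as a simple difference in conversion rates against the control group, and subtracts a per-user cost. All names, values and costs are hypothetical.

```python
from collections import defaultdict


def net_uplift_by_treatment(records, value_per_conversion, cost_per_user, control="control"):
    """Net value per user of each treatment relative to control.

    records       : iterable of (group_name, converted) pairs, converted in {0, 1}
    cost_per_user : dict mapping treatment name to its cost per treated user
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [conversions, users]
    for group, converted in records:
        totals[group][0] += converted
        totals[group][1] += 1
    if control not in totals:
        raise ValueError("no control observations found")
    control_rate = totals[control][0] / totals[control][1]

    net = {}
    for group, (conversions, users) in totals.items():
        if group == control:
            continue
        uplift = conversions / users - control_rate  # difference in means
        net[group] = uplift * value_per_conversion - cost_per_user[group]
    return net


# Hypothetical experiment with two treatments of different cost.
records = (
    [("control", c) for c in (0, 0, 1, 0)]
    + [("email", c) for c in (1, 0, 1, 0)]
    + [("call", c) for c in (1, 1, 1, 0)]
)
net = net_uplift_by_treatment(records, 10.0, {"email": 0.5, "call": 6.0})
print(net, max(net, key=net.get))
```

In this toy data the phone call has the larger raw uplift but the email has the higher net value once costs are included, which is the kind of trade-off the abstract describes. An individual-level version would replace the group means with model predictions conditioned on user features.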
Predicting Eating Events in Free Living Individuals -- A Technical Report
Wang, Jiayi, Yang, Jiue-An, Nakandala, Supun, Kumar, Arun, Jankowska, Marta M.
This technical report records experiments applying multiple machine learning algorithms to predict the eating and food purchasing behaviors of free-living individuals. Data were collected with accelerometers, global positioning system (GPS) devices, and body-worn cameras called SenseCam over a one-week period in 81 individuals from a variety of ages and demographic backgrounds. These data were turned into minute-level features from sensors as well as engineered features that included time (e.g., time since last eating) and environmental context (e.g., distance to nearest grocery store). Algorithms include Logistic Regression, RBF-SVM, Random Forest, and Gradient Boosting. Our results show that the Gradient Boosting model has the highest mean accuracy score (0.7289) for predicting eating events 0 to 4 minutes in advance. For predicting food purchasing events, the RBF-SVM model (0.7395) outperforms the others. For both prediction models, temporal and spatial features were important contributors to predicting eating and food purchasing events.
Towards Linearization Machine Learning Algorithms
This paper is about a machine learning approach based on the multilinear projection of an unknown function (or probability distribution) to be estimated towards a linear (or multilinear) dimensional space E'. The proposal transforms the problem of predicting the target of an observation x into a problem of determining a consensus among the k nearest neighbors of x's image within the dimensional space E'. The algorithms that concretize it allow both regression and binary classification. Implementations carried out using Scala/Spark and assessed on a dozen LIBSVM datasets have demonstrated improvements in prediction accuracies in comparison with other prediction algorithms implemented within Spark MLLib such as multilayer perceptrons, logistic regression classifiers and random forests.