Accuracy
Modeling Stated Preference for Mobility-on-Demand Transit: A Comparison of Machine Learning and Logit Models
Zhao, Xilei, Yan, Xiang, Yu, Alan, Van Hentenryck, Pascal
Logit models are usually applied when studying individual travel behavior, i.e., to predict travel mode choice and to gain behavioral insights on traveler preferences. Recently, some studies have applied machine learning to model travel mode choice and reported higher out-of-sample prediction accuracy than conventional logit models (e.g., multinomial logit). However, there has not been a comprehensive comparison between logit models and machine learning that covers both prediction and behavioral analysis. This paper aims to address this gap by examining the key differences in model development, evaluation, and behavioral interpretation between logit and machine-learning models for travel-mode choice modeling. To complement the theoretical discussions, we also empirically evaluated the two approaches on stated-preference survey data for a new type of transit system integrating high-frequency fixed routes and micro-transit. The results show that machine-learning models can produce significantly higher predictive accuracy than logit models and are better at capturing the nonlinear relationships between trip attributes and mode-choice outcomes. On the other hand, compared to the multinomial logit model, the best-performing machine-learning model, the random forest model, produces less reasonable behavioral outputs (i.e., marginal effects and elasticities) when these are computed with a standard approach. By introducing some behavioral constraints into the computation of behavioral outputs from a random forest model, however, we obtained better results that are somewhat comparable with the multinomial logit model. We believe that there is great potential in merging ideas from machine learning and conventional statistical methods to develop refined models for travel-behavior research and suggest some possible research directions.
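As background for the logit side of this comparison, the following is a minimal sketch of how a multinomial logit model maps utilities to choice probabilities, together with the textbook own-attribute marginal effect for a linear utility. The coefficients, mode names, and trip attributes are hypothetical values chosen for illustration; they are not taken from the paper's survey data or its estimated model.

```python
import math

def mnl_probabilities(utilities):
    """Multinomial logit: P(i) = exp(V_i) / sum_j exp(V_j)."""
    v_max = max(utilities.values())  # subtract the max for numerical stability
    exps = {mode: math.exp(v - v_max) for mode, v in utilities.items()}
    total = sum(exps.values())
    return {mode: e / total for mode, e in exps.items()}

# Hypothetical coefficients (assumptions, not estimates from the paper).
BETA_TIME = -0.05  # utility per minute of travel time
BETA_COST = -0.20  # utility per dollar of cost

# Hypothetical trip attributes for two modes.
modes = {
    "car": {"time": 20.0, "cost": 4.0},
    "transit": {"time": 35.0, "cost": 2.0},
}

# Linear systematic utility V_i = beta_time * time_i + beta_cost * cost_i.
utilities = {
    mode: BETA_TIME * attrs["time"] + BETA_COST * attrs["cost"]
    for mode, attrs in modes.items()
}
probs = mnl_probabilities(utilities)

# Own marginal effect of car travel time on the car probability:
# dP_car / dtime_car = beta_time * P_car * (1 - P_car).
car_time_effect = BETA_TIME * probs["car"] * (1.0 - probs["car"])

print(probs, car_time_effect)
```

Because the marginal effect has a closed form in the coefficients, its sign always follows the sign of the coefficient, which is one reason logit outputs are easy to interpret behaviorally.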
Predicting Demographics, Moral Foundations, and Human Values from Digital Behaviors
Kalimeri, Kyriaki, Beiro, Mariano G., Delfino, Matteo, Raleigh, Robert, Cattuto, Ciro
Personal electronic devices such as smartphones give access to a broad range of behavioral signals that can be used to learn about the characteristics and preferences of individuals. In this study we explore the connection between demographic and psychological attributes and digital records for a cohort of 7,633 people, closely representative of the US population with respect to gender, age, geographical distribution, education, and income. We collected self-reported assessments on validated psychometric questionnaires based on both the Moral Foundations and Basic Human Values theories, and combined this information with passively-collected multi-modal digital data from web browsing behavior, smartphone usage and demographic data. Then, we designed a machine learning framework to infer both the demographic and psychological attributes from the behavioral data. In a cross-validated setting, our model is found to predict demographic attributes with good accuracy (weighted AUC scores of 0.90 for gender, 0.71 for age, 0.74 for ethnicity). Our weighted AUC scores for Moral Foundation attributes (0.66) and Human Values attributes (0.60) suggest that accurate prediction of complex psychometric attributes is more challenging but feasible. This connection might prove useful for designing personalized services, communication strategies, and interventions, and can be used to sketch a portrait of people with similar worldviews.
Effective Learning of Probabilistic Models for Clinical Predictions from Longitudinal Data
Such information includes: the databases in modern hospital systems, usually known as Electronic Health Records (EHR), which store the patients' diagnoses, medication, laboratory test results, medical image data, etc.; information on various health behaviors tracked and stored by wearable devices, ubiquitous sensors and mobile applications, such as the smoking status, alcoholism history, exercise level, sleeping conditions, etc.; information collected by census or various surveys regarding sociodemographic factors of the target cohort; and information on people's mental health inferred from their social media activities or social networks such as Twitter, Facebook, etc. These health-related data come from heterogeneous sources and describe assorted aspects of an individual's health conditions. Such data are rich in structure and information, with great research potential for revealing unknown medical knowledge about genomic epidemiology, disease developments and correlations, drug discoveries, medical diagnosis, mental illness prevention, health behavior adaptation, etc. In real-world problems, the number of features relating to a certain health condition could grow exponentially with the development of new information techniques for collecting and measuring data. Revealing the causal influence between various factors and a certain disease, or discovering the correlations among diseases from data at such a tremendous scale, requires the assistance of advanced information technology such as data mining, machine learning, text mining, etc. Machine learning technology provides a way to learn not only the qualitative relationships among features and patients, but also the quantitative parameters describing the strength of such correlations.
Efficient Online Hyperparameter Optimization for Kernel Ridge Regression with Applications to Traffic Time Series Prediction
Zhan, Hongyuan, Gomes, Gabriel, Li, Xiaoye S., Madduri, Kamesh, Wu, Kesheng
Modern sensors generate large amounts of timestamped measurement data. These data sets are critical in a wide range of applications including traffic flow prediction, transportation management, GPS navigation, and city planning. Machine learning-based prediction algorithms typically adjust their parameters automatically based on the data, but also require users to set additional parameters, known as hyperparameters. For example, in a kernel-based regression model, the (ordinary) parameters are the regression weights, whereas the hyperparameters include the kernel scales and regularization constants. These hyperparameters have a strong influence on the prediction accuracy. Often, their values are set based on past experience or through time-consuming grid searches. In applications where the characteristics of the data change, such as unusual traffic patterns due to upcoming concert events, these hyperparameters have to be adjusted dynamically in order to maintain prediction quality. In this paper, we use the terms hyperparameter learning, hyperparameter optimization, and hyperparameter selection/tuning interchangeably, referring to the process of configuring the model specification before model fitting.
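To make the parameter/hyperparameter distinction concrete, here is a minimal sketch of kernel ridge regression with a Gaussian kernel, tuned by the kind of grid search the abstract describes as time-consuming. It is the offline baseline, not the paper's online method. The synthetic sine data, the grid values, and all function names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, scale):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * scale**2))

def fit_krr(X, y, scale, reg):
    """Ordinary parameters: dual weights alpha solving (K + reg * I) alpha = y."""
    K = rbf_kernel(X, X, scale)
    return np.linalg.solve(K + reg * np.eye(len(X)), y)

def predict_krr(X_train, alpha, X_new, scale):
    """Predictions are kernel-weighted sums of the fitted dual weights."""
    return rbf_kernel(X_new, X_train, scale) @ alpha

# Synthetic series (an assumption, not traffic data): a noisy sine curve.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 6.0, 60)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)
X_train, y_train = X[::2], y[::2]    # even-indexed points for fitting
X_val, y_val = X[1::2], y[1::2]      # odd-indexed points for validation

# Hyperparameters: kernel scale and regularization constant, chosen by grid search.
errors = {}
for scale in (0.1, 1.0, 10.0):
    for reg in (1e-3, 1e-1, 10.0):
        alpha = fit_krr(X_train, y_train, scale, reg)
        pred = predict_krr(X_train, alpha, X_val, scale)
        errors[(scale, reg)] = float(np.mean((pred - y_val) ** 2))

best = min(errors, key=errors.get)
print("best (scale, reg):", best, "validation MSE:", errors[best])
```

Each grid point requires a full refit, so the cost grows with the product of the grid sizes; this is the expense that motivates adjusting hyperparameters dynamically when the data change.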
A Neural Network Framework for Fair Classifier
Machine learning models are being used extensively in decision making, especially for prediction tasks. These models could be biased or unfair towards a specific sensitive group, defined for example by race, gender or age. Researchers have put effort into characterizing particular definitions of fairness and enforcing them in the models. In this work, we are mainly concerned with the following three definitions: Disparate Impact, Demographic Parity and Equalized Odds. Researchers have shown that Equalized Odds cannot be satisfied in calibrated classifiers unless the classifier is perfect. Hence the primary challenge is to ensure a degree of fairness while guaranteeing as much accuracy as possible. Fairness constraints are complex and need not be convex. Incorporating them into a machine learning algorithm is a significant challenge. Hence, many researchers have tried to come up with a surrogate loss which is convex in order to build fair classifiers. Besides, certain papers try to build fair representations by preprocessing the data, irrespective of the classifier used. Such methods not only require many unrealistic assumptions but also require human-engineered analytical solutions to build a machine learning model. We instead propose an automated solution which is generalizable over any fairness constraint. We use a neural network which is trained on batches and directly enforces the fairness constraint as the loss function without modifying it further. We have also experimented with other complex performance measures such as H-mean loss, Q-mean loss, and F-measure, without the need for any surrogate loss functions. Our experiments show that the network achieves performance similar to the state of the art. Thus, one can simply plug in an appropriate loss function for the required fairness constraint and performance measure of the classifier and train a neural network to achieve it.
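The three fairness notions named in this abstract can be measured directly from a classifier's predictions. The sketch below uses common formulations: the demographic parity gap as the absolute difference in positive prediction rates, the disparate impact ratio as the smaller rate divided by the larger, and the equalized odds gaps as the differences in true and false positive rates. The tiny data set and the helper names are hypothetical, and the paper may define its constraints differently.

```python
def rate(values):
    """Fraction of ones in a list of 0/1 values (0.0 for an empty list)."""
    return sum(values) / len(values) if values else 0.0

def positive_rate(y_pred, group, g):
    """P(prediction = 1 | group = g)."""
    return rate([p for p, a in zip(y_pred, group) if a == g])

def conditional_rate(y_true, y_pred, group, g, label):
    """P(prediction = 1 | group = g, true label = label)."""
    return rate([p for t, p, a in zip(y_true, y_pred, group) if a == g and t == label])

# Hypothetical labels, predictions, and a binary sensitive attribute.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

r0 = positive_rate(y_pred, group, 0)
r1 = positive_rate(y_pred, group, 1)

# Demographic parity: positive prediction rates should match across groups.
dp_gap = abs(r0 - r1)

# Disparate impact: ratio of the lower positive rate to the higher one.
di_ratio = min(r0, r1) / max(r0, r1)

# Equalized odds: true and false positive rates should match across groups.
tpr_gap = abs(conditional_rate(y_true, y_pred, group, 0, 1)
              - conditional_rate(y_true, y_pred, group, 1, 1))
fpr_gap = abs(conditional_rate(y_true, y_pred, group, 0, 0)
              - conditional_rate(y_true, y_pred, group, 1, 0))

print(dp_gap, di_ratio, tpr_gap, fpr_gap)
```

These quantities are built from counts over thresholded predictions, so they are not differentiable as written; a gradient-trained network would need them expressed over its continuous outputs on each batch.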
META-DES.H: a dynamic ensemble selection technique using meta-learning and a dynamic weighting approach
Cruz, Rafael M. O., Sabourin, Robert, Cavalcanti, George D. C.
In Dynamic Ensemble Selection (DES) techniques, only the most competent classifiers are selected to classify a given query sample. Hence, the key issue in DES is how to estimate the competence of each classifier in a pool to select the most competent ones. In order to deal with this issue, we proposed a novel dynamic ensemble selection framework using meta-learning, called META-DES. The framework is divided into three steps. In the first step, the pool of classifiers is generated from the training data. In the second step, the meta-features are computed using the training data and used to train a meta-classifier that is able to predict whether or not a base classifier from the pool is competent enough to classify an input instance. In this paper, we propose improvements to the training and generalization phases of the META-DES framework. In the training phase, we evaluate four different algorithms for the training of the meta-classifier. For the generalization phase, three combination approaches are evaluated: Dynamic selection, where only the classifiers that attain a certain competence level are selected; Dynamic weighting, where the meta-classifier estimates the competence of each classifier in the pool, and the outputs of all classifiers in the pool are weighted based on their level of competence; and a hybrid approach, in which first an ensemble with the most competent classifiers is selected, after which the weights of the selected classifiers are estimated in order to be used in a weighted majority voting scheme. Experiments are carried out on 30 classification datasets. Experimental results demonstrate that the changes proposed in this paper significantly improve the recognition accuracy of the system on several datasets.
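The three combination approaches described above differ only in how competence estimates are turned into votes. The sketch below illustrates that difference for a single query sample. The competence values stand in for the output of a meta-classifier and are hypothetical, as are the threshold of 0.5, the fallback to the whole pool, and the function names; none of these details come from the paper.

```python
from collections import defaultdict

def weighted_vote(votes, weights):
    """Return the label with the largest total weight."""
    scores = defaultdict(float)
    for label, weight in zip(votes, weights):
        scores[label] += weight
    return max(scores, key=scores.get)

def select(votes, competences, threshold):
    """Keep classifiers at or above the threshold (whole pool if none qualify)."""
    kept = [(v, c) for v, c in zip(votes, competences) if c >= threshold]
    return kept or list(zip(votes, competences))

def dynamic_selection(votes, competences, threshold=0.5):
    """Majority vote among the selected classifiers only."""
    kept = select(votes, competences, threshold)
    return weighted_vote([v for v, _ in kept], [1.0] * len(kept))

def dynamic_weighting(votes, competences):
    """All classifiers vote, weighted by estimated competence."""
    return weighted_vote(votes, competences)

def hybrid(votes, competences, threshold=0.5):
    """Select first, then weight the selected classifiers by competence."""
    kept = select(votes, competences, threshold)
    return weighted_vote([v for v, _ in kept], [c for _, c in kept])

# Hypothetical pool of five base classifiers on one query sample.
votes = ["A", "B", "B", "B", "A"]
competences = [0.9, 0.45, 0.45, 0.45, 0.3]

selection_label = dynamic_selection(votes, competences)
weighting_label = dynamic_weighting(votes, competences)
hybrid_label = hybrid(votes, competences)
print(selection_label, weighting_label, hybrid_label)
```

In this example only the first classifier passes the threshold, so selection and the hybrid scheme follow it, while weighting lets three moderately competent classifiers outvote it.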
Effective Resistance-based Germination of Seed Sets for Community Detection
Eskreis-Winkler, Jonathan, Kondor, Risi
Community detection is, at its core, an attempt to attach an interpretable function to an otherwise indecipherable form. The importance of labeling communities has obvious implications for identifying clusters in social networks, but it has a number of equally relevant applications in product recommendations, biological systems, and many forms of classification. The local variety of community detection starts with a small set of labeled seed nodes, and aims to estimate the community containing these nodes. One of the most ubiquitous methods - due to its simplicity and efficiency - is personalized PageRank. The most obvious bottleneck for deploying this form of PageRank successfully is the quality of the seeds. We introduce a "germination" stage for these seeds, where an effective resistance-based approach is used to increase the quality and number of seeds from which a community is detected. By breaking seed set expansion into a two-step process, we aim to utilize two distinct random walk-based approaches in the regimes in which they excel. In synthetic and real network data, a simple, greedy algorithm which minimizes the effective resistance diameter combined with PageRank achieves clear improvements in precision and recall over a standalone PageRank procedure.
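For reference, the standalone personalized PageRank baseline mentioned above can be sketched as a power iteration that restarts at the seed nodes. This shows only the PageRank step, not the effective resistance-based germination stage. The toy graph (two triangles joined by one edge), the restart probability of 0.15, and the iteration count are assumptions for illustration.

```python
def personalized_pagerank(adj, seeds, alpha=0.15, iters=100):
    """Power iteration for PageRank with restarts to the seed set.

    adj maps each node to a list of neighbors; every node needs at least one.
    alpha is the restart probability.
    """
    nodes = list(adj)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(restart)
    for _ in range(iters):
        nxt = {v: alpha * restart[v] for v in nodes}
        for u in nodes:
            # Each node spreads its non-restart mass evenly over its neighbors.
            share = (1.0 - alpha) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p

# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
adj = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}
scores = personalized_pagerank(adj, seeds={0})

# Take the three highest-scoring nodes as the estimated community.
community = set(sorted(scores, key=scores.get, reverse=True)[:3])
print(scores, community)
```

With a single seed the scores concentrate on the seed's own triangle; the quality of the estimate depends on the seeds, which is the bottleneck the abstract identifies.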
A Mixture Model Based Defense for Data Poisoning Attacks Against Naive Bayes Spam Filters
Miller, David J., Hu, Xinyi, Xiang, Zhen, Kesidis, George
Naive Bayes spam filters are highly susceptible to data poisoning attacks. Here, known spam sources/blacklisted IPs exploit the fact that their received emails will be treated as (ground truth) labeled spam examples, and used for classifier training (or re-training). The attacking source thus generates emails that will skew the spam model, potentially resulting in great degradation in classifier accuracy. Such attacks are successful mainly because of the poor representation power of the naive Bayes (NB) model, with only a single (component) density to represent spam (plus a possible attack). We propose a defense based on the use of a mixture of NB models. We demonstrate that the learned mixture almost completely isolates the attack in a second NB component, with the original spam component essentially unchanged by the attack. Our approach addresses both the scenario where the classifier is being re-trained in light of new data and, significantly, the more challenging scenario where the attack is embedded in the original spam training set. Even for weak attack strengths, BIC-based model order selection chooses a two-component solution, which invokes the mixture-based defense. Promising results are presented on the TREC 2005 spam corpus.
The Price of Fair PCA: One Extra Dimension
Samadi, Samira, Tantipongpipat, Uthaipon, Morgenstern, Jamie, Singh, Mohit, Vempala, Santosh
We investigate whether the standard dimensionality reduction technique of PCA inadvertently produces data representations with different fidelity for two different populations. We show that, on several real-world data sets, PCA has higher reconstruction error on population A than on B (for example, women versus men or lower- versus higher-educated individuals). This can happen even when the data set has a similar number of samples from A and B. This motivates our study of dimensionality reduction techniques which maintain similar fidelity for A and B. We define the notion of Fair PCA and give a polynomial-time algorithm for finding a low dimensional representation of the data which is nearly-optimal with respect to this measure. Finally, we show on real-world data sets that our algorithm can be used to efficiently generate a fair low dimensional representation of the data.
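The disparity described above can be reproduced on synthetic data by fitting standard PCA to the pooled sample and measuring the average reconstruction error separately for each population. The sketch below shows only this measurement, not the Fair PCA algorithm or its fairness criterion. The two Gaussian populations, their per-axis scales, and the function names are assumptions for illustration.

```python
import numpy as np

def pca_components(X, d):
    """Mean and top-d principal directions of X (rows are samples)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:d]

def mean_reconstruction_error(X, mean, components):
    """Average squared distance between samples and their PCA reconstruction."""
    Xc = X - mean
    recon = Xc @ components.T @ components
    return float(np.mean(np.sum((Xc - recon) ** 2, axis=1)))

# Two synthetic populations of equal size (assumption, not real data):
# A varies mostly along the second axis, B far more strongly along the first.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3)) * np.array([0.3, 2.0, 0.3])
B = rng.standard_normal((100, 3)) * np.array([5.0, 0.3, 0.3])

# Standard PCA on the pooled data, keeping one dimension.
pooled = np.vstack([A, B])
mean, comps = pca_components(pooled, d=1)

err_A = mean_reconstruction_error(A, mean, comps)
err_B = mean_reconstruction_error(B, mean, comps)
print("error on A:", err_A, "error on B:", err_B)
```

The single retained direction aligns with the axis along which B varies, so A is reconstructed poorly even though both populations contribute the same number of samples.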
Escaping the Curse of Dimensionality in Similarity Learning: Efficient Frank-Wolfe Algorithm and Generalization Bounds
High-dimensional and sparse data are commonly encountered in many applications of machine learning, such as computer vision, bioinformatics, text mining and behavioral targeting. To classify, cluster or rank data points, it is important to be able to compute semantically meaningful similarities between them. However, defining an appropriate similarity measure for a given task is often difficult as only a small and unknown subset of all features is actually relevant. For instance, in drug discovery studies, chemical compounds are typically represented by a large number of sparse features describing their 2D and 3D properties, and only a few of them play a role in determining whether the compound will bind to a particular target receptor (Leach and Gillet, 2007). In text classification and clustering, a document is often represented as a sparse bag of words, and only a small subset of the dictionary is generally useful to discriminate between documents about different topics. Another example is targeted advertising, where ads are selected based on fine-grained user history (Chen et al., 2009). Similarity and metric learning (Bellet et al., 2015) offers principled approaches to construct a task-specific similarity measure by learning it from weakly supervised data, and has been used in many application domains. The main theme in these methods is to learn the parameters of a similarity (or distance) function such that it agrees with task-specific similarity judgments (e.g., of the form "data point x should