Illinois Institute of Technology


Controlling for Unobserved Confounds in Classification Using Correlational Constraints

AAAI Conferences

As statistical classifiers become integrated into real-world applications, it is important to consider not only their accuracy but also their robustness to changes in the data distribution. In this paper, we consider the case where there is an unobserved confounding variable z that influences both the features x and the class variable y. When the influence of z changes from training to testing data, we find that the classifier accuracy can degrade rapidly. In our approach, we assume that we can predict the value of z at training time with some error. The prediction for z is then fed to Pearl's back-door adjustment to build our model. Because of the attenuation bias caused by measurement error in z, standard approaches to controlling for z are ineffective. In response, we propose a method to properly control for the influence of z by first estimating its relationship with the class variable y, then updating predictions for z to match that estimated relationship. By adjusting the influence of z, we show that we can build a model that exceeds competing baselines on accuracy as well as on robustness over a range of confounding relationships.
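The core adjustment step can be made concrete with a small sketch: append the predicted confounder to the features, fit P(y | x, z), and then marginalize z out at prediction time. This is a minimal illustration assuming a discrete z and an sklearn logistic regression; it omits the paper's correction for measurement error in the predicted z, and the function names are illustrative.

```python
# Minimal sketch of back-door adjustment with a (noisily) predicted confounder z.
# Assumes z is integer-coded (e.g., 0/1); not the authors' released code.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_backdoor(X, y, z_hat, C=1.0):
    """Fit P(y | x, z) with the predicted confounder appended as a feature."""
    z_hat = np.asarray(z_hat)
    Xz = np.column_stack([X, z_hat])
    clf = LogisticRegression(C=C, max_iter=1000).fit(Xz, y)
    p_z = np.bincount(z_hat) / len(z_hat)      # empirical P(z) on the training data
    return clf, p_z

def predict_backdoor(clf, p_z, X):
    """Back-door adjusted prediction: P(y | x) = sum_z P(y | x, z) P(z)."""
    X = np.asarray(X)
    probs = np.zeros((X.shape[0], len(clf.classes_)))
    for z_val, p in enumerate(p_z):
        Xz = np.column_stack([X, np.full(X.shape[0], z_val)])
        probs += p * clf.predict_proba(Xz)
    return probs
```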


Identifying Leading Indicators of Product Recalls from Online Reviews Using Positive Unlabeled Learning and Domain Adaptation

AAAI Conferences

Consumer protection agencies are charged with safeguarding the public from hazardous products, but the thousands of products under their jurisdiction make it challenging to identify and respond to consumer complaints quickly. In this paper, we propose a system to mine Amazon.com reviews to identify products that may pose safety or health hazards. Since labeled data for this task are scarce, our approach combines positive unlabeled learning with domain adaptation to train a classifier from consumer complaints submitted to an online government portal. We find that our approach results in an absolute F1 score improvement of 8% over the best competing baseline. Furthermore, when we apply the classifier to Amazon reviews of known recalled products, we identify safety hazard reports prior to the recall date for 45% of the products. This suggests that the approach may serve as an early warning system, alerting consumers to hazardous products before an official recall is announced.
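The positive-unlabeled piece can be illustrated with one standard recipe, Elkan and Noto's labeling-rate correction, sketched below under the assumption of dense feature matrices; the paper's actual PU learner and its combination with domain adaptation may differ.

```python
# Hedged sketch of a standard positive-unlabeled (PU) recipe (Elkan & Noto, 2008)
# of the kind such a pipeline could build on; illustrative, not the paper's method.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pu(X_pos, X_unlabeled):
    """Train a labeled-vs-unlabeled classifier and estimate the labeling rate c."""
    X = np.vstack([X_pos, X_unlabeled])
    s = np.r_[np.ones(len(X_pos)), np.zeros(len(X_unlabeled))]
    g = LogisticRegression(max_iter=1000).fit(X, s)
    # c = P(s=1 | y=1), estimated on positives (ideally a held-out set of them).
    c = g.predict_proba(X_pos)[:, 1].mean()
    return g, c

def predict_pu(g, c, X):
    """Corrected probability of the true positive class: p(y=1 | x) = g(x) / c."""
    return np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)
```

The domain-adaptation step, re-weighting or augmenting features so a classifier trained on complaint-portal text transfers to Amazon reviews, is omitted from this sketch.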


Semi-Supervised Classifications via Elastic and Robust Embedding

AAAI Conferences

Transductive semi-supervised learning can only predict labels for unlabeled data that appear in the training set; it cannot predict labels for test data never seen during training. To handle this out-of-sample problem, many inductive methods impose the constraint that the predicted label matrix be exactly equal to the output of a linear model. In practice, this constraint may be too rigid to capture the manifold structure of the data. In this paper, we relax this rigid constraint and propose to use an elastic constraint on the predicted label matrix so that the manifold structure can be better explored. Moreover, since unlabeled data are often very abundant in practice and usually contain some outliers, we use a non-squared loss instead of the traditional squared loss to learn a robust model. The derived problem, although convex, has many nonsmooth terms, which makes it very challenging to solve. In this paper, we propose an efficient optimization algorithm that solves a more general problem, from which we obtain the optimal solution to the derived problem.
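To make the abstract's ingredients concrete (manifold smoothness, an elastic rather than exact linear constraint, and a non-squared robust loss), one plausible form such an objective could take is sketched below; the notation, terms, and weights are assumptions rather than the paper's exact formulation.

```latex
% Assumed sketch: F is the predicted label matrix, L a graph Laplacian over the data X,
% (W, b) a linear model, F_l / Y_l the labeled rows of F and their given labels, and
% \|\cdot\|_{2,1} a non-squared (row-wise) robust loss.
\min_{F,\,W,\,b}\;
  \operatorname{tr}\!\left(F^{\top} L F\right)
  \;+\; \alpha \,\bigl\lVert F - XW - \mathbf{1}b^{\top} \bigr\rVert_{2,1}
  \;+\; \beta \,\bigl\lVert F_{l} - Y_{l} \bigr\rVert_{2,1}
```

Here the second term is the "elastic" relaxation: the label matrix is encouraged to stay near a linear model rather than forced to equal it, and the non-squared norms dampen the influence of outliers.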


Robust Text Classification in the Presence of Confounding Bias

AAAI Conferences

As text classifiers are increasingly used in real-time applications, it is critical to consider not only their accuracy but also their robustness to changes in the data distribution. In this paper, we consider the case where there is a confounding variable Z that influences both the text features X and the class variable Y. For example, a classifier trained to predict the health status of a user based on their online communications may be confounded by socioeconomic variables. When the influence of Z changes from training to testing data, we find that classifier accuracy can degrade rapidly. Our approach, based on Pearl's back-door adjustment, estimates the underlying effect of a text variable on the class variable while controlling for the confounding variable. Although our goal is prediction, not causal inference, we find that such adjustments are essential to building text classifiers that are robust to confounding variables. On three diverse text classification tasks, we find that covariate adjustment results in higher accuracy than competing baselines over a range of confounding relationships (e.g., in one setting, accuracy improves from 60% to 81%).
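A simple way to see the robustness problem the abstract describes is to build train and test sets in which the confounder-label association is deliberately shifted; the sketch below does this by biased subsampling. The sampling scheme and parameters are illustrative assumptions, not the paper's exact experimental protocol.

```python
# Hedged sketch: subsample so the association between a binary confounder z and the
# label y differs between training and testing, to probe classifier robustness.
import numpy as np

def sample_with_bias(y, z, p_z1_given_y1, n, rng=None):
    """Indices such that P(z=1 | y=1) ~= p_z1_given_y1 (and symmetrically for y=0)."""
    rng = np.random.default_rng(rng)
    idx = []
    for y_val in (0, 1):
        p = p_z1_given_y1 if y_val == 1 else 1.0 - p_z1_given_y1
        pool_z1 = np.flatnonzero((y == y_val) & (z == 1))
        pool_z0 = np.flatnonzero((y == y_val) & (z == 0))
        k = n // 2
        n_z1 = int(round(p * k))
        idx.extend(rng.choice(pool_z1, n_z1, replace=True))
        idx.extend(rng.choice(pool_z0, k - n_z1, replace=True))
    return np.array(idx)

# e.g., train with P(z=1|y=1)=0.9 but test with 0.1 to simulate a confounding shift:
# train_idx = sample_with_bias(y, z, 0.9, 2000)
# test_idx  = sample_with_bias(y, z, 0.1, 2000)
```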


Active Inference and Dynamic Gaussian Bayesian Networks for Battery Optimization in Wireless Sensor Networks

AAAI Conferences

Wireless sensor networks play a major role in smart grids and smart buildings. They are used not only for sensing but also for actuation. In terms of sensing, they are used to measure temperature, humidity, and light, and to detect motion. Sensors often run on batteries, so we face a trade-off between obtaining frequent sensor readings and maximizing battery life. There have been several approaches to maximizing battery life, from the hardware level to the software level, such as reducing component energy consumption, limiting node operation capabilities, using power-aware routing protocols, and adding solar energy support. In this paper, we introduce a novel approach: we model the sensor readings in a wireless network using a dynamic Gaussian Bayesian network (dGBn) whose structure is automatically learned from data. The dGBn allows us to integrate information across sensors and infer missing readings more accurately. Through active inference for dGBns, we are able to actively choose which sensors should be polled for a reading and which can stay in a power-saving mode at each time step, maximizing prediction accuracy while staying within budgetary constraints on battery consumption.
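The sensor-selection idea can be sketched as greedy variance reduction under a joint Gaussian over the current readings: at each step, poll the sensor whose observation most shrinks the remaining posterior variances. The static Gaussian and the greedy criterion below are simplifying assumptions standing in for the paper's dynamic Gaussian Bayesian network and its active-inference procedure.

```python
# Hedged sketch: greedy selection of sensors to poll under a joint Gaussian model.
import numpy as np

def greedy_poll(Sigma, budget):
    """Return indices of sensors to poll, chosen greedily by total variance reduction."""
    n = Sigma.shape[0]
    cov = Sigma.copy()
    polled = []
    for _ in range(budget):
        best, best_gain = None, -np.inf
        for j in range(n):
            if j in polled:
                continue
            # Observing j reduces every other variance by cov[i, j]**2 / cov[j, j].
            gain = np.sum(cov[:, j] ** 2) / cov[j, j] - cov[j, j]
            if gain > best_gain:
                best, best_gain = j, gain
        polled.append(best)
        # Condition the covariance on sensor `best` being observed (Schur complement).
        cov = cov - np.outer(cov[:, best], cov[best, :]) / cov[best, best]
        cov[best, :] = 0.0
        cov[:, best] = 0.0
    return polled
```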


Using Matched Samples to Estimate the Effects of Exercise on Mental Health via Twitter

AAAI Conferences

Recent work has demonstrated the value of social media monitoring for health surveillance (e.g., tracking influenza or depression rates). It is an open question whether such data can be used to make causal inferences (e.g., determining which activities lead to increased depression rates). Even in traditional, restricted domains, estimating causal effects from observational data is highly susceptible to confounding bias. In this work, we estimate the effect of exercise on mental health from Twitter, relying on statistical matching methods to reduce confounding bias. We train a text classifier to estimate the volume of a user's tweets expressing anxiety, depression, or anger, then compare two groups: those who exercise regularly (identified by their use of physical activity trackers like Nike+), and a matched control group. We find that those who exercise regularly have significantly fewer tweets expressing depression or anxiety; there is no significant difference in rates of tweets expressing anger. We additionally perform a sensitivity analysis to investigate how the many experimental design choices in such a study impact the final conclusions, including the quality of the classifier and the construction of the control group.
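The matching step can be sketched as nearest-neighbor matching on observed covariates (with replacement), followed by a comparison of mean outcomes between exercisers and their matched controls; the covariates, outcome definition, and 1-NN matching below are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of matched-sample comparison: exercisers vs. matched non-exercisers.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def matched_difference(X_treated, y_treated, X_control, y_control):
    """Mean outcome difference between treated users and their 1-NN matched controls.

    X_*: covariate matrices (e.g., tweet volume, demographics proxies);
    y_*: outcome per user (e.g., fraction of tweets the classifier labels depressed).
    """
    nn = NearestNeighbors(n_neighbors=1).fit(X_control)
    _, idx = nn.kneighbors(X_treated)
    y_matched = y_control[idx.ravel()]
    return y_treated.mean() - y_matched.mean()
```

In practice the covariates would be standardized before matching and the resulting difference accompanied by a significance test and the sensitivity analysis the abstract mentions.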


Anytime Active Learning

AAAI Conferences

A common bottleneck in deploying supervised learning systems is collecting human-annotated examples. In many domains, annotators form an opinion about the label of an example incrementally -- e.g., each additional word read from a document or each additional minute spent inspecting a video helps inform the annotation. In this paper, we investigate whether we can train learning systems more efficiently by requesting an annotation before inspection is fully complete -- e.g., after reading only 25 words of a document. While doing so may reduce the overall annotation time, it also introduces the risk that the annotator might not be able to provide a label if interrupted too early. We propose an anytime active learning approach that optimizes the annotation time and response rate simultaneously. We conduct user studies on two document classification datasets and develop simulated annotators that mimic the users. Our simulated experiments show that anytime active learning outperforms several baselines on these two datasets. For example, with an annotation budget of one hour, training a classifier by annotating the first 25 words of each document reduces classification error by 17% over annotating the first 100 words of each document.
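The anytime trade-off can be sketched as scoring each (document, truncation length) pair by expected benefit per unit annotation time, combining the learner's uncertainty on the truncated text with the estimated chance that the annotator can answer at all; the particular utility below is an illustrative assumption, not the paper's exact objective.

```python
# Hedged sketch of an anytime active-learning query score.
import numpy as np

def anytime_utility(p_label, p_answer, seconds):
    """p_label: model's positive-class probability on the truncated text;
    p_answer: estimated probability the annotator responds given that truncation;
    seconds: estimated annotation time for that truncation."""
    uncertainty = 1.0 - np.abs(2.0 * p_label - 1.0)   # peaks when the model is unsure
    return p_answer * uncertainty / seconds

def select_query(candidates):
    """candidates: iterable of (doc_id, truncation, p_label, p_answer, seconds)."""
    return max(candidates, key=lambda c: anytime_utility(c[2], c[3], c[4]))
```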


Automatic Identification of Conceptual Metaphors With Limited Knowledge

AAAI Conferences

Full natural language understanding requires identifying and analyzing the meanings of metaphors, which are ubiquitous in both text and speech. Over the last thirty years, linguistic metaphors have been shown to be based on more general conceptual metaphors, partial semantic mappings between disparate conceptual domains. Though some achievements have been made in identifying linguistic metaphors over the last decade or so, little work has been done to date on automatically identifying conceptual metaphors. This paper describes research on identifying conceptual metaphors based on corpus data. Our method uses as little background knowledge as possible, to ease transfer to new languages and to minimize any bias introduced by the knowledge base construction process. The method relies on general heuristics for identifying linguistic metaphors and statistical clustering (guided by WordNet) to form conceptual metaphor candidates. Human experiments show the system effectively finds meaningful conceptual metaphors.
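The clustering step can be sketched as grouping the source-domain terms of detected linguistic metaphors by WordNet similarity, so that large clusters become conceptual-metaphor candidates; agglomerative clustering over path similarity is an illustrative choice and may differ from the paper's procedure.

```python
# Hedged sketch: cluster metaphor source terms by WordNet similarity.
# Requires: pip install nltk scikit-learn (>=1.2); nltk.download('wordnet')
import numpy as np
from nltk.corpus import wordnet as wn
from sklearn.cluster import AgglomerativeClustering

def wordnet_distance(w1, w2):
    """1 minus the best path similarity between any noun senses of the two words."""
    best = 0.0
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            best = max(best, s1.path_similarity(s2) or 0.0)
    return 1.0 - best

def cluster_terms(words, n_clusters):
    """Group terms into n_clusters using a precomputed WordNet distance matrix."""
    D = np.array([[wordnet_distance(a, b) for b in words] for a in words])
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed", linkage="average")
    return model.fit_predict(D)
```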