 Accuracy


Fine-Grained Car Detection for Visual Census Estimation

AAAI Conferences

Targeted socio-economic policies require an accurate understanding of a country’s demographic makeup. To that end, the United States spends more than 1 billion dollars a year gathering census data such as race, gender, education, occupation and unemployment rates. Compared to the traditional method of collecting surveys across many years, which is costly and labor intensive, data-driven, machine learning-based approaches are cheaper and faster, with the potential ability to detect trends in close to real time. In this work, we leverage the ubiquity of Google Street View images and develop a computer vision pipeline to predict income, per capita carbon emission, crime rates and other city attributes from a single source of publicly available visual data. We first detect cars in 50 million images across 200 of the largest US cities and train a model to predict demographic attributes using the detected cars. To facilitate our work, we have collected the largest and most challenging fine-grained dataset reported to date, consisting of over 2600 classes of cars, composed of images from Google Street View and other web sources and classified by car experts to account for even the most subtle of visual differences. We use this data to construct the largest-scale fine-grained detection system reported to date. Our prediction results correlate well with ground truth income data (r=0.82), Massachusetts department of vehicle registration data, and sources investigating crime rates, income segregation, per capita carbon emission, and other market research. Finally, we learn interesting relationships between cars and neighborhoods, allowing us to perform the first large-scale sociological analysis of cities using computer vision techniques.


Landmark-Based Heuristics for Goal Recognition

AAAI Conferences

Automated planning can be used to efficiently recognize goals and plans from partially or fully observed action sequences. In this paper, we propose goal recognition heuristics that rely on information from planning landmarks: facts or actions that must occur if a plan is to achieve a goal when starting from some initial state. We develop two such heuristics: the first estimates goal completion by considering the ratio between achieved and extracted landmarks of a candidate goal, while the second takes into account how unique each landmark is among landmarks for all candidate goals. We empirically evaluate these heuristics over both standard goal/plan recognition problems and a set of very large problems. We show that our heuristics can recognize goals more accurately, and run orders of magnitude faster, than the current state of the art.
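
The two heuristics described in this abstract can be illustrated with a minimal sketch. It assumes landmarks are already extracted and represented as plain strings; the function names, the `1 / (number of goals sharing the landmark)` uniqueness weight, and the toy data are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, Set


def goal_completion(achieved: Set[str], landmarks: Set[str]) -> float:
    """Ratio of achieved landmarks to all extracted landmarks of one goal."""
    if not landmarks:
        return 0.0
    return len(achieved & landmarks) / len(landmarks)


def uniqueness_score(achieved: Set[str], goal: str,
                     goal_landmarks: Dict[str, Set[str]]) -> float:
    """Completion ratio where each landmark is weighted by how unique it is.

    Assumption: a landmark shared by k candidate goals gets weight 1 / k,
    so landmarks that belong to a single goal count the most.
    """
    def weight(landmark: str) -> float:
        shared_by = sum(1 for lms in goal_landmarks.values() if landmark in lms)
        return 1.0 / shared_by

    landmarks = goal_landmarks[goal]
    total = sum(weight(lm) for lm in landmarks)
    if total == 0.0:
        return 0.0
    return sum(weight(lm) for lm in landmarks & achieved) / total


# Hypothetical toy problem: two candidate goals and two observed landmarks.
GOAL_LANDMARKS = {"g1": {"a", "b", "c", "d"}, "g2": {"a", "e"}}
OBSERVED = {"a", "b"}

completion = {g: goal_completion(OBSERVED, lms) for g, lms in GOAL_LANDMARKS.items()}
uniqueness = {g: uniqueness_score(OBSERVED, g, GOAL_LANDMARKS) for g in GOAL_LANDMARKS}
```

In this toy problem the plain ratio ties both goals at 0.5, while the uniqueness-weighted score separates them (3/7 for `g1` versus 1/3 for `g2`), because the shared landmark `a` carries less evidence than the landmark `b` that only `g1` requires.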


A Framework of Online Learning with Imbalanced Streaming Data

AAAI Conferences

A challenge for mining large-scale streaming data that is overlooked by most existing studies on online learning is the skewed distribution of examples over different classes. Many previous works have considered cost-sensitive approaches in an online setting for streaming data, where fixed costs are assigned to different classes, or ad-hoc costs are adapted based on the distribution of data received so far. However, these approaches do not necessarily achieve optimal performance in terms of the measures suited for imbalanced data, such as F-measure, area under the ROC curve (AUROC), and area under the precision-recall curve (AUPRC). This work proposes a general framework for online learning with imbalanced streaming data, where examples arrive sequentially and models are updated accordingly on the fly. By simultaneously learning multiple classifiers with different cost vectors, the proposed method can be adopted for different target measures for imbalanced data, including F-measure, AUROC and AUPRC. Moreover, we present a rigorous theoretical justification of the proposed framework for F-measure maximization. Our empirical studies demonstrate the competitive, if not better, performance of the proposed method compared to previous cost-sensitive and resampling-based online learning algorithms and those that are designed for optimizing certain measures.


Poisson Sum-Product Networks: A Deep Architecture for Tractable Multivariate Poisson Distributions

AAAI Conferences

Multivariate count data are pervasive in science in the form of histograms, contingency tables and others. Previous work on modeling this type of distribution does not allow for fast and tractable inference. In this paper we present a novel Poisson graphical model, the first based on sum-product networks, called PSPN, allowing for positive as well as negative dependencies. We present algorithms for learning tree PSPNs from data as well as for tractable inference via symbolic evaluation. With these, information-theoretic measures such as entropy, mutual information, and distances among count variables can be computed without resorting to approximations. Additionally, we show a connection between PSPNs and LDA, linking the structure of tree PSPNs to a hierarchy of topics. The experimental results on several synthetic and real-world datasets demonstrate that PSPNs often outperform the state of the art while remaining tractable.


Recovering True Classifier Performance in Positive-Unlabeled Learning

AAAI Conferences

A common approach in positive-unlabeled learning is to train a classification model between labeled and unlabeled data. This strategy is in fact known to give an optimal classifier under mild conditions; however, it results in biased empirical estimates of the classifier performance. In this work, we show that the typically used performance measures, such as the receiver operating characteristic curve or the precision-recall curve obtained on such data, can be corrected with the knowledge of class priors, i.e., the proportions of the positive and negative examples in the unlabeled data. We extend the results to a noisy setting where some of the examples labeled positive are in fact negative and show that the correction also requires the knowledge of the proportion of noisy examples in the labeled positives. Using state-of-the-art algorithms to estimate the positive class prior and the proportion of noise, we experimentally evaluate two correction approaches and demonstrate their efficacy on real-life data.
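
To make the idea of a class-prior correction concrete, here is a minimal sketch for the noise-free case. It is derived from a simple mixture argument rather than taken from the paper: it assumes the labeled positives are a representative sample of all positives, and that the unlabeled set contains a known fraction `alpha` of positives. Under those assumptions the false positive rate measured against unlabeled data is `alpha * tpr + (1 - alpha) * fpr`, which can be inverted. All function and parameter names are hypothetical.

```python
def corrected_fpr(fpr_pu: float, tpr: float, alpha: float) -> float:
    """Recover the true false positive rate from a PU estimate.

    fpr_pu: fraction of unlabeled examples predicted positive.
    tpr:    fraction of labeled positives predicted positive.
    alpha:  assumed proportion of positives among the unlabeled examples.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    # Invert fpr_pu = alpha * tpr + (1 - alpha) * fpr for the true fpr.
    fpr = (fpr_pu - alpha * tpr) / (1.0 - alpha)
    # Sampling noise can push the estimate slightly outside [0, 1].
    return min(1.0, max(0.0, fpr))


def corrected_precision(fpr_pu: float, tpr: float, alpha: float) -> float:
    """Precision on the unlabeled population under the same assumptions.

    Of the unlabeled examples, a fraction fpr_pu is predicted positive and a
    fraction alpha * tpr is both positive and predicted positive.
    """
    if fpr_pu <= 0.0:
        raise ValueError("no unlabeled example is predicted positive")
    return min(1.0, alpha * tpr / fpr_pu)
```

For example, with `alpha = 0.2`, `tpr = 0.8`, and a true false positive rate of 0.1, the rate measured against unlabeled data would be 0.24; `corrected_fpr(0.24, 0.8, 0.2)` maps it back to approximately 0.1. The noisy setting described in the abstract needs an additional noise proportion and is not covered by this sketch.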


Multitask Dyadic Prediction and Its Application in Prediction of Adverse Drug-Drug Interaction

AAAI Conferences

Adverse drug-drug interactions (DDIs) remain a leading cause of morbidity and mortality around the world. Identifying potential DDIs during the drug design process is critical in guiding targeted clinical drug safety testing. Although detection of adverse DDIs is conducted during Phase IV clinical trials, a large number of new DDIs are still found by accident after the drugs have been put on the market. With the arrival of the big data era, more and more pharmaceutical research and development data are becoming available, providing an invaluable resource for insights that can potentially be leveraged in early prediction of DDIs. Many computational approaches have been proposed in recent years for DDI prediction. However, most of them focus on binary prediction (with or without DDI), despite the fact that each DDI is associated with a different type. Predicting the actual DDI type will help us better understand the DDI mechanism and identify proper ways to prevent it. In this paper, we formulate the DDI type prediction problem as a multitask dyadic regression problem, where the prediction of each specific DDI type is treated as a task. Compared with conventional matrix completion approaches, which can only impute the missing entries in the DDI matrix, our approach can directly regress those dyadic relationships (DDIs) and thus can be extended to new drugs more easily. We developed an effective proximal gradient method to solve the problem. Evaluation on real-world datasets is presented to demonstrate the effectiveness of the proposed approach.


Predicting Soccer Highlights from Spatio-Temporal Match Event Streams

AAAI Conferences

Sports broadcasters are continuously seeking to make their live coverage of soccer matches more attractive. A recent innovation is the “highlight channel,” which shows the most interesting events from multiple matches played at the same time. However, switching between matches at the right time is challenging in fast-paced sports like soccer, where interesting situations often evolve as quickly as they disappear again. This paper presents the POGBA algorithm for automatically predicting highlights in soccer matches, which is an important task that has not yet been addressed. POGBA leverages spatio-temporal event streams collected during matches to predict the probability that a particular game state will lead to a goal. An empirical evaluation on a real-world dataset shows that POGBA outperforms the baseline algorithms in terms of both precision and recall.


ICU Mortality Prediction: A Classification Algorithm for Imbalanced Datasets

AAAI Conferences

Determining mortality risk is important for critical decisions in Intensive Care Units (ICU). The need for machine learning models that provide accurate patient-specific prediction of mortality is well recognized. We present a new algorithm for ICU mortality prediction that is designed to address the problem of imbalance, which occurs, in the context of binary classification, when one of the two classes is significantly underrepresented in the data. We take a fundamentally new approach in exploiting the class imbalance through a feature transformation such that the transformed features are easier to classify. Hypothesis testing is used for classification with a test statistic that follows the distribution of the difference of two chi-squared random variables. Because no analytic expression exists for this distribution, we derive an accurate approximation. Experiments on a benchmark dataset of 4000 ICU patients show that our algorithm surpasses the best competing methods for mortality prediction.


Treatment Effect Estimation with Data-Driven Variable Decomposition

AAAI Conferences

One fundamental problem in causal inference is treatment effect estimation in observational studies when variables are confounded. Control for the confounding effect is generally handled by the propensity score. However, this approach treats all observed variables as confounders and ignores the adjustment variables, which have no influence on treatment but are predictive of the outcome. Recently, it has been demonstrated that the adjustment variables are effective in reducing the variance of the estimated treatment effect. However, how to automatically separate the confounders and adjustment variables in observational studies is still an open problem, especially in scenarios with high dimensional variables, which are common in the big data era. In this paper, we propose a Data-Driven Variable Decomposition (D$^2$VD) algorithm, which can 1) automatically separate confounders and adjustment variables with a data-driven approach, and 2) simultaneously estimate the treatment effect in observational studies with high dimensional variables. Under standard assumptions, we show experimentally that the proposed D$^2$VD algorithm can automatically separate the variables precisely, and estimate the treatment effect more accurately and with tighter confidence intervals than the state-of-the-art methods on both synthetic data and a real online advertising dataset.


A Leukocyte Detection Technique in Blood Smear Images Using Plant Growth Simulation Algorithm

AAAI Conferences

For quite some time, the analysis of leukocyte images has drawn significant attention from the fields of medicine and computer vision alike, where various techniques have been used to automate the manual analysis and classification of such images. Analysing such samples manually to detect leukocytes is time-consuming and prone to error, as the cells have different morphological features. Therefore, in order to automate and optimize the process, the nature-inspired Plant Growth Simulation Algorithm (PGSA) is applied in this paper. We present an automated technique for detecting white blood cells embedded in obscured, stained and smeared images of blood samples. The technique is based on a random bionic algorithm and makes use of a fitness function that measures the similarity of the generated candidate solution to an actual leukocyte. As the proposed algorithm proceeds, the set of candidate solutions evolves, guaranteeing their fit with the actual leukocytes outlined in the edge map of the image. Experimental results on stained images show that the proposed method achieves higher precision and sensitivity than existing methods. Further, the proposed method reduces the feasible set of candidate points in each iteration, thereby decreasing the run time required for load flow and objective function evaluation, and thus reaches the goal state in minimum time and within the desired constraints.