Statistical Learning
CosTriage: A Cost-Aware Triage Algorithm for Bug Reporting Systems
Park, Jin-woo (Pohang University of Science and Technology (POSTECH)) | Lee, Mu-Woong (Pohang University of Science and Technology (POSTECH)) | Kim, Jinhan (Pohang University of Science and Technology (POSTECH)) | Hwang, Seung-won (Pohang University of Science and Technology (POSTECH)) | Kim, Sunghun (Hong Kong University of Science and Technology (HKUST))
"Who can fix this bug?" is an important question in bug triage to "accurately" assign developers to bug reports. To address this question, recent research treats it as a optimizing recommendation accuracy problem and proposes a solution that is essentially an instance of content-based recommendation (CBR). However, CBR is well-known to cause over-specialization, recommending only the types of bugs that each developer has solved before. This problem is critical in practice, as some experienced developers could be overloaded, and this would slow the bug fixing process. In this paper, we take two directions to address this problem: First,we reformulate the problem as an optimization problem of both accuracy and cost. Second, we adopt a content-boosted collaborative filtering (CBCF), combining an existing CBR with a collaborative filtering recommender (CF), which enhances the recommendationquality of either approach alone. However, unlike general recommendation scenarios, bug fix history is extremely sparse. Due to the nature of bug fixes, one bug is fixed by only one developer, which makes it challenging to pursue the above two directions. To address this challenge, we develop a topic-model to reduce the sparseness and enhance the quality of CBCF. Our experimental evaluation shows that our solution reduces the cost efficiently by 30% without seriously compromising accuracy.
Euclidean Heuristic Optimization
Rayner, D. Chris F. (University of Alberta) | Bowling, Michael H. (University of Alberta) | Sturtevant, Nathan R. (University of Denver)
We pose the problem of constructing good search heuristics as an optimization problem: minimizing the loss between the true distances and the heuristic estimates subject to admissibility and consistency constraints. For a well-motivated choice of loss function, we show performing this optimization is tractable. In fact, it corresponds to a recently proposed method for dimensionality reduction. We prove this optimization is guaranteed to produce admissible and consistent heuristics, generalizes and gives insight into differential heuristics, and show experimentally that it produces strong heuristics on problems from three distinct search domains.
A Tutorial on Bayesian Nonparametric Models
Gershman, Samuel J., Blei, David M.
A key problem in statistical modeling is model selection, how to choose a model at an appropriate level of complexity. This problem appears in many settings, most prominently in choosing the number ofclusters in mixture models or the number of factors in factor analysis. In this tutorial we describe Bayesian nonparametric methods, a class of methods that side-steps this issue by allowing the data to determine the complexity of the model. This tutorial is a high-level introduction to Bayesian nonparametric methods and contains several examples of their application.
Discovering Latent Strategies
Xu, Xiaoxi (University of Massachusetts Amherst)
Strategy mining is a new area of research about discovering strategies in decision-making. In this paper, we formulate the strategy-mining problem as a clustering problem, called the latent-strategy problem. In a latent-strategy problem, a corpus of data instances is given, each of which is represented by a set of features and a decision label. The inherent dependency of the decision label on the features is governed by a latent strategy. The objective is to find clusters, each of which contains data instances governed by the same strategy. Existing clustering algorithms are inappropriate to cluster dependency because they either assume feature independency (e.g., K-means) or only consider the co-occurrence of features without explicitly modeling the special dependency of the decision label on other features (e.g., Latent Dirichlet Allocation (LDA)). In this paper, we present a baseline unsupervised learning algorithm for dependency clustering. Our model-based clustering algorithm iterates between an assignment step and a minimization step to learn a mixture of decision tree models that represent latent strategies. Similar to the Expectation Maximization algorithm, our algorithm is grounded in the statistical learning theory. Different from other clustering algorithms, our algorithm is irrelevant-feature resistant and its learned clusters (modeled by decision trees) are strongly interpretable and predictive. We systematically evaluate our algorithm using a common law dataset comprised of actual cases. Experimental results show that our algorithm significantly outperforms K-means and LDA on clustering dependency.
Multiple-Instance Learning: Multiple Feature Selection on Instance Representation
Jhuo, I-Hong (National Taiwan University) | Lee, D. T. (Academia Sinica)
In multiple-Instance Learning (MIL), training class labels are attached to sets of bags composed of unlabeled instances, and the goal is to deal with classification of bags. Most previous MIL algorithms, which tackle classification problems, consider each instance as a represented feature. Although the algorithms work well in some prediction problems, considering diverse features to represent an instance may provide more significant information for learning task. Moreover, since each instance may be mapped into diverse feature spaces, encountering a large number of irrelevant or redundant features is inevitable. In this paper, we propose a method to select relevant instances and concurrently consider multiple features for each instance, which is termed as MIL-MFS. MIL-MFS is based on multiple kernel learning (MKL), and it iteratively selects the fusing multiple features for classifier training. Experimental results show that the MIL-MFS combined with multiple kernel learning can significantly improve the classification performance.
Multi-Observation Sensor Resetting Localization with Ambiguous Landmarks
Coltin, Brian (Carnegie Mellon University) | Veloso, Manuela (Carnegie Mellon University)
Successful approaches to the robot localization problem include Monte Carlo particle filters, which estimate non-parametric localization belief distributions. However, particle filters fare poorly at determining the robot's position without a good initial hypothesis. This problem has been addressed for robots that sense visual landmarks with sensor resetting, by performing sensor-based resampling when the robot is lost. For robots that make sparse, ambiguous and noisy observations, standard sensor resetting places new location hypotheses across a wide region, in positions that may be inconsistent with previous observations. We propose Multi-Observation Sensor Resetting, where observations from multiple frames are merged to generate new hypotheses more effectively. We demonstrate experimentally in the robot soccer domain on the NAO humanoid robots that Multi-Observation Sensor Resetting converges more efficiently to the robot's true position than standard sensor resetting, and is more robust to systematic vision errors.
End-User Feature Labeling via Locally Weighted Logistic Regression
Wong, Weng-Keen (Oregon State University) | Oberst, Ian (Oregon State University) | Das, Shubhomoy (Oregon State University) | Moore, Travis (Oregon State University) | Stumpf, Simone (City University London) | McIntosh, Kevin (Oregon State University) | Burnett, Margaret (Oregon State University)
Applications that adapt to a particular end user often make inaccurate predictions during the early stages when training data is limited. Although an end user can improve the learning algorithm by labeling more training data, this process is time consuming and too ad hoc to target a particular area of inaccuracy. To solve this problem, we propose a new learning algorithm based on Locally Weighted Logistic Regression for feature labeling by end users, enabling them to point out which features are important for a class, rather than provide new training instances. In our user study, the first allowing ordinary end users to freely choose features to label directly from text documents, our algorithm was more effective than others at leveraging end usersโ feature labels to improve the learning algorithm. Our results strongly suggest that allowing users to freely choose features to label is a promising method for allowing end users to improve learning algorithms effectively.
Effective End-User Interaction with Machine Learning
Amershi, Saleema (University of Washington) | Fogarty, James (University of Washington) | Kapoor, Ashish (Microsoft Research) | Tan, Desney (Microsoft Research)
End-user interactive machine learning is a promising tool for enhancing human productivity and capabilities with large unstructured data sets. Recent work has shown that we can create end-user interactive machine learning systems for specific applications. However, we still lack a generalized understanding of how to design effective end-user interaction with interactive machine learning systems. This work presents three explorations in designing for effective end-user interaction with machine learning in CueFlik, a system developed to support Web image search. These explorations demonstrate that interactions designed to balance the needs of end-users and machine learning algorithms can significantly improve the effectiveness of end-user interactive machine learning.
Modeling and Monitoring Crop Disease in Developing Countries
Quinn, John Alexander (Makerere University) | Leyton-Brown, Kevin (Associate Professor, Department of Computer Science) | Mwebaze, Ernest (Makerere University)
Information about the spread of crop disease is vital in developing countries, and as a result the governments of such countries devote scarce resources to gathering such data. Unfortunately, current surveys tend to be slow and expensive, and hence also tend to gather insufficient quantities of data. In this work we describe three general methods for improving the use of survey resources by performing data collection with mobile devices and by directing survey progress through the application of AI techniques. First, we describe a spatial disease density model based on Gaussian process ordinal regression, which offers a better representation of the disease level distribution, as compared to the statistical approaches typically applied. Second, we show how this model can be used to dynamically route survey teams to obtain the most valuable survey possible given a fixed budget. Third, we demonstrate that the diagnosis of plant disease can be automated using images taken by a camera phone, enabling data collection by survey workers with only basic training. We have applied our methods to the specific challenge of viral cassava disease monitoring in Uganda, for which we have implemented a real-time mobile survey system that will soon see practical use.
Logistic Methods for Resource Selection Functions and Presence-Only Species Distribution Models
Phillips, Steven (AT&T Labs-Research) | Elith, Jane (University of Melbourne)
In order to better protect and conserve biodiversity, ecologists use machine learning and statistics to understand how species respond to their environment and to predict how they will respond to future climate change, habitat loss and other threats. A fundamental modeling task is to estimate the probability that a given species is present in (or uses) a site, conditional on environmental variables such as precipitation and temperature. For a limited number of species, survey data consisting of both presence and absence records are available, and can be used to fit a variety of conventional classification and regression models. For most species, however, the available data consist only of occurrence records --- locations where the species has been observed. In two closely-related but separate bodies of ecological literature, diverse special-purpose models have been developed that contrast occurrence data with a random sample of available environmental conditions. The most widespread statistical approaches involve either fitting an exponential model of species' conditional probability of presence, or fitting a naive logistic model in which the random sample of available conditions is treated as absence data; both approaches have well-known drawbacks, and do not necessarily produce valid probabilities. After summarizing existing methods, we overcome their drawbacks by introducing a new scaled binomial loss function for estimating an underlying logistic model of species presence/absence. Like the Expectation-Maximization approach of Ward et al. and the method of Steinberg and Cardell, our approach requires an estimate of population prevalence, $\Pr(y=1)$, since prevalence is not identifiable from occurrence data alone. In contrast to the latter two methods, our loss function is straightforward to integrate into a variety of existing modeling frameworks such as generalized linear and additive models and boosted regression trees. We also demonstrate that approaches by Lele and Keim and by Lancaster and Imbens that surmount the identifiability issue by making parametric data assumptions do not typically produce valid probability estimates.