Flexible Models for Microclustering with Application to Entity Resolution

arXiv.org Machine Learning

Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.

Evaluating WordNet Features in Text Classification Models

AAAI Conferences

Incorporating semantic features from the WordNet lexical database is one of the many approaches that have been tried to improve the predictive performance of text classification models. The intuition behind this is that keywords in the training set alone may not be extensive enough to enable generation of a universal model for a category, but if we incorporate the word relationships in WordNet, a more accurate model may be possible. Other researchers have previously evaluated the effectiveness of incorporating WordNet synonyms, hypernyms, and hyponyms into text classification models. Generally, they have found that improvements in accuracy using features derived from these relationships depend on the nature of the text corpora from which the document collections are extracted. In this paper, we not only reconsider the role of WordNet synonyms, hypernyms, and hyponyms in text classification models, but also consider the role of WordNet meronyms and holonyms. Incorporating these WordNet relationships into a Coordinate Matching classifier, a Naive Bayes classifier, and a Support Vector Machine classifier, we evaluate our approach on six document collections extracted from the Reuters-21578, USENET, and Digi-Trad text corpora. Experimental results show that none of the WordNet relationships were effective at increasing the accuracy of the Naive Bayes classifier. Synonyms, hypernyms, and holonyms were effective at increasing the accuracy of the Coordinate Matching classifier, and hypernyms were effective at increasing the accuracy of the SVM classifier.

Using More Reasoning to Improve #SAT Solving

AAAI Conferences

Many real-world problems, including inference in Bayes Nets, can be reduced to #SAT, the problem of counting the number of models of a propositional theory. This has motivated the need for efficient #SAT solvers. Currently, such solvers utilize a modified version of DPLL that employs decomposition and caching, techniques that significantly increase the time it takes to process each node in the search space. In addition, the search space is significantly larger than when solving SAT since we must continue searching even after the first solution has been found. It has previously been demonstrated that the size of a DPLL search tree can be significantly reduced by doing more reasoning at each node. However, for SAT the reductions gained are often not worth the extra time required. In this paper we verify the hypothesis that for #SAT this balance changes. In particular, we show that additional reasoning can reduce the size of a #SAT solver's search space, that this reduction cannot always be achieved by the already utilized technique of clause learning, and that this additional reasoning can be cost effective.
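To make the #SAT problem above concrete, the following is a minimal sketch of model counting by plain DPLL-style branching. It is not the solver described in the abstract: it omits decomposition, caching, clause learning, and any extra reasoning at each node. The clause encoding (signed integers for literals) and the names `count_models`, `simplify`, and `_count` are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, List

Clause = FrozenSet[int]  # assumed encoding: literal k means variable k, -k its negation


def simplify(clauses: List[Clause], literal: int) -> List[Clause]:
    """Set `literal` to true: drop satisfied clauses, delete the falsified literal elsewhere."""
    result = []
    for clause in clauses:
        if literal in clause:
            continue  # clause is satisfied, so it no longer constrains anything
        result.append(clause - {-literal})
    return result


def _count(clauses: List[Clause], unassigned: FrozenSet[int]) -> int:
    # An empty clause cannot be satisfied: this branch contributes no models.
    if any(len(clause) == 0 for clause in clauses):
        return 0
    # No clauses left: every remaining variable is free, giving 2^k models.
    if not clauses:
        return 2 ** len(unassigned)
    # Branch on a variable from the first clause. Unlike SAT, both branches
    # are always explored, because every model has to be counted.
    var = abs(next(iter(clauses[0])))
    rest = unassigned - {var}
    return _count(simplify(clauses, var), rest) + _count(simplify(clauses, -var), rest)


def count_models(clauses: Iterable[Iterable[int]], num_vars: int) -> int:
    """Count satisfying assignments of a CNF formula over variables 1..num_vars."""
    clause_list = [frozenset(clause) for clause in clauses]
    return _count(clause_list, frozenset(range(1, num_vars + 1)))


if __name__ == "__main__":
    # (x1 or x2) and (not x1 or x3) over three variables
    print(count_models([[1, 2], [-1, 3]], num_vars=3))
```

The sketch shows why the search space is larger than for SAT: the recursion sums over both branches rather than stopping at the first satisfying assignment.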

Evidence and Belief

AAAI Conferences

We discuss the representation of knowledge and of belief from the viewpoint of decision theory. While the Bayesian approach enjoys general-purpose applicability and axiomatic foundations, it suffers from several drawbacks. In particular, it does not model the belief formation process, and does not relate beliefs to evidence. We survey alternative approaches, and focus on formal models of case-based prediction and case-based decisions. A formal model of belief and knowledge representation needs to address several questions. The most basic ones are: (i) how do we represent knowledge?

Towards Diagnosing Hybrid Systems

AAAI Conferences

This paper reports on the findings of an ongoing project to investigate techniques to diagnose complex dynamical systems that are modeled as hybrid systems. In particular, we examine continuous systems with embedded supervisory controllers which experience abrupt, partial or full failure of component devices. The problem we address is: given a hybrid model of system behavior, a history of executed controller actions, and a history of observations, including an observation of behavior that is aberrant relative to the model of expected behavior, determine what fault occurred to have caused the aberrant behavior. Determining a diagnosis can be cast as a search problem to find the most likely model for the data. Unfortunately, the search space is extremely large. To reduce search space size and to identify an initial set of candidate diagnoses, we propose to exploit techniques originally applied to qualitative diagnosis of continuous systems. We refine these diagnoses using parameter estimation and model fitting techniques. As a motivating case study, we have examined the problem of diagnosing NASA's Sprint AERCam, a small spherical robotic camera unit with 12 thrusters that enable both linear and rotational motion.