A Generalized Fellegi-Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems
Sadinle, Mauricio, Fienberg, Stephen E.
Mauricio Sadinle is a Ph.D. student, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (email: msadinle@stat.cmu.edu); and Stephen E. Fienberg is Maurice Falk University Professor of Statistics and Social Science in the Department of Statistics, the Machine Learning Department, and the Heinz College, Carnegie Mellon University (email: fienberg@stat.cmu.edu). This research was partially supported by NSF Grants BCS-0941518 and SES-1130706 to Carnegie Mellon University, and by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office. The authors thank Rob Hall, Kristian Lum, Michael Larsen, the Associate Editor and two referees for helpful comments and suggestions on earlier versions of this paper, and Jorge A. Restrepo for providing the Colombian homicide data. An early version of this paper was written by the first author when he was affiliated with the Conflict Analysis Resource Center (CERAC) and the National University of Colombia at Bogotá.
Abstract: We present a probabilistic method for linking multiple datafiles. This task is not trivial in the absence of unique identifiers for the individuals recorded. This is a common scenario when linking census data to coverage measurement surveys for census coverage evaluation, and in general when multiple record-systems need to be integrated for posterior analysis. The goal of multiple record linkage is to classify the record K-tuples coming from K datafiles according to the different matching patterns. We use a mixture model to fit matching probabilities via maximum likelihood using the EM algorithm. We present a method to decide the record K-tuples' membership to the subsets of matching patterns and we prove its optimality.
We apply our method to the integration of the three Colombian homicide record systems and perform a simulation study to explore the performance of the method under measurement error and different scenarios. The proposed method works well and opens new directions for future research.
Key words and phrases: Bell number; Census undercount; Data linkage; Data matching; EM algorithm; Mixture model; Multiple systems estimation; Partially ordered set.
1 INTRODUCTION
Record linkage is a widely-used technique for identifying records that refer to the same individual across different datafiles. This task is not trivial when unique identifiers are not available, and many authors have proposed probabilistic methods to deal with this problem building upon the seminal work of Newcombe et al. (1959) and Fellegi and Sunter (1969).
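To make the "matching patterns" and the "Bell number" key word concrete, here is a minimal sketch. It assumes that a matching pattern for a record K-tuple can be read as a partition of its K records into groups that refer to the same individual, so that the number of patterns is the K-th Bell number. The function names `set_partitions` and `bell_number` are illustrative and do not come from the paper.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition


def bell_number(k):
    """Count the partitions of a k-element set by direct enumeration."""
    return sum(1 for _ in set_partitions(list(range(k))))


# Hypothetical labels for one record from each of three datafiles.
patterns = list(set_partitions(["record_1", "record_2", "record_3"]))
for pattern in patterns:
    print(pattern)
print("patterns for K=3:", len(patterns))
```

Under this reading, three datafiles give five patterns: all three records match, exactly one pair matches (three ways), or no records match. Enumeration is only practical for small K, since the count grows quickly.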
Fast Value Iteration for Goal-Directed Markov Decision Processes
Zhang, Nevin Lianwen, Zhang, Weihong
Planning problems where effects of actions are non-deterministic can be modeled as Markov decision processes. Planning problems are usually goal-directed. This paper proposes several techniques for exploiting the goal-directedness to accelerate value iteration, a standard algorithm for solving Markov decision processes. Empirical studies have shown that the techniques can bring about significant speedups.
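For readers unfamiliar with the baseline, the following is a minimal sketch of plain value iteration on a toy goal-directed problem. It does not implement the acceleration techniques the paper proposes. The toy chain, the action names, and the dictionary layout (`transitions[s][a]` as a list of `(probability, next_state)` pairs) are all assumptions made for illustration.

```python
def value_iteration(transitions, rewards, goal_states, gamma=0.95,
                    tol=1e-8, max_iters=10_000):
    """In-place (Gauss-Seidel style) value iteration.

    transitions[s][a] is a list of (probability, next_state) pairs;
    rewards[s][a] is the immediate reward for taking action a in state s.
    Goal states are treated as absorbing with value 0.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iters):
        delta = 0.0
        for s in transitions:
            if s in goal_states:
                continue
            # Bellman backup: best one-step lookahead over all actions.
            best = max(
                rewards[s][a] + gamma * sum(p * values[s2] for p, s2 in outcomes)
                for a, outcomes in transitions[s].items()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    return values


# Toy chain 0 -> 1 -> 2 (goal). "walk" advances with probability 0.8,
# "wait" stays put; every step costs 1 (reward -1).
transitions = {
    0: {"walk": [(0.8, 1), (0.2, 0)], "wait": [(1.0, 0)]},
    1: {"walk": [(0.8, 2), (0.2, 1)], "wait": [(1.0, 1)]},
    2: {},
}
rewards = {
    0: {"walk": -1.0, "wait": -1.0},
    1: {"walk": -1.0, "wait": -1.0},
    2: {},
}
values = value_iteration(transitions, rewards, goal_states={2}, gamma=1.0)
print(values)
```

With `gamma=1.0` the values are negated expected numbers of steps to the goal. This converges here because the "walk" action reaches the goal with probability one.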
Independence of Causal Influence and Clique Tree Propagation
This paper explores the role of independence of causal influence (ICI) in Bayesian network inference. ICI allows one to factorize a conditional probability table into smaller pieces. We describe a method for exploiting the factorization in clique tree propagation (CTP) - the state-of-the-art exact inference algorithm for Bayesian networks. We also present empirical results showing that the resulting algorithm is significantly more efficient than the combination of CTP and previous techniques for exploiting ICI.
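A common textbook example of independence of causal influence is the noisy-OR model, used below to illustrate how a large conditional probability table can be described by a few small pieces. This is a sketch under that assumption, without a leak term; it is not the clique tree propagation method of the paper, and `noisy_or_cpt` and the link probabilities are hypothetical.

```python
from itertools import product


def noisy_or_cpt(link_probs):
    """Build the full table P(effect = 1 | causes) from per-cause parameters.

    link_probs[i] is the probability that cause i, when present, produces
    the effect on its own. Causes are assumed to act independently.
    """
    cpt = {}
    for causes in product([0, 1], repeat=len(link_probs)):
        p_effect_off = 1.0
        for active, p in zip(causes, link_probs):
            if active:
                # Each active cause independently fails with probability 1 - p.
                p_effect_off *= 1.0 - p
        cpt[causes] = 1.0 - p_effect_off
    return cpt


link_probs = [0.9, 0.5, 0.2]  # illustrative values only
cpt = noisy_or_cpt(link_probs)
print(len(link_probs), "parameters describe", len(cpt), "table rows")
```

The full table has one row per configuration of the causes, so it doubles with every added cause, while the factorized description grows by a single parameter.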
Region-Based Approximations for Planning in Stochastic Domains
Zhang, Nevin Lianwen, Liu, Wenju
This paper is concerned with planning in stochastic domains by means of partially observable Markov decision processes (POMDPs). POMDPs are difficult to solve. This paper identifies a subclass of POMDPs called region observable POMDPs, which are easier to solve and can be used to approximate general POMDPs to arbitrary accuracy. Keywords: planning under uncertainty, partially observable Markov decision processes, problem characteristics.
Score and Information for Recursive Exponential Models with Incomplete Data
Recursive graphical models usually underlie the statistical modelling concerning probabilistic expert systems based on Bayesian networks. This paper defines a version of these models, denoted as recursive exponential models, which have evolved from the desire to impose sophisticated domain knowledge onto local fragments of a model. Besides the structural knowledge, as specified by a given model, the statistical modelling may also include expert opinion about the values of parameters in the model. It is shown how to translate imprecise expert knowledge into approximately conjugate prior distributions. Based on possibly incomplete data, the score and the observed information are derived for these models. This accounts for both the traditional score and observed information, derived as derivatives of the log-likelihood, and the posterior score and observed information, derived as derivatives of the log-posterior distribution. Throughout the paper the specialization into recursive graphical models is accounted for by a simple example.
On Stable Multi-Agent Behavior in Face of Uncertainty
A stable joint plan should guarantee the achievement of a designer's goal in a multi-agent environment, while ensuring that deviations from the prescribed plan would be detected. We present a computational framework where stable joint plans can be studied, as well as several basic results about the representation, verification and synthesis of stable joint plans.
Sequential Thresholds: Context Sensitive Default Extensions
Default logic encounters some conceptual difficulties in representing common sense reasoning tasks. We argue that we should not try to formulate modular default rules that are presumed to work in all or most circumstances. We need to take into account the importance of the context which is continuously evolving during the reasoning process. Sequential thresholding is a quantitative counterpart of default logic which makes explicit the role context plays in the construction of a non-monotonic extension. We present a semantic characterization of generic non-monotonic reasoning, as well as the instantiations pertaining to default logic and sequential thresholding. This provides a link between the two mechanisms as well as a way to integrate the two that can be beneficial to both.
Conditional Utility, Utility Independence, and Utility Networks
We introduce a new interpretation of two related notions - conditional utility and utility independence. Unlike the traditional interpretation, the new interpretation renders the notions the direct analogues of their probabilistic counterparts. To capture these notions formally, we appeal to the notion of utility distribution, introduced in a previous paper. We show that utility distributions, which have a structure that is identical to that of probability distributions, can be viewed as a special case of additive multiattribute utility functions, and show how this special case permits us to capture the novel senses of conditional utility and utility independence. Finally, we present the notion of utility networks, which do for utilities what Bayesian networks do for probabilities. Specifically, utility networks exploit the new interpretation of conditional utility and utility independence to compactly represent a utility distribution.
Learning Bayesian Networks from Incomplete Databases
Ramoni, Marco, Sebastiani, Paola
Bayesian approaches to learning the graphical structure of Bayesian Belief Networks (BBNs) from databases share the assumption that the database is complete, that is, no entry is reported as unknown. Attempts to relax this assumption involve the use of expensive iterative methods to discriminate among different structures. This paper introduces a deterministic method to learn the graphical structure of a BBN from a possibly incomplete database. Experimental evaluations show that the method is robust and that its execution time is largely independent of the amount of missing data.
Representing Aggregate Belief through the Competitive Equilibrium of a Securities Market
Pennock, David M., Wellman, Michael P.
We consider the problem of belief aggregation: given a group of individual agents with probabilistic beliefs over a set of uncertain events, formulate a sensible consensus or aggregate probability distribution over these events. Researchers have proposed many aggregation methods, although on the question of which is best the general consensus is that there is no consensus. We develop a market-based approach to this problem, where agents bet on uncertain events by buying or selling securities contingent on their outcomes. Each agent acts in the market so as to maximize expected utility at given securities prices, limited in its activity only by its own risk aversion. The equilibrium prices of goods in this market represent aggregate beliefs. For agents with constant risk aversion, we demonstrate that the aggregate probability exhibits several desirable properties, and is related to independently motivated techniques. We argue that the market-based approach provides a plausible mechanism for belief aggregation in multiagent systems, as it directly addresses self-motivated agent incentives for participation and for truthfulness, and can provide a decision-theoretic foundation for the "expert weights" often employed in centralized pooling techniques.
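As a point of reference for the "expert weights" mentioned above, here is a minimal sketch of one centralized pooling technique, the linear opinion pool, which averages the agents' probability distributions using fixed weights. It is not the market mechanism developed in the paper, and the function name, beliefs, and weights are illustrative assumptions.

```python
def linear_opinion_pool(beliefs, weights):
    """Weighted average of probability distributions over the same outcomes.

    beliefs[j][i] is agent j's probability for outcome i; weights[j] is the
    non-negative weight given to agent j. Weights are normalized internally.
    """
    total = sum(weights)
    n_outcomes = len(beliefs[0])
    return [
        sum(w * b[i] for w, b in zip(weights, beliefs)) / total
        for i in range(n_outcomes)
    ]


# Two agents, two mutually exclusive outcomes; the first agent counts triple.
beliefs = [[0.8, 0.2], [0.4, 0.6]]
weights = [3.0, 1.0]
pooled = linear_opinion_pool(beliefs, weights)
print(pooled)
```

In a centralized pool like this the weights must be supplied from outside, which is the gap a decision-theoretic foundation for expert weights would address.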