Learning Graphical Models
Distantly Labeling Data for Large Scale Cross-Document Coreference
Singh, Sameer, Wick, Michael, McCallum, Andrew
Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on ``distantly-labeling'' a data set from which we can train a discriminative cross-document coreference model. In particular we build a dataset of more than a million people mentions extracted from 3.5 years of New York Times articles, leverage Wikipedia for distant labeling with a generative model (and measure the reliability of such labeling); then we train and evaluate a conditional random field coreference model that has factors on cross-document entities as well as mention-pairs. This coreference model obtains high accuracy in resolving mentions and entities that are not present in the training data, indicating applicability to non-Wikipedia data. Given the large amount of data, our work is also an exercise demonstrating the scalability of our approach.
Some distance bounds of branching processes and their diffusion limits
Kammerer, Niels B., Stummer, Wolfgang
We compute exact values respectively bounds of "distances" - in the sense of (transforms of) power divergences and relative entropy - between two discrete-time Galton-Watson branching processes with immigration GWI for which the offspring as well as the immigration is arbitrarily Poisson-distributed (leading to arbitrary type of criticality). Implications for asymptotic distinguishability behaviour in terms of contiguity and entire separation of the involved GWI are given, too. Furthermore, we determine the corresponding limit quantities for the context in which the two GWI converge to Feller-type branching diffusion processes, as the time-lags between observations tend to zero. Some applications to (static random environment like) Bayesian decision making and Neyman-Pearson testing are presented as well.
Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes
Petrik, Marek, Taylor, Gavin, Parr, Ron, Zilberstein, Shlomo
Approximate dynamic programming has been used successfully in a large variety of domains, but it relies on a small set of provided approximation features to calculate solutions reliably. Large and rich sets of features can cause existing algorithms to overfit because of a limited number of samples. We address this shortcoming using $L_1$ regularization in approximate linear programming. Because the proposed method can automatically select the appropriate richness of features, its performance does not degrade with an increasing number of features. These results rely on new and stronger sampling bounds for regularized approximate linear programs. We also propose a computationally efficient homotopy method. The empirical evaluation of the approach shows that the proposed method performs well on simple MDPs and standard benchmark problems.
Using Local Alignments for Relation Recognition
Katrenko, S., Adriaans, P. W., van Someren, M.
Aiming at accurate recognition of relations, we introduce local alignment kernels and explore various possibilities of using them for this task. We give a definition of a local alignment (LA) kernel based on the Smith-Waterman score as a sequence similarity measure and proceed with a range of possibilities for computing similarity between elements of sequences. We show how distributional similarity measures obtained from unlabeled data can be incorporated into the learning task as semantic knowledge. Our experiments suggest that the LA kernel yields promising results on various biomedical corpora outperforming two baselines by a large margin. Additional series of experiments have been conducted on the data sets of seven general relation types, where the performance of the LA kernel is comparable to the current state-of-the-art results.
What’s Worthy of Comment? Content and Comment Volume in Political Blogs
Yano, Tae (Carnegie Mellon University) | Smith, Noah A. (Carnegie Mellon University)
In research on blog data, comments are often ignored, What makes a blog post noteworthy? One measure of the and it is easy to see why: comments are very noisy, full popularity or breadth of interest of a blog post is the extent of nonstandard grammar and spelling, usually unedited, often to which readers of the blog are inspired to leave comments cryptic and uninformative, at least to those outside the on the post. In this paper, we study the relationship between blog's community. A few studies have focused on information the text contents of a blog post and the volume of response in comments. Mishe and Glance (2006) showed the it will receive from blog readers. Modeling this relationship value of comments in characterizing the social repercussions has the potential to reveal the interests of a blog's readership of a post, including popularity and controversy. Their largescale community to its authors, readers, advertisers, and scientists user study correlated popularity and comment activity.
Toward Social Causality: An Analysis of Interpersonal Relationships in Online Blogs and Forums
Girju, Roxana (University of Illinois)
In this paper we present encouraging preliminary results into the problem of social causality (causal reasoning used by intelligent agents in a social environment) in online social interactions based on a model of reciprocity. At every level, social relationships are guided by the shared understanding that most actions call for appropriate reactions, and that inappropriate reactions require management. Thus, we present an analysis of interpersonal relationships in English reciprocal contexts. Specifically, we rely here on a large and recently built database of 10,882 reciprocal relation instances in online media. The resource is analyzed along a set of novel and important dimensions: symmetry, affective value, gender}, and {\em intentionality of action which are highly interconnected. At a larger level, we automatically generate {\em chains of causal relations} between verbs indicating interpersonal relationships. Statistics along these dimensions give insights into people's behavior, judgments, and thus their social interactions.
Study of Static Classification of Social Spam Profiles in MySpace
Irani, Danesh (Georgia Institute of Technology) | Webb, Steve (Georgia Institute of Technology) | Pu, Calton (Georgia Institute of Technology)
Reaching hundreds of millions of users, major social networks have become important target media for spammers. Although practical techniques such as collaborative filters and behavioral analysis are able to reduce spam, they have an inherent lag (to collect sufficient data on the spammer) that also limits their effectiveness. Through an experimental study of over 1.9 million MySpace profiles, we make a case for analysis of static user profile content, possibly as soon as such profiles are created. We compare several machine learning algorithms in their ability to distinguish spam profiles from legitimate profiles. We found that a C4.5 decision tree algorithm achieves the highest accuracy (99.4%) of finding rogue profiles, while naïve Bayes achieves a lower accuracy (92.6%). We also conducted a sensitivity analysis of the algorithms w.r.t. features which may be easily removed by spammers.
An Integrated Modeling Environment to Study the Co-evolution of Networks, Individual Behavior and Epidemics
Barrett, Christopher (Network Dynamics and Sim Science Lab) | Bisset, Keith (Network Dynamics and Sim Science Lab) | Leidig, Jonathan (Network Dynamics and Sim Science Lab) | Marathe, Achla (Network Dynamics and Sim Science Lab) | Marathe, Madhav V. (Network Dynamics and Sim Science Lab)
We discuss an interaction-based approach to study the coevolution between socio-technical networks, individual behaviors, and contagion processes on these networks. We use epidemics in human population as an example of this phenomenon. The methods consist of developing synthetic yet realistic national-scale networks using a first principles approach. Unlike simple random graph techniques, these methods combine real world data sources with behavioral and social theories to synthesize detailed social contact (proximity) networks. Individual-based models of within-host disease progression and inter-host transmission are then used to model the contagion process. Finally, models of individual behaviors are composed with disease progression models to develop a realistic representation of the complex system in which individual behaviors and the social network adapt to the contagion. These methods are embodied within Simdemics – a general purpose modeling environment to support pandemic planning and response. Simdemics is designed specifically to be scalable to networks with 300 million agents – the underlying algorithms and methods in Simdemics are all high-performance computing oriented methods. New advances in network science, machine learning, high performance computing, data mining and behavioral modeling were necessary to develop Simdemics. Simdemics is combined with two other environments, Simfrastructure and Didactic, to form an integrated cyberenvironment. The integrated cyber-environment provides the end-user flexible and seamless Internet based access to Simdemics. Service-oriented architectures play a critical role in delivering the desired services to the end user. Simdemics, in conjunction with the integrated cyber-environment, has been used in over a dozen user defined case studies. These case studies were done to support specific policy questions that arose in the context of planning the response to pandemics (e.g., H1N1, H5N1) and human initiated bio-terrorism events. These studies played a crucial role in the continual development and improvement of the cyber-environment.
Ecological non-linear state space model selection via adaptive particle Markov chain Monte Carlo (AdPMCMC)
Peters, Gareth W., Hosack, Geoff R., Hayes, Keith R.
We develop a novel advanced Particle Markov chain Monte Carlo algorithm that is capable of sampling from the posterior distribution of non-linear state space models for both the unobserved latent states and the unknown model parameters. We apply this novel methodology to five population growth models, including models with strong and weak Allee effects, and test if it can efficiently sample from the complex likelihood surface that is often associated with these models. Utilising real and also synthetically generated data sets we examine the extent to which observation noise and process error may frustrate efforts to choose between these models. Our novel algorithm involves an Adaptive Metropolis proposal combined with an SIR Particle MCMC algorithm (AdPMCMC). We show that the AdPMCMC algorithm samples complex, high-dimensional spaces efficiently, and is therefore superior to standard Gibbs or Metropolis Hastings algorithms that are known to converge very slowly when applied to the non-linear state space ecological models considered in this paper. Additionally, we show how the AdPMCMC algorithm can be used to recursively estimate the Bayesian Cram\'er-Rao Lower Bound of Tichavsk\'y (1998). We derive expressions for these Cram\'er-Rao Bounds and estimate them for the models considered. Our results demonstrate a number of important features of common population growth models, most notably their multi-modal posterior surfaces and dependence between the static and dynamic parameters. We conclude by sampling from the posterior distribution of each of the models, and use Bayes factors to highlight how observation noise significantly diminishes our ability to select among some of the models, particularly those that are designed to reproduce an Allee effect.
Scalable Probabilistic Databases with Factor Graphs and MCMC
Wick, Michael, McCallum, Andrew, Miklau, Gerome
Probabilistic databases play a crucial role in the management and understanding of uncertain data. However, incorporating probabilities into the semantics of incomplete databases has posed many challenges, forcing systems to sacrifice modeling power, scalability, or restrict the class of relational algebra formula under which they are closed. We propose an alternative approach where the underlying relational database always represents a single world, and an external factor graph encodes a distribution over possible worlds; Markov chain Monte Carlo (MCMC) inference is then used to recover this uncertainty to a desired level of fidelity. Our approach allows the efficient evaluation of arbitrary queries over probabilistic databases with arbitrary dependencies expressed by graphical models with structure that changes during inference. MCMC sampling provides efficiency by hypothesizing {\em modifications} to possible worlds rather than generating entire worlds from scratch. Queries are then run over the portions of the world that change, avoiding the onerous cost of running full queries over each sampled world. A significant innovation of this work is the connection between MCMC sampling and materialized view maintenance techniques: we find empirically that using view maintenance techniques is several orders of magnitude faster than naively querying each sampled world. We also demonstrate our system's ability to answer relational queries with aggregation, and demonstrate additional scalability through the use of parallelization.