Utilisation of Metadata Fields and Query Expansion in Cross-Lingual Search of User-Generated Internet Video

Journal of Artificial Intelligence Research

Recent years have seen significant efforts in the area of Cross Language Information Retrieval (CLIR) for text retrieval. This work initially focused on formally published content, but more recently research has begun to concentrate on CLIR for informal social media content. However, despite the current expansion in online multimedia archives, there has been little work on CLIR for this content. While there has been some limited work on Cross-Language Video Retrieval (CLVR) for professional videos, such as documentaries or TV news broadcasts, there has, to date, been no significant investigation of CLVR for the rapidly growing archives of informal user-generated content (UGC). Key differences between such UGC and professionally produced content are the nature and structure of the textual metadata associated with it, as well as the form and quality of the content itself. In this setting, retrieval effectiveness may suffer not only from translation errors common to all CLIR tasks, but also from recognition errors associated with the automatic speech recognition (ASR) systems used to transcribe the spoken content of the video, and from the informality and inconsistency of the user-created metadata associated with each video. This work proposes and evaluates techniques to improve CLIR effectiveness for such noisy UGC. Our experimental investigation shows that different sources of evidence, e.g. the content from different fields of the structured metadata, significantly affect CLIR effectiveness. Results from our experiments also show that metadata fields vary in their robustness to query expansion (QE), so that expansion from some fields can have a negative impact on CLIR effectiveness. Our work proposes a novel adaptive QE technique that predicts the most reliable source for expansion and shows how this technique can be effective for improving CLIR effectiveness for UGC.


Bayesian Estimation of Bipartite Matchings for Record Linkage

arXiv.org Machine Learning

The bipartite record linkage task consists of merging two disparate datafiles containing information on two overlapping sets of entities. This is non-trivial in the absence of unique identifiers and it is important for a wide variety of applications given that it needs to be solved whenever we have to combine information from different sources. Most statistical techniques currently used for record linkage are derived from a seminal paper by Fellegi and Sunter (1969). These techniques usually assume independence in the matching statuses of record pairs to derive estimation procedures and optimal point estimators. We argue that this independence assumption is unreasonable and instead target a bipartite matching between the two datafiles as our parameter of interest. Bayesian implementations allow us to quantify uncertainty on the matching decisions and derive a variety of point estimators using different loss functions. We propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We evaluate our approach to record linkage using a variety of challenging scenarios and show that it outperforms the traditional methodology. We illustrate the advantages of our methods merging two datafiles on casualties from the civil war of El Salvador.
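The difference between independent pairwise decisions and a bipartite matching can be illustrated with a small sketch. This is not the Bayesian procedure from the paper: it assumes a hypothetical matrix of pairwise similarity scores and a hypothetical threshold, and it finds the best one-to-one matching by brute force, which is only practical for tiny inputs. The function names are illustrative.

```python
from itertools import permutations


def independent_links(scores, threshold):
    """Link every pair whose score exceeds the threshold, ignoring all other pairs."""
    return [
        (i, j)
        for i, row in enumerate(scores)
        for j, s in enumerate(row)
        if s > threshold
    ]


def best_bipartite_matching(scores, threshold):
    """Brute-force the one-to-one matching with the largest total margin over the threshold.

    Assumes len(scores) <= len(scores[0]), i.e. file A has no more records than file B.
    Pairs scoring at or below the threshold are left unlinked.
    """
    n_a, n_b = len(scores), len(scores[0])
    best_pairs, best_total = [], 0.0
    for perm in permutations(range(n_b), n_a):
        pairs = [(i, j) for i, j in enumerate(perm) if scores[i][j] > threshold]
        total = sum(scores[i][j] - threshold for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return best_pairs


# Hypothetical similarity scores: rows are records in file A, columns are records in file B.
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
]

print(independent_links(scores, 0.5))        # record 0 of A is linked to two records of B
print(best_bipartite_matching(scores, 0.5))  # each record appears in at most one link
```

With independent decisions, a single record can end up linked to several records in the other file; constraining the result to a bipartite matching rules this out.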


Client Profiling for an Anti-Money Laundering System

arXiv.org Artificial Intelligence

The prevention of and fight against money laundering (ML) crimes are prioritized by almost every government in the world, on a par with the most pressing global issues. Money laundering is a crime that typically consists of making an illegal financial gain appear legal. According to the United Nations Office on Drugs and Crime (UNODC), the annual global estimate of laundered money is about 2% - 5% of the Gross World Product, or US$800 billion - US$2 trillion [1]. As if the financial volume were not enough, another reason for governments to focus on this crime is that it is clearly connected to other types of crime such as illegal drug trade, fraud, corruption, kidnapping, terrorism, and arms smuggling, among others. Most countries' financial authorities, usually Central Banks, are responsible for controlling and defining anti-money laundering (AML) regulations, demanding from financial institutions the implementation of procedures that apply the defined norms.


Research Priorities for Robust and Beneficial Artificial Intelligence

AI Magazine

Success in the quest for artificial intelligence has the potential to bring unprecedented benefits to humanity, and it is therefore worthwhile to investigate how to maximize these benefits while avoiding potential pitfalls. This article gives numerous examples (which should by no means be construed as an exhaustive list) of such worthwhile research aimed at ensuring that AI remains robust and beneficial.


Human Memory Search as Initial-Visit Emitting Random Walk

Neural Information Processing Systems

Imagine a random walk that outputs a state only when visiting it for the first time. The observed output is therefore a repeat-censored version of the underlying walk, and consists of a permutation of the states or a prefix of it. We call this model initial-visit emitting random walk (INVITE). Prior work has shown that random walks with such a repeat-censoring mechanism explain human behavior in memory search tasks well, which is of great interest in both the study of human cognition and various clinical applications. However, parameter estimation in INVITE is challenging, because naive likelihood computation by marginalizing over infinitely many hidden random walk trajectories is intractable. In this paper, we propose the first efficient maximum likelihood estimate (MLE) for INVITE by decomposing the censored output into a series of absorbing random walks. We also prove theoretical properties of the MLE including identifiability and consistency. We show that INVITE outperforms several existing methods on real-world human response data from memory search tasks.
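The repeat-censoring mechanism described above can be sketched as a short simulation. This covers only the generative process, not the maximum likelihood estimator; the function name, the row-stochastic `transition` matrix, and the `max_steps` cutoff are assumptions made for illustration.

```python
import random


def invite_sample(transition, start, max_steps=10_000, rng=None):
    """Run a random walk and emit each state only on its first visit.

    transition: hypothetical row-stochastic matrix as a list of lists.
    Returns the censored output: a permutation of the states, or a prefix of
    one if the walk is cut off before every state has been visited.
    """
    rng = rng or random.Random()
    n_states = len(transition)
    state = start
    seen = {start}
    output = [start]
    for _ in range(max_steps):
        if len(seen) == n_states:
            break  # every state has been emitted; further steps add nothing
        state = rng.choices(range(n_states), weights=transition[state])[0]
        if state not in seen:  # repeat visits are censored
            seen.add(state)
            output.append(state)
    return output


# A deterministic cycle 0 -> 1 -> 2 -> 0 yields the full permutation.
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(invite_sample(cycle, start=0))

# State 2 is unreachable from state 0, so the output is only a prefix.
blocked = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
print(invite_sample(blocked, start=0, max_steps=200, rng=random.Random(0)))
```

The hidden walk may revisit states arbitrarily often between emissions, which is why many different trajectories map to the same observed output.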


CrossCat: A Fully Bayesian Nonparametric Method for Analyzing Heterogeneous, High Dimensional Data

arXiv.org Machine Learning

There is a widespread need for statistical methods that can analyze high-dimensional datasets without imposing restrictive or opaque modeling assumptions. This paper describes a domain-general data analysis method called CrossCat. CrossCat infers multiple non-overlapping views of the data, each consisting of a subset of the variables, and uses a separate nonparametric mixture to model each view. CrossCat is based on approximately Bayesian inference in a hierarchical, nonparametric model for data tables. This model consists of a Dirichlet process mixture over the columns of a data table in which each mixture component is itself an independent Dirichlet process mixture over the rows; the inner mixture components are simple parametric models whose form depends on the types of data in the table. CrossCat combines strengths of mixture modeling and Bayesian network structure learning. Like mixture modeling, CrossCat can model a broad class of distributions by positing latent variables, and produces representations that can be efficiently conditioned and sampled from for prediction. Like Bayesian networks, CrossCat represents the dependencies and independencies between variables, and thus remains accurate when there are multiple statistical signals. Inference is done via a scalable Gibbs sampling scheme; this paper shows that it works well in practice. This paper also includes empirical results on heterogeneous tabular data of up to 10 million cells, such as hospital cost and quality measures, voting records, unemployment rates, gene expression measurements, and images of handwritten digits. CrossCat infers structure that is consistent with accepted findings and common-sense knowledge in multiple domains and yields predictive accuracy competitive with generative, discriminative, and model-free alternatives.
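The nested structure described above (a partition of columns into views, and within each view a separate partition of rows) can be sketched by sampling from the prior. The sketch below uses the Chinese restaurant process, the standard partition representation of a Dirichlet process mixture. It is not the paper's Gibbs sampler and omits the parametric component models entirely; the function names and concentration parameters are illustrative assumptions.

```python
import random


def crp_partition(n_items, alpha, rng):
    """Sample cluster labels for n_items from a Chinese restaurant process."""
    assignments = []
    counts = []  # counts[k] = number of items already in cluster k
    for _ in range(n_items):
        # Join an existing cluster in proportion to its size,
        # or open a new one in proportion to alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments


def sample_crosscat_structure(n_rows, n_cols, alpha_cols=1.0, alpha_rows=1.0, seed=0):
    """Sample a column-to-view assignment and, per view, a row clustering."""
    rng = random.Random(seed)
    col_view = crp_partition(n_cols, alpha_cols, rng)
    n_views = max(col_view) + 1
    row_clusters = [crp_partition(n_rows, alpha_rows, rng) for _ in range(n_views)]
    return col_view, row_clusters


col_view, row_clusters = sample_crosscat_structure(n_rows=8, n_cols=5)
print("view of each column:", col_view)
for view, clusters in enumerate(row_clusters):
    print(f"row clusters in view {view}:", clusters)
```

Because each view has its own row partition, two rows can fall in the same cluster with respect to one group of columns and in different clusters with respect to another.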


Ethical Artificial Intelligence

arXiv.org Artificial Intelligence

This book-length article combines several peer reviewed papers and new material to analyze the issues of ethical artificial intelligence (AI). The behavior of future AI systems can be described by mathematical equations, which are adapted to analyze possible unintended AI behaviors and ways that AI designs can avoid them. This article makes the case for utility-maximizing agents and for avoiding infinite sets in agent definitions. It shows how to avoid agent self-delusion using model-based utility functions and how to avoid agents that corrupt their reward generators (sometimes called "perverse instantiation") using utility functions that evaluate outcomes at one point in time from the perspective of humans at a different point in time. It argues that agents can avoid unintended instrumental actions (sometimes called "basic AI drives" or "instrumental goals") by accurately learning human values. This article defines a self-modeling agent framework and shows how it can avoid problems of resource limits, being predicted by other agents, and inconsistency between the agent's utility function and its definition (one version of this problem is sometimes called "motivated value selection"). This article also discusses how future AI will differ from current AI, the politics of AI, and the ultimate use of AI to help understand the nature of the universe and our place in it.


Expressiveness of Two-Valued Semantics for Abstract Dialectical Frameworks

Journal of Artificial Intelligence Research

By expressiveness we mean the ability to encode a desired set of two-valued interpretations over a given propositional vocabulary A using only atoms from A. We also compare ADFs' expressiveness with that of (the two-valued semantics of) abstract argumentation frameworks, normal logic programs and propositional logic. While the computational complexity of the two-valued model existence problem for all these languages is (almost) the same, we show that the languages form a neat hierarchy with respect to their expressiveness. We then demonstrate that this hierarchy collapses once we allow the introduction of a linear number of new vocabulary elements. Finally, we analyse and compare the representational succinctness of ADFs (for two-valued model semantics), that is, their capability to represent two-valued interpretation sets in a space-efficient manner.


Extracting Structured Information via Automatic + Human Computation

AAAI Conferences

We present a system for extracting structured information from unstructured text using a combination of information retrieval, natural language processing, machine learning, and crowdsourcing. We test our pipeline by building a structured database of gun violence incidents in the United States. The results of our pilot study demonstrate that the proposed methodology is a viable way of collecting large-scale, up-to-date data for public health, public policy, and social science research.


Machine Interface for Contracting Assistance

AAAI Conferences

We describe a cognitive assistant in early-stage development for the United States Air Force as an aid to contracting officers and potential commercial offerors for navigating the government-contracting process. The goal is easing compliance and affording flexibility and transparency so as to support an innovative and rapid acquisition process. The motivation, use cases, and technical approach for MICA, a Machine Interface for Contracting Assistance, are discussed here along with the technical challenges posed.