
 Information Retrieval


Learning Hash Functions for Cross-View Similarity Search

AAAI Conferences

Many applications in Multilingual and Multimodal Information Access involve searching large databases of high-dimensional data objects with multiple (conditionally independent) views. In this work we consider the problem of learning hash functions for similarity search across the views for such applications. We propose a principled method for learning a hash function for each view given a set of multiview training data objects. The hash functions map similar objects to similar codes across the views, thus enabling cross-view similarity search. We present results from an extensive empirical study of the proposed approach which demonstrate its effectiveness on Japanese-language People Search and Multilingual People Search problems.
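To make the search step concrete, here is a minimal sketch of retrieval over binary hash codes by Hamming distance. It assumes each view's learned hash function has already turned objects into fixed-length bit codes; the codes, the names `doc_a` to `doc_c`, and the helpers `hamming` and `search` are hypothetical illustrations, not the authors' method. How the hash functions are learned is not shown.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two integer-encoded codes differ."""
    return bin(a ^ b).count("1")


def search(query_code: int, database: dict, k: int = 2) -> list:
    """Return the names of the k database items closest to the query code."""
    ranked = sorted(database.items(), key=lambda item: hamming(query_code, item[1]))
    return [name for name, _ in ranked[:k]]


# Hypothetical 8-bit codes produced by the hash function of one view
# (for example, names written in one language).
database = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110011,
    "doc_c": 0b01001101,
}

# Hypothetical code produced by the hash function of another view
# (for example, the same name written in a different language).
query = 0b10110000

print(search(query, database))
```

Because similar objects are meant to receive similar codes in every view, a query hashed in one view can be compared directly with database codes from another view.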



An Efficient Framework for Constructing Generalized Locally-Induced Text Metrics

AAAI Conferences

In this paper, we propose a new framework for constructing text metrics which can be used to compare and support inferences among terms and sets of terms. Our metric is derived from data-driven kernels on graphs that let us capture global relations among terms and sets of terms, regardless of their complexity and size. To compute the metric efficiently for any two subsets of terms, we develop an approximation technique that relies on the precompiled term-term similarities. To scale up the approach to problems with a huge number of terms, we develop and experiment with a solution that subsamples the term space. We demonstrate the benefits of the whole framework on two text inference tasks: prediction of terms in an article from its abstract and query expansion in information retrieval.


An Assertion Retrieval Algebra for Object Queries over Knowledge Bases

AAAI Conferences

We consider a generalization of instance retrieval over knowledge bases that provides users with assertions in which descriptions of qualifying objects are given in addition to their identifiers. Notably, this involves a transfer of basic database paradigms involving caching and query rewriting in the context of an assertion retrieval algebra. We present an optimization framework for this algebra, with a focus on finding plans that avoid any need for general knowledge base reasoning at query execution time when sufficient cached results of earlier requests exist.


Why do People Retweet? Anti-Homophily Wins the Day!

AAAI Conferences

Twitter and other microblogs have rapidly become a significant means by which people communicate with the world and each other in near real time. There have been a large number of studies surrounding these social media, focusing on areas such as information spread, various centrality measures, topic detection and more. However, one area which has not received much attention is understanding what information is being spread and why it is being spread. This work seeks a better understanding of what makes people spread information in tweets or microblogs through retweeting. Several retweet behavior models are presented and evaluated on a Twitter data set consisting of over 768,000 tweets gathered from monitoring over 30,000 users for a period of one month. We evaluate the proposed models against each user and show how people use different retweet behavior models. For example, we find that although users in the majority of cases do not retweet information on topics that they themselves tweet about or from people who are "like them" (hence anti-homophily), models that take homophily, or similarity, into account fit the observed retweet behaviors much better than more general models that do not. We further find that, not surprisingly, people's retweeting behavior is better explained through multiple different models rather than one model.


Using the H-Index to Estimate Blog Authority

AAAI Conferences

Link analysis is a technique frequently used in the ranking of web sites. On the web, we often encounter content that is organized by entries, sorted from recent to old, and generally follows the structure of a blog. In this paper we explore and evaluate the usage of a bibliometric measure, the h-index, for the task of blog ranking in an information retrieval context. We base our experiments on the TREC Blogs08 collection, which comprises over 28 million posts. The results obtained indicate that the h-index is a robust metric that allows for improved relevance discrimination between blogs when compared to the in-degree. Additionally, tests performed using distinct versions of the post graph indicate that this metric might tolerate a certain level of link clutter.
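As an illustration of the measure itself, the standard h-index can be computed as below. This sketch assumes that a blog is scored from the number of in-links each of its posts receives; the abstract does not spell out that mapping, so it is an assumption, as are the function name `h_index` and the example counts.

```python
def h_index(counts):
    """Largest h such that at least h items have a count of h or more."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical in-link counts per post for two blogs.
blog_a = [10, 8, 5, 4, 3]   # links spread over several posts, total in-degree 30
blog_b = [40, 1, 1, 0, 0]   # one heavily linked post, total in-degree 42

print(h_index(blog_a), h_index(blog_b))
```

In this toy example the blog with the higher total in-degree gets the lower h-index, because the h-index rewards blogs whose links are spread across many posts rather than concentrated on one.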


Event Summarization Using Tweets

AAAI Conferences

Twitter has become exceedingly popular, with hundreds of millions of tweets being posted every day on a wide variety of topics. This has helped make real-time search applications possible with leading search engines routinely displaying relevant tweets in response to user queries. Recent research has shown that a considerable fraction of these tweets are about "events," and the detection of novel events in the tweet-stream has attracted a lot of research interest. However, very little research has focused on properly displaying this real-time information about events. For instance, the leading search engines simply display all tweets matching the queries in reverse chronological order. In this paper we argue that for some highly structured and recurring events, such as sports, it is better to use more sophisticated techniques to summarize the relevant tweets. We formalize the problem of summarizing event-tweets and give a solution based on learning the underlying hidden state representation of the event via Hidden Markov Models. In addition, through extensive experiments on real-world data we show that our model significantly outperforms some intuitive and competitive baselines.


NPCEditor: Creating Virtual Human Dialogue Using Information Retrieval Techniques

AI Magazine

See Leuski et al. (2006) and Leuski and Traum (2008) for more details. The final parameter is the classification threshold on the KL-divergence value: only answers that score above the threshold value are returned from the classifier. The threshold is determined by tuning ...

... to the same question (for example, "What is your name?") depending on who the interactor is looking at. NPCEditor's user interface allows the designer to define arbitrary annotation classes or categories and specify which of these annotation categories should be used in classification.


Cancer: A Computational Disease that AI Can Cure

AI Magazine

Cancer kills millions of people each year. From an AI perspective, finding effective treatments for cancer is a high-dimensional search problem characterized by many molecularly distinct cancer subtypes, many potential targets and drug combinations, and a dearth of high-quality data to connect molecular subtypes and treatments to responses. The broadening availability of molecular diagnostics and electronic medical records presents both opportunities and challenges to apply AI techniques to personalize and improve cancer treatment. We discuss these in the context of Cancer Commons, a "rapid learning" community where patients, physicians, and researchers collect and analyze the molecular and clinical data from every cancer patient, and use these results to individualize therapies. Research opportunities include adaptively planning and executing individual treatment experiments across the whole patient population, inferring the causal mechanisms of tumors, predicting drug response in individuals, and generalizing these findings to new cases. The goal is to treat each patient in accord with the best available knowledge, and to continually update that knowledge to benefit subsequent patients. Achieving this goal is a worthy grand challenge for AI.


A Preliminary Evaluation of Machine Learning in Algorithm Selection for Search Problems

AAAI Conferences

Machine learning is an established method of selecting algorithms to solve hard search problems. Despite this, to date no systematic comparison and evaluation of the different techniques has been performed and the performance of existing systems has not been critically compared to other approaches. We compare machine learning techniques for algorithm selection on real-world data sets of hard search problems. In addition to well-established approaches, for the first time we also apply statistical relational learning to this problem. We demonstrate that most machine learning techniques and existing systems perform less well than one might expect. To guide practitioners, we close by giving clear recommendations as to which machine learning techniques are likely to perform well based on our experiments.