Information Retrieval
NewsFinder: Automating an AI News Service
Eckroth, Joshua (The Ohio State University) | Dong, Liang (Clemson University) | Smith, Reid G. (Marathon Oil Corporation) | Buchanan, Bruce G. (University of Pittsburgh)
NewsFinder automates the steps involved in finding, selecting, categorizing, and publishing news stories that meet relevance criteria for the Artificial Intelligence community. The software combines a broad search of online news sources with topic-specific trained models and heuristics. Since August 2010, the program has been used to operate the AI in the News service that is part of the AAAI AITopics website.
Online Structured Prediction via Coactive Learning
Shivaswamy, Pannaga, Joachims, Thorsten
We propose Coactive Learning as a model of interaction between a learning system and a human user, where both have the common goal of providing results of maximum utility to the user. At each step, the system (e.g. search engine) receives a context (e.g. query) and predicts an object (e.g. ranking). The user responds by correcting the system if necessary, providing a slightly improved -- but not necessarily optimal -- object as feedback. We argue that such feedback can often be inferred from observable user behavior, for example, from clicks in web-search. Evaluating predictions by their cardinal utility to the user, we propose efficient learning algorithms that have ${\cal O}(\frac{1}{\sqrt{T}})$ average regret, even though the learning algorithm never observes cardinal utility values as in conventional online learning. We demonstrate the applicability of our model and learning algorithms on a movie recommendation task, as well as ranking for web-search.
On the Difficulty of Nearest Neighbor Search
He, Junfeng, Kumar, Sanjiv, Chang, Shih-Fu
Fast approximate nearest neighbor (NN) search in large databases is becoming popular. Several powerful learning-based formulations have been proposed recently. However, not much attention has been paid to a more fundamental question: how difficult is (approximate) nearest neighbor search in a given data set? And which data properties affect the difficulty of nearest neighbor search and how? This paper introduces the first concrete measure called Relative Contrast that can be used to evaluate the influence of several crucial data characteristics such as dimensionality, sparsity, and database size simultaneously in arbitrary normed metric spaces. Moreover, we present a theoretical analysis to prove how the difficulty measure (relative contrast) determines/affects the complexity of Local Sensitive Hashing, a popular approximate NN search method. Relative contrast also provides an explanation for a family of heuristic hashing algorithms with good practical performance based on PCA. Finally, we show that most of the previous works in measuring NN search meaningfulness/difficulty can be derived as special asymptotic cases for dense vectors of the proposed measure.
Keyphrase Based Arabic Summarizer (KPAS)
El-Shishtawy, Tarek, El-Ghannam, Fatma
This paper describes a computationally inexpensive and efficient generic summarization algorithm for Arabic texts. The algorithm belongs to extractive summarization family, which reduces the problem into representative sentences identification and extraction sub-problems. Important keyphrases of the document to be summarized are identified employing combinations of statistical and linguistic features. The sentence extraction algorithm exploits keyphrases as the primary attributes to rank a sentence. The present experimental work, demonstrates different techniques for achieving various summarization goals including: informative richness, coverage of both main and auxiliary topics, and keeping redundancy to a minimum. A scoring scheme is then adopted that balances between these summarization goals. To evaluate the resulted Arabic summaries with well-established systems, aligned English/Arabic texts are used through the experiments.
Comparison-Based Learning with Rank Nets
Karbasi, Amin, Ioannidis, Stratis, Massoulie, laurent
We consider the problem of search through comparisons, where a user is presented with two candidate objects and reveals which is closer to her intended target. We study adaptive strategies for finding the target, that require knowledge of rank relationships but not actual distances between objects. We propose a new strategy based on rank nets, and show that for target distributions with a bounded doubling constant, it finds the target in a number of comparisons close to the entropy of the target distribution and, hence, of the optimum. We extend these results to the case of noisy oracles, and compare this strategy to prior art over multiple datasets.
Bayesian Locality Sensitive Hashing for Fast Similarity Search
Satuluri, Venu, Parthasarathy, Srinivasan
Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. Locality-sensitive hashing (LSH) based methods have become a very popular approach for this problem. However, most such methods only use LSH for the first phase of similarity search - i.e. efficient indexing for candidate generation. In this paper, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search - performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. BayesLSH is able to quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, both in terms of accuracy and recall. Finally, the quality of BayesLSH's output can be easily tuned and does not require any manual setting of the number of hashes to use for similarity estimation, unlike standard approaches. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range 2x-20x for a wide variety of datasets.
SearchBuddies: Bringing Search Engines into the Conversation
Hecht, Brent (Northwestern University) | Teevan, Jaime (Microsoft Research) | Morris, Meredith Ringel (Microsoft Research) | Liebling, Dan (Microsoft Research)
Although people receive trusted, personalized recommendations and auxiliary social benefits when they ask questions of their friends, using a search engine is often a more effective way to find an answer. Attempts to integrate social and algorithmic search have thus far focused on bringing social content into algorithmic search results. However, more of the benefits of social search can be preserved by reversing this approach and bringing algorithmic content into natural question-based conversations. To do this successfully, it is necessary to adapt search engine interaction to a social context. In this paper, we present SearchBuddies, a system that responds to Facebook status message questions with algorithmic search results. Via a three-month deployment of the system to 122 social network users, we explore how people responded to search content in a highly social environment. Our experience deploying SearchBuddies shows that a socially embedded search engine can successfully provide users with unique and highly relevant information in a social context and can be integrated into conversations around an information need. The deployment also illuminates specific challenges of embedding a search engine in a social environment and provides guidance toward solutions.
So.cl: An Interest Network for Informal Learning
Farnham, Shelly Diane (Microsoft Research) | Lahav, Michal (Microsoft Research) | Raskino, David (Microsoft Research) | Cheng, Lili (Microsoft Research) | Laird-McConnell, Tom (Microsoft Research)
Web search engines emerged prior to the dominance of social media. What if we imagined search as integrating with social media from the ground up? So.cl is a web application that combines web browsing, search, and social networking for the purposes of sharing and learning around topics of interest. In this paper, we present the results of a deployment study examining existing learning practices around search and social networking for students, and how these practices shifted when participants adopted So.cl. We found prior to using So.cl that students already heavily employed search tools and social media for learning. With the use of So.cl, we found that users engaged in lightweight, fun social sharing and learning for informal, personal topics, but not for more heavyweight collaboration around school or work. The public nature of So.cl encouraged users to post search results as much for self-expression as for searching, enabling serendipitous discovery around interests.
Transductive Learning for Real-Time Twitter Search
Zhang, Xin (Graduate University of Chinese Academy of Sciences) | He, Ben (Graduate University of Chinese Academy of Sciences) | Luo, Tiejian (Graduate University of Chinese Academy of Sciences)
Recency is an important dimension of relevance for real-time Twitter search as users tend to be interested in fresh news and events. By incorporating various sources of evidence, the application of learning to rank (LTR) algorithms to real-time Twitter search has shown beneficial in finding not only relevant, but also recent tweets in response to given queries. However, the potential effectiveness brought by LTR may not have been fully exploited due to the lack of labeled data available for properly learning a ranking model, since human labels are expensive in real-world applications. To this end, this paper proposes a transductive algorithm that incrementally aggregate the labeled tweets through an iterative process. Experimental results on the standard Tweets11 dataset show that our approach is able to outperform strong baselines without the use of human labels.
Enhancing Event Descriptions through Twitter Mining
Tanev, Hristo (Joint Research Centre, European Commission) | Ehrmann, Maud (Joint Research Centre, European Commission) | Piskorski, Jakub (Frontex) | Zavarella, Vanni (Joint Research Centre, European Commission)
We describe a simple IR approach for linking news about events, detected by an event extraction system, to messages from Twitter (tweets). In particular, we explore several methods for creating event-specific queries for Twitter and provide a quantitative and qualitative evaluation of the relevance and usefulness of the information obtained from the tweets. We showed that methods based on utilization of word co-occurrence clustering, domain-specific keywords and named entity recognition improve the performance with respect to a basic approach.