Information Retrieval
Recommendation as Collaboration in Web Search
Smyth, Barry (CLARITY: Centre for Sensor Web Technologies) | Freyne, Jill (Tasmanian ICT Centre, CSIRO) | Coyle, Maurice (HeyStaks Technologies Limited) | Briggs, Peter (HeyStaks Technologies Limited)
Recommender systems now play an important role in online information discovery, complementing traditional approaches such as search and navigation with a more proactive approach to discovery that is informed by the user's interests and preferences. To date, recommender systems have been deployed within a variety of e-commerce domains, covering a range of products such as books, music, and movies, and have proven to be a successful way to convert browsers into buyers. Recommendation technologies have a potentially much greater role to play in information discovery, however. In this article we consider recent research that takes a fresh look at web search as a fertile platform for recommender systems research, as users demand a new generation of search engines that are less susceptible to manipulation and more responsive to searcher needs and preferences.
Query-time Entity Resolution
Entity resolution is the problem of reconciling database references corresponding to the same real-world entities. Given the abundance of publicly available databases that have unresolved entities, we motivate the problem of query-time entity resolution: quick and accurate resolution for answering queries over such unclean databases at query-time. Since collective entity resolution approaches (where related references are resolved jointly) have been shown to be more accurate than independent attribute-based resolution for off-line entity resolution, we focus on developing new algorithms for collective resolution for answering entity resolution queries at query-time. For this purpose, we first formally show that, for collective resolution, precision and recall for individual entities follow a geometric progression as neighbors at increasing distances are considered. Unfolding this progression leads naturally to a two-stage "expand and resolve" query processing strategy. In this strategy, we first extract the related records for a query using two novel expansion operators, and then resolve the extracted records collectively. We then show how the same strategy can be adapted for query-time entity resolution by identifying and resolving only those database references that are the most helpful for processing the query. We validate our approach on two large real-world publication databases, where we show the usefulness of collective resolution and at the same time demonstrate the need for adaptive strategies for query processing. We then show how the same queries can be answered in real-time using our adaptive approach while preserving the gains of collective resolution. In addition to experiments on real datasets, we use synthetically generated data to empirically demonstrate the validity of the performance trends predicted by our analysis of collective entity resolution over a wide range of structural characteristics in the data.
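The two-stage "expand and resolve" idea can be illustrated with a small sketch. This is not the authors' algorithm: the breadth-first `expand` function stands in for the paper's expansion operators, and `resolve` is a deliberately simple relational rule (merge two references when their names match and they share a co-occurring name) rather than full collective resolution. All reference ids, names, and links below are made up for illustration.

```python
from collections import deque
from itertools import combinations

def expand(query_refs, neighbors, max_depth):
    """Stage 1: collect every reference within max_depth hops of the query references."""
    depth = {ref: 0 for ref in query_refs}
    queue = deque(query_refs)
    while queue:
        ref = queue.popleft()
        if depth[ref] == max_depth:
            continue  # do not expand past the depth limit
        for nxt in neighbors.get(ref, ()):
            if nxt not in depth:
                depth[nxt] = depth[ref] + 1
                queue.append(nxt)
    return set(depth)

def resolve(refs, name_of, neighbors):
    """Stage 2 (toy rule): merge same-name references that share a neighbor name."""
    parent = {ref: ref for ref in refs}

    def find(ref):
        while parent[ref] != ref:
            parent[ref] = parent[parent[ref]]  # path compression
            ref = parent[ref]
        return ref

    for a, b in combinations(sorted(refs), 2):
        if name_of[a] != name_of[b]:
            continue
        names_a = {name_of[n] for n in neighbors.get(a, ()) if n in refs}
        names_b = {name_of[n] for n in neighbors.get(b, ()) if n in refs}
        if names_a & names_b:  # relational evidence supports the merge
            parent[find(a)] = find(b)

    clusters = {}
    for ref in refs:
        clusters.setdefault(find(ref), set()).add(ref)
    return sorted(sorted(cluster) for cluster in clusters.values())

# Hypothetical data: three "J. Smith" references, each co-occurring with one other reference.
name_of = {"r1": "J. Smith", "r2": "A. Kumar", "r3": "J. Smith",
           "r4": "A. Kumar", "r5": "J. Smith", "r6": "B. Lee"}
neighbors = {"r1": ["r2"], "r2": ["r1"], "r3": ["r4"],
             "r4": ["r3"], "r5": ["r6"], "r6": ["r5"]}
query = ["r1", "r3", "r5"]  # every reference matching the queried name

without_expansion = resolve(expand(query, neighbors, 0), name_of, neighbors)
with_expansion = resolve(expand(query, neighbors, 1), name_of, neighbors)
```

With depth 0 the resolver sees no relational evidence and keeps all three "J. Smith" references apart; with depth 1 it merges `r1` and `r3` because both co-occur with "A. Kumar". Limiting `max_depth` is one crude way to trade accuracy for query-time cost, in the spirit of the adaptive strategy the abstract describes.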
A Comparison of Different Machine Transliteration Models
Machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Four machine transliteration models -- grapheme-based transliteration model, phoneme-based transliteration model, hybrid transliteration model, and correspondence-based transliteration model -- have been proposed by several researchers. To date, however, there has been little research on a framework in which multiple transliteration models can operate simultaneously. Furthermore, there has been no comparison of the four models within the same framework and using the same data. We addressed these problems by 1) modeling the four models within the same framework, 2) comparing them under the same conditions, and 3) developing a way to improve machine transliteration through this comparison. Our comparison showed that the hybrid and correspondence-based models were the most effective and that the four models can be used in a complementary manner to improve machine transliteration performance.
An Expressive Language and Efficient Execution System for Software Agents
Software agents can be used to automate many of the tedious, time-consuming information processing tasks that humans currently have to complete manually. However, to do so, agent plans must be capable of representing the myriad of actions and control flows required to perform those tasks. In addition, since these tasks can require integrating multiple sources of remote information (typically a slow, I/O-bound process), it is desirable to make execution as efficient as possible. To address both of these needs, we present a flexible software agent plan language and a highly parallel execution system that enable the efficient execution of expressive agent plans. The plan language allows complex tasks to be more easily expressed by providing a variety of operators for flexibly processing the data as well as supporting subplans (for modularity) and recursion (for indeterminate looping). The executor is based on a streaming dataflow model of execution to maximize the amount of operator and data parallelism possible at runtime. We have implemented both the language and executor in a system called THESEUS. Our results from testing THESEUS show that streaming dataflow execution can yield significant speedups over both the traditional serial (von Neumann) execution and the non-streaming dataflow-style execution that existing software and robot agent execution systems currently support. In addition, we show how plans written in the language we present can represent certain types of subtasks that cannot be accomplished using the languages supported by network query engines. Finally, we demonstrate that the increased expressivity of our plan language does not hamper performance; specifically, we show how data can be integrated from multiple remote sources just as efficiently using our architecture as is possible with a state-of-the-art streaming-dataflow network query engine.
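The streaming aspect of dataflow execution can be sketched with Python generators. This is only an illustration, not the THESEUS plan language or executor: it is single-threaded, so it shows pipelining (each operator passes a tuple downstream as soon as it is available) but not the operator and data parallelism the abstract describes. The operator names and the event log are assumptions made for the example.

```python
def source(items, log):
    """Simulated remote source: yields items one at a time, logging each fetch."""
    for item in items:
        log.append(f"fetch {item}")
        yield item

def select(stream, predicate):
    """Filter operator: forwards matching items without buffering the stream."""
    for item in stream:
        if predicate(item):
            yield item

def project(stream, fn, log):
    """Transform operator: emits each result as soon as its input arrives."""
    for item in stream:
        result = fn(item)
        log.append(f"emit {result}")
        yield result

log = []
plan = project(select(source([1, 2, 3, 4], log), lambda x: x % 2 == 0),
               lambda x: x * 10, log)
results = list(plan)
```

After running, `log` interleaves fetches and emissions (`emit 20` appears before `fetch 3`), which is the property that distinguishes streaming from non-streaming execution, where an operator waits for its entire input before producing output.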
Many Bills: Visualizing the Anatomy of Congressional Legislation
Aktolga, Elif (University of Massachusetts Amherst) | Ros, Irene (IBM Watson Research Center) | Assogba, Yannick (IBM Watson Research Center) | DiMicco, Joan (IBM Watson Research Center)
US Federal Legislation is a common subject of discussion and advocacy on the web. The contents of bills present a significant challenge to both experts and average citizens due to their length and complex legal language. To make bills more accessible to the general public, we present Many Bills: a web-based visualization prototype that reveals the underlying semantics of a bill. We classify the sections of a bill into topics and visualize them using different colors. Further, using information retrieval techniques, we locate sections that don't seem to fit with the overall topic of the bill. To highlight outliers in our "misfit mode", we visualize them in red, which builds a contrast against the remaining gray sections. Both topic and misfit visualizations provide an overview and detail view of bills, enabling users to read individual sections of a bill and compare topic patterns across multiple bills. We obtained initial user feedback and continue collecting label corrections from users through the interface.
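The abstract does not say which information retrieval techniques are used to find misfit sections. As one plausible illustration only, the sketch below scores each section by its cosine dissimilarity to the combined term counts of all the other sections; the scoring rule, the whitespace tokenizer, and the sample section texts are all assumptions.

```python
import math
from collections import Counter

def term_counts(text):
    """Bag-of-words term frequencies using a naive lowercase whitespace tokenizer."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def misfit_scores(sections):
    """Score each section by 1 - cosine similarity to the rest of the bill."""
    vectors = [term_counts(s) for s in sections]
    scores = []
    for i, vec in enumerate(vectors):
        rest = Counter()
        for j, other in enumerate(vectors):
            if j != i:
                rest.update(other)
        scores.append(1.0 - cosine(vec, rest))
    return scores

# Made-up section texts for a hypothetical bill.
sections = [
    "health insurance coverage for veterans",
    "veterans health care funding",
    "health clinics for rural veterans",
    "tax credit for offshore drilling equipment",
]
scores = misfit_scores(sections)
most_misfit = max(range(len(sections)), key=scores.__getitem__)
```

Here `most_misfit` points at the drilling section, which shares almost no vocabulary with the rest. A realistic version would at least remove stop words and weight terms (for example with TF-IDF) before comparing.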
Towards Large-Scale Collaborative Planning: Answering High-Level Search Queries Using Human Computation
Law, Edith (Carnegie Mellon University) | Zhang, Haoqi (Harvard University)
Behind every search query is a high-level mission that the user wants to accomplish. While current search engines can often provide relevant information in response to well-specified queries, they place the heavy burden of making a plan for achieving a mission on the user. We take the alternative approach of tackling users' high-level missions directly by introducing a human computation system that generates simple plans, by decomposing a mission into goals and retrieving search results tailored to each goal. Results show that our system is able to provide users with diverse, actionable search results and useful roadmaps for accomplishing their missions.
A Whole Page Click Model to Better Interpret Search Engine Click Data
Chen, Weizhu (Microsoft Research Asia and Hong Kong University of Science and Technology) | Ji, Zhanglong (Microsoft Research Asia) | Shen, Si (Hong Kong University of Science and Technology) | Yang, Qiang (Hong Kong University of Science and Technology)
Recent advances in click modeling have established it as an attractive approach to interpret search click data. These advances characterize users' search behavior either in advertisement blocks or within an organic search block through probabilistic models. Yet, when searching for information on a search result page, one is often interacting with the search engine via an entire page instead of a single block. Consequently, previous works that exclusively modeled user behavior in a single block may sacrifice much useful user behavior information embedded in other blocks. To solve this problem, in this paper, we put forward a novel Whole Page Click (WPC) Model to characterize user behavior in multiple blocks. Specifically, WPC uses a Markov chain to learn the user transition probabilities among different blocks in the whole page. To compare our model with the best alternatives in the Web-Search literature, we run a large-scale experiment on a real dataset and demonstrate the advantage of the WPC model in terms of both the whole page and each block in the page. In particular, we find that WPC can achieve significant gain in interpreting the advertisement data, despite the sparsity of the advertisement click data.
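To make the Markov-chain component concrete, the sketch below estimates block-to-block transition probabilities by simple counting over observed block sequences (a maximum-likelihood estimate for a first-order chain). It is not the WPC model itself, whose learning procedure the abstract does not describe; the block names and sessions are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_transitions(sessions):
    """Estimate P(next block | current block) from sequences of visited blocks."""
    counts = defaultdict(Counter)
    for sequence in sessions:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }

# Hypothetical sessions: the order in which a user attended to page blocks.
sessions = [
    ["organic", "top_ads", "organic"],
    ["organic", "side_ads"],
    ["top_ads", "organic", "organic"],
]
transitions = estimate_transitions(sessions)
```

In this toy data every observed move out of `top_ads` goes to `organic`, while moves out of `organic` are split evenly among the three blocks. With sparse ad-click data, a real model would need smoothing or priors rather than raw counts.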
Monitoring Entities in an Uncertain World: Entity Resolution and Referential Integrity
Minton, Steven N. (InferLink Corporation) | Macskassy, Sofus A. (Fetch Technologies) | LaMonica, Peter (Air Force Research Laboratory) | See, Kane (Fetch Technologies) | Knoblock, Craig A. (University of Southern California) | Barish, Greg (Fetch Technologies) | Michelson, Matthew (Fetch Technologies) | Liuzzi, Raymond (Raymond Technologies)
This paper describes a system to help intelligence analysts track and analyze information being published in multiple sources, particularly open sources on the Web. The system integrates technology for Web harvesting, natural language extraction, and network analytics, and allows analysts to view and explore the results via a Web application. One of the difficult problems we address is the entity resolution problem, which occurs when there are multiple, differing ways to refer to the same entity. The problem is particularly complex when noisy data is being aggregated over time, there is no clean master list of entities, and the entities under investigation are intentionally being deceptive. Our system must not only perform entity resolution with noisy data, but must also gracefully recover when entity resolution mistakes are subsequently corrected. We present a case study in arms trafficking that illustrates the issues, and describe how they are addressed.
The News that Matters to You: Design and Deployment of a Personalized News Service
Stefik, Mark Jeffrey (PARC) | Good, Lance (Google)
With the growth of online information, many people are challenged in finding and reading the information most important for their interests. From 2008 to 2010 we built an experimental personalized news system where readers can subscribe to organized channels of information that are curated by experts. AI technology was employed to radically reduce the workload of curators and to efficiently present information to readers. The system has gone through three implementation cycles and processed over 16 million news stories from about 12,000 RSS feeds on over 8,000 topics organized by 160 curators for over 600 registered readers. This paper describes the approach, engineering, and AI technology of the system.
Web Personalization and Cohort Information Services for Natural Resource Managers
Redman, Crystal E. (Colorado State University)
Their information needs are long term and highly dynamic - nearly everything about this topic is in flux. For these users, information search can be made more effective with knowledge about the field and about the types of documents being retrieved. Because the resource management decisions require judgment about the materials collected, the users require confidentiality and must trust the sources. Matilda is designed to 1) tailor information collection for ...
... and popular information needs of the masses. Topic specificity, customizability, and automatically pursuing the long term unique information needs of individual users are not among the strengths of current main stream search engines (Jansen, Spink, and Saracevic 2000) (Teevan, Dumais, and Horvitz 2005). This gap has inspired web personalization and collaborative information seeking tools such as Google Alerts and has encouraged topic-specific blogs and podcasts.