Information Retrieval
Materializing and Persisting Inferred and Uncertain Knowledge in RDF Datasets
McGlothlin, James P. (The University of Texas at Dallas) | Khan, Latifur (The University of Texas at Dallas)
As the semantic web grows in popularity and enters the mainstream of computer technology, RDF (Resource Description Framework) datasets are becoming larger and more complex. Advanced semantic web ontologies, especially in medicine and science, are developing. As more complex ontologies are developed, there is a growing need for efficient queries that handle inference. In areas such as research, it is vital to be able to perform queries that retrieve not just facts but also inferred knowledge and uncertain information. OWL (Web Ontology Language) defines rules that govern provable inference in semantic web datasets. In this paper, we detail a database schema using bit vectors that is designed specifically for RDF datasets. We introduce a framework for materializing and storing inferred triples. Our bit vector schema enables storage of inferred knowledge without a query performance penalty. Inference queries are simplified and performance is improved. Our evaluation results demonstrate that our inference solution is more scalable and efficient than the current state-of-the-art. There are also standards being developed for representing probabilistic reasoning within OWL ontologies. We specify a framework for materializing uncertain information and probabilities using these ontologies. We define a multiple vector schema for representing probabilities and classifying uncertain knowledge using thresholds. This solution increases the breadth of information that can be efficiently retrieved.
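As a rough illustration of how bit vectors can hold both asserted and inferred triples, the sketch below keeps one bit vector per (property, object) pair, with one bit per subject. The class `BitVectorStore`, its method names, and the use of Python integers as bit vectors are assumptions made for this example; the abstract does not describe the actual schema, and only a single subclass rule is shown.

```python
class BitVectorStore:
    """Toy triple store: one bit vector per (property, object) pair.

    Hypothetical illustration only; not the schema from the paper.
    """

    def __init__(self):
        self.subject_ids = {}   # subject -> bit position
        self.subjects = []      # bit position -> subject
        self.vectors = {}       # (property, object) -> int used as a bit vector

    def _bit(self, subject):
        # Assign the next free bit position to a subject seen for the first time.
        if subject not in self.subject_ids:
            self.subject_ids[subject] = len(self.subjects)
            self.subjects.append(subject)
        return 1 << self.subject_ids[subject]

    def add(self, subject, prop, obj):
        key = (prop, obj)
        self.vectors[key] = self.vectors.get(key, 0) | self._bit(subject)

    def materialize_subclass(self, sub, sup):
        # Inference for "sub is a subclass of sup": every instance of sub is
        # also an instance of sup, so OR the sub vector into the sup vector.
        sub_bits = self.vectors.get(("rdf:type", sub), 0)
        key = ("rdf:type", sup)
        self.vectors[key] = self.vectors.get(key, 0) | sub_bits

    def subjects_with(self, prop, obj):
        bits = self.vectors.get((prop, obj), 0)
        return [s for i, s in enumerate(self.subjects) if (bits >> i) & 1]


store = BitVectorStore()
store.add("alice", "rdf:type", "Professor")
store.add("bob", "rdf:type", "Student")
store.add("carol", "rdf:type", "Person")
store.materialize_subclass("Professor", "Person")
store.materialize_subclass("Student", "Person")
print(store.subjects_with("rdf:type", "Person"))
```

Once the inferred bits are stored, a query for all instances of `Person` reads a single vector, just as a query over asserted facts would, so no inference has to run at query time.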
Diversifying Query Suggestion Results
Ma, Hao (The Chinese University of Hong Kong) | Lyu, Michael R. (The Chinese University of Hong Kong) | King, Irwin (The Chinese University of Hong Kong)
In order to improve the user search experience, Query Suggestion, a technique for generating alternative queries for Web users, has become an indispensable feature of commercial search engines. However, previous work mainly focuses on suggesting queries relevant to the original query while ignoring the diversity of the suggestions, which may leave Web users' information needs unmet. In this paper, we present a novel unified method to suggest both semantically relevant and diverse queries to Web users. The proposed approach is based on Markov random walk and hitting time analysis on the query-URL bipartite graph. It can effectively prevent semantically redundant queries from receiving a high rank, hence encouraging diversity in the results. We evaluate our method on a large commercial clickthrough dataset in terms of relevance measurement and diversity measurement. The experimental results show that our method is very effective in generating both relevant and diverse query suggestions.
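The idea of using hitting time on a query-URL bipartite graph can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function names, the two-step query-to-URL-to-query walk, the truncation at a fixed number of steps, and the greedy selection rule are all choices made for this example, and every other query in the log is treated as a candidate.

```python
from collections import defaultdict


def query_transitions(clicks):
    """Two-step walk query -> URL -> query from {(query, url): click_count}."""
    q_total, u_total = defaultdict(float), defaultdict(float)
    for (q, u), c in clicks.items():
        q_total[q] += c
        u_total[u] += c
    P = defaultdict(lambda: defaultdict(float))
    for (q1, u1), c1 in clicks.items():
        for (q2, u2), c2 in clicks.items():
            if u1 == u2:
                P[q1][q2] += (c1 / q_total[q1]) * (c2 / u_total[u1])
    return {q: dict(row) for q, row in P.items()}


def truncated_hitting_time(P, targets, steps=20):
    """Expected number of steps (capped at `steps`) to reach any target query."""
    h = {q: 0.0 for q in P}
    for _ in range(steps):
        h = {
            q: 0.0 if q in targets
            else 1.0 + sum(p * h[r] for r, p in P[q].items())
            for q in P
        }
    return h


def diverse_suggestions(clicks, source, k=2, steps=20):
    P = query_transitions(clicks)
    candidates = [q for q in P if q != source]
    if not candidates:
        return []
    # First pick: the candidate closest to the source (smallest hitting time).
    h = truncated_hitting_time(P, {source}, steps)
    chosen = [min(candidates, key=lambda q: h[q])]
    # Later picks: the candidate farthest from everything already chosen.
    while len(chosen) < min(k, len(candidates)):
        h = truncated_hitting_time(P, set(chosen), steps)
        remaining = [q for q in candidates if q not in chosen]
        chosen.append(max(remaining, key=lambda q: h[q]))
    return chosen
```

A query that shares clicked URLs with an already chosen suggestion reaches the chosen set quickly, so its hitting time is small and it is passed over in favor of a query from a different part of the graph.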
Towards an Intelligent Code Search Engine
Kim, Jinhan (Pohang University of Science and Technology) | Lee, Sanghoon (Pohang University of Science and Technology) | Hwang, Seung-won (Pohang University of Science and Technology) | Kim, Sunghun (Hong Kong University of Science and Technology)
Software developers increasingly rely on information from the Web, such as documents or code examples on Application Programming Interfaces (APIs), to facilitate their development processes. However, API documents often do not include enough information for developers to fully understand API usage, while searching for good code examples requires non-trivial effort. To address this problem, we propose a novel code search engine, combining the strengths of browsing documents and searching for code examples, by returning documents embedded with high-quality code example summaries mined from the Web. Our evaluation results show that our approach provides code examples with high precision and boosts programmer productivity.
Session Based Click Features for Recency Ranking
Inagaki, Yoshiyuki (Yahoo Labs) | Sadagopan, Narayanan (Yahoo Labs) | Dupret, Georges (Yahoo Labs) | Dong, Anlei (Yahoo Labs) | Liao, Ciya (Yahoo Labs) | Chang, Yi (Yahoo Labs) | Zheng, Zhaohui (Yahoo Labs)
Recency ranking refers to the ranking of web results by accounting for both relevance and freshness. This is particularly important for "recency sensitive" queries such as breaking news queries. In this study, we propose a set of novel click features to improve machine learned recency ranking. Rather than computing simple aggregate click through rates, we derive these features using the temporal click through data and query reformulation chains. One of the features that we use is click buzz, which captures the spiking interest in a URL for a query. We also propose time weighted click through rates, which treat recent observations as being exponentially more important. The promotion of fresh content is typically determined by the query intent, which can change dynamically over time. Quite often users' query reformulations convey clues about the query's intent. Hence we enrich our click features by following query reformulations, which typically benefit the first query in the chain of reformulations. Our experiments show these novel features can improve the NDCG5 of a major online search engine's ranking for "recency sensitive" queries by up to 1.57%. This is one of the very few studies that exploit temporal click through data and query reformulations for recency ranking.
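A time weighted click through rate of the kind mentioned in the abstract above can be sketched with an exponential decay on the age of each observation. The function name, the `(timestamp, clicks, views)` input shape, and the `half_life` parameter are assumptions for this example; the abstract only states that recent observations count exponentially more.

```python
def time_weighted_ctr(observations, now, half_life):
    """Click through rate in which an observation's weight halves every `half_life`.

    observations: iterable of (timestamp, clicks, views) tuples, with
    timestamps in the same time unit as `now` and `half_life`.
    """
    weighted_clicks = 0.0
    weighted_views = 0.0
    for timestamp, clicks, views in observations:
        age = now - timestamp
        weight = 0.5 ** (age / half_life)  # 1.0 for brand-new data, 0.5 after one half-life
        weighted_clicks += weight * clicks
        weighted_views += weight * views
    if weighted_views == 0:
        return 0.0
    return weighted_clicks / weighted_views


# An old batch with no clicks and a fresh batch where every view was clicked.
log = [(0, 0, 10), (10, 10, 10)]
print(time_weighted_ctr(log, now=10, half_life=10))
```

For this log the plain aggregate rate is 10 / 20 = 0.5, while the weighted rate is 10 / 15, because the older batch counts for only half as much.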
Optimal Strategies for Reviewing Search Results
Huang, Jeff (University of Washington) | Kazeykina, Anna (Moscow State University)
Web search engines respond to a query by returning more results than can be reasonably reviewed. These results typically include the title, link, and snippet of content from the target link. Each result has the potential to be useful or useless and thus reviewing it has a cost and potential benefit. This paper studies the behavior of a rational agent in this setting, whose objective is to maximize the probability of finding a satisfying result while minimizing cost. We propose two similar agents with different capabilities: one that only compares result snippets relatively and one that predicts from the result snippet whether the result will be satisfying. We prove that the optimal strategy for both agents is a stopping rule: the agent reviews a fixed number of results until the marginal cost is greater than the marginal expected benefit, maximizing the overall expected utility. Finally, we discuss the relationship between rational agents and search users and how our findings help us understand reviewing behaviors.
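The shape of such a stopping rule can be illustrated with a deliberately simplified model, which is an assumption for this example and not the model analyzed in the paper: each result at rank i independently satisfies the user with a known probability, each review has a fixed cost, and the agent stops at the first result whose expected benefit falls below that cost. The function name and parameters are hypothetical.

```python
def results_to_review(satisfy_probs, cost, benefit=1.0):
    """Number of results a simple cost-aware agent is willing to review.

    satisfy_probs: per-rank probability that the result is satisfying,
                   assumed known and listed in ranked order.
    cost:          cost of reviewing one result.
    benefit:       value of finding a satisfying result.
    """
    reviewed = 0
    for p in satisfy_probs:
        # Stop once the marginal cost exceeds the marginal expected benefit.
        if benefit * p < cost:
            break
        reviewed += 1
    return reviewed


print(results_to_review([0.5, 0.3, 0.2, 0.05], cost=0.1))
```

With these numbers the agent reviews three results and skips the fourth, whose expected benefit of 0.05 is below the review cost of 0.1.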
Adopting Inference Networks for Online Thread Retrieval
Bhatia, Sumit (Pennsylvania State University) | Mitra, Prasenjit (Pennsylvania State University)
Online forums contain valuable human-generated information. End-users looking for information would like to find only those threads in forums where relevant information is present. Because forum pages have characteristics that distinguish them from generic web pages, special techniques are required to organize and search for information in these forums. Threads and pages in forums are different from other webpages in their hyperlinking patterns. Forum posts also have associated social and non-textual metadata. In this paper, we propose a model for online thread retrieval based on inference networks that utilizes the structural properties of forum threads. We also investigate the effects of incorporating various relevance indicators in our model. We empirically show the effectiveness of our proposed model using real-world data.
Clickthrough Log Analysis by Collaborative Ranking
Cao, Bin (Hong Kong University of Science and Technology) | Shen, Dou (Microsoft) | Wang, Kuansan (Microsoft) | Yang, Qiang (Hong Kong University of Science and Technology)
Analyzing clickthrough log data is important for improving search performance as well as understanding user behaviors. In this paper, we propose a novel collaborative ranking model to tackle two difficulties in analyzing clickthrough logs. First, previous studies have shown that users tend to click top-ranked results even when they are less relevant. Therefore, we use pairwise ranking relations to avoid the position bias in clicks. Second, since click data are extremely sparse with respect to each query or user, we construct a collaboration model to alleviate the sparseness problem. We also find that the proposed model and previous popular click-based models address different aspects of clickthrough log data. We further propose a hybrid model that can achieve significant improvement compared to the baselines on a large-scale real-world dataset.
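One common way to turn clicks into pairwise ranking relations that are less sensitive to position bias is the "clicked result beats the unclicked results ranked above it" heuristic. The sketch below shows that heuristic only; it is an assumption that it resembles the pairwise relations used in the paper, and the function name is hypothetical.

```python
def preference_pairs(ranking, clicked):
    """Return (preferred, less_preferred) pairs from one result list.

    ranking: documents in the order they were shown.
    clicked: documents the user clicked.
    A clicked document is taken to be preferred over every unclicked
    document shown above it, since the user saw and skipped those.
    """
    clicked = set(clicked)
    pairs = []
    for position, doc in enumerate(ranking):
        if doc in clicked:
            pairs.extend(
                (doc, skipped)
                for skipped in ranking[:position]
                if skipped not in clicked
            )
    return pairs


print(preference_pairs(["a", "b", "c", "d"], clicked=["c"]))
```

A click on the top result produces no pairs at all, which is how this heuristic avoids rewarding a document merely for being ranked first.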
A Course-Long Information Retrieval Project
Kauchak, David (Pomona College)
In this paper, we describe the outline for a course-long information retrieval (IR) project. The project guides the students in constructing a working IR system from the ground up. The first half of the project is structured and closely follows common foundational IR concepts. During this portion of the project, a bare-bones IR system is constructed. For the last half of the project, students (in groups) implement research-driven extensions to the basic system with the additional constraint that their project must integrate with the base system. By the end, the students have worked on a large software project (~40 classes with thousands of lines of code) in a group setting and have been introduced to the research process. This project plan has been successfully used in an undergraduate course; resources including starter code, solutions, and an example IR system with project write-ups are available.
A Comparison of Generated Wikipedia Profiles Using Social Labeling and Automatic Keyword Extraction
Russell, Terrell (University of North Carolina at Chapel Hill) | Suh, Bongwon (Palo Alto Research Center) | Chi, Ed H. (Palo Alto Research Center)
In many collaborative systems, researchers are interested in creating representative user profiles. In this paper, we are particularly interested in using social labeling and automatic keyword extraction techniques for generating user profiles. Social labeling is a process in which users manually tag other users with keywords. Automatic keyword extraction is a technique that selects the most salient words to represent a user’s contribution. We apply each of these two profile generation methods to highly active Wikipedia editors and their contributions, and compare the results. We found that profiles generated through social labeling match the profiles generated via automatic keyword extraction, and vice versa. The results suggest that user profiles generated from one method can be used as a seed or bootstrapping proxy for the other method.
A Comparison of Information Seeking Using Search Engines and Social Networks
Morris, Meredith Ringel (Microsoft Research) | Teevan, Jaime (Microsoft Research) | Panovich, Katrina (Massachusetts Institute of Technology)
The Web has become an important information repository; often it is the first source a person turns to with an information need. One common way to search the Web is with a search engine. However, it is not always easy for people to find what they are looking for with keyword search, and at times the desired information may not be readily available online. An alternative, facilitated by the rise of social media, is to pose a question to one's online social network. In this paper, we explore the pros and cons of using a social networking tool to fill an information need, as compared with a search engine. We describe a study in which 12 participants searched the Web while simultaneously posing a question on the same topic to their social network, and we compare the results they found by each method.