Information Retrieval
Minimally Invasive Randomization for Collecting Unbiased Preferences from Clickthrough Logs
Radlinski, Filip, Joachims, Thorsten
Clickthrough data is a particularly inexpensive and plentiful resource to obtain implicit relevance feedback for improving and personalizing search engines. However, it is well known that the probability of a user clicking on a result is strongly biased toward documents presented higher in the result set irrespective of relevance. We introduce a simple method to modify the presentation of search results that provably gives relevance judgments that are unaffected by presentation bias under reasonable assumptions. We validate this property of the training data in interactive real world experiments. Finally, we show that using these unbiased relevance judgments learning methods can be guaranteed to converge to an ideal ranking given sufficient data.
Evaluating the Robustness of Learning from Implicit Feedback
Radlinski, Filip, Joachims, Thorsten
This paper evaluates the robustness of learning from implicit feedback in web search. In particular, we create a model of user behavior by drawing upon user studies in laboratory and real-world settings. The model is used to understand the effect of user behavior on the performance of a learning algorithm for ranked retrieval. We explore a wide range of possible user behaviors and find that learning from implicit feedback can be surprisingly robust. This complements previous results that demonstrated our algorithm's effectiveness in a real-world search engine application.
Query Chains: Learning to Rank from Implicit Feedback
Radlinski, Filip, Joachims, Thorsten
This paper presents a novel approach for using clickthrough data to learn ranked retrieval functions for web search results. We observe that users searching the web often perform a sequence, or chain, of queries with a similar information need. Using query chains, we generate new types of preference judgments from search engine logs, thus taking advantage of user intelligence in reformulating queries. To validate our method we perform a controlled user study comparing generated preference judgments to explicit relevance judgments. We also implemented a real-world search engine to test our approach, using a modified ranking SVM to learn an improved ranking function from preference data. Our results demonstrate significant improvements in the ranking given by the search engine. The learned rankings outperform both a static ranking function, as well as one trained without considering query chains.
A Knowledge-Based Approach for Selecting Information Sources
Eiter, Thomas, Fink, Michael, Tompits, Hans
Through the Internet and the World-Wide Web, a vast number of information sources has become available, which offer information on various subjects by different providers, often in heterogeneous formats. This calls for tools and methods for building an advanced information-processing infrastructure. One issue in this area is the selection of suitable information sources in query answering. In this paper, we present a knowledge-based approach to this problem, in the setting where one among a set of information sources (prototypically, data repositories) should be selected for evaluating a user query. We use extended logic programs (ELPs) to represent rich descriptions of the information sources, an underlying domain theory, and user queries in a formal query language (here, XML-QL, but other languages can be handled as well). Moreover, we use ELPs for declarative query analysis and generation of a query description. Central to our approach are declarative source-selection programs, for which we define syntax and semantics. Due to the structured nature of the considered data items, the semantics of such programs must carefully respect implicit context information in source-selection rules, and furthermore combine it with possible user preferences. A prototype implementation of our approach has been realized exploiting the DLV KR system and its plp front-end for prioritized ELPs. We describe a representative example involving specific movie databases, and report about experimental results.
Better than the real thing? Iterative pseudo-query processing using cluster-based language models
Kurland, Oren, Lee, Lillian, Domshlak, Carmel
We present a novel approach to pseudo-feedback-based ad hoc retrieval that uses language models induced from both documents and clusters. First, we treat the pseudo-feedback documents produced in response to the original query as a set of pseudo-queries that themselves can serve as input to the retrieval process. Observing that the documents returned in response to the pseudo-queries can then act as pseudo-queries for subsequent rounds, we arrive at a formulation of pseudo-query-based retrieval as an iterative process. Experiments show that several concrete instantiations of this idea, when applied in conjunction with techniques designed to heighten precision, yield performance results rivaling those of a number of previously-proposed algorithms, including the standard language-modeling approach. The use of cluster-based language models is a key contributing factor to our algorithms' success.
PageRank without hyperlinks: Structural re-ranking using links induced by language models
Inspired by the PageRank and HITS (hubs and authorities) algorithms for Web search, we propose a structural re-ranking approach to ad hoc information retrieval: we reorder the documents in an initially retrieved set by exploiting asymmetric relationships between them. Specifically, we consider generation links, which indicate that the language model induced from one document assigns high probability to the text of another; in doing so, we take care to prevent bias against long documents. We study a number of re-ranking criteria based on measures of centrality in the graphs formed by generation links, and show that integrating centrality into standard language-model-based retrieval is quite effective at improving precision at top ranks.
A Delta Debugger for ILP Query Execution
Troncon, Remko, Janssens, Gerda
Because query execution is the most crucial part of Inductive Logic Programming (ILP) algorithms, a lot of effort is invested in developing faster execution mechanisms. These execution mechanisms typically have a low-level implementation, making them hard to debug. Moreover, other factors such as the complexity of the problems handled by ILP algorithms and size of the code base of ILP data mining systems make debugging at this level a very difficult job. In this work, we present the trace-based debugging approach currently used in the development of new execution mechanisms in hipP, the engine underlying the ACE Data Mining system. This debugger uses the delta debugging algorithm to automatically reduce the total time needed to expose bugs in ILP execution, thus making manual debugging step much lighter.
On Chase Termination Beyond Stratification
Meier, Michael, Schmidt, Michael, Lausen, Georg
We study the termination problem of the chase algorithm, a central tool in various database problems such as the constraint implication problem, Conjunctive Query optimization, rewriting queries using views, data exchange, and data integration. The basic idea of the chase is, given a database instance and a set of constraints as input, to fix constraint violations in the database instance. It is well-known that, for an arbitrary set of constraints, the chase does not necessarily terminate (in general, it is even undecidable if it does or not). Addressing this issue, we review the limitations of existing sufficient termination conditions for the chase and develop new techniques that allow us to establish weaker sufficient conditions. In particular, we introduce two novel termination conditions called safety and inductive restriction, and use them to define the so-called T-hierarchy of termination conditions. We then study the interrelations of our termination conditions with previous conditions and the complexity of checking our conditions. This analysis leads to an algorithm that checks membership in a level of the T-hierarchy and accounts for the complexity of termination conditions. As another contribution, we study the problem of data-dependent chase termination and present sufficient termination conditions w.r.t. fixed instances. They might guarantee termination although the chase does not terminate in the general case. As an application of our techniques beyond those already mentioned, we transfer our results into the field of query answering over knowledge bases where the chase on the underlying database may not terminate, making existing algorithms applicable to broader classes of constraints.
Pedagogical Discourse: Connecting Students to Past Discussions and Peer Mentors within an Online Discussion Board
The goal of the Pedagogical Discourse project is to develop instructional tools that will help students and instructors use discussion boards more effectively, with an emphasis on automatically assessing discussion activities and building tools for promoting student discussion participation and learning. In this paper, we present a two related participation and learning scaffolding tools that exploit natural language processing and information retrieval techniques. The PedaBot tool is designed to aid student knowledge acquisition and promote reflection about course topics by connecting related discussions from a knowledge base of past discussions to the current discussion thread. The MentorMatch tool aims at promoting student participation using student mentors, i.e., course peers with a relatively good understanding of a particular topic. The system identifies students who often provide answers on a given topic and encourages classmates to invite mentors to participate in related discussions. Both tools have been integrated into a live discussion board that is used by an undergraduate computer science course. This paper describes our approaches to applying information retrieval and natural language processing techniques in the development of the tools and presents initial results from instrumentation and survey.
Conjunctive Query Answering in the Description Logic EL using a Relational Database System
Lutz, Carsten (University of Bremen) | Toman, David (University of Waterloo) | Wolter, Frank (University of Liverpool)
Conjunctive queries (CQ) are fundamental for accessing description logic (DL) knowledge bases. We study CQ answering in (extensions of) the DL EL, which is popular for large-scale ontologies and underlies the designated OWL2-EL profile of OWL2. Our main contribution is a novel approach to CQ answering that enables the use of standard relational database systems as the basis for query execution. We evaluate our approach using the IBM DB2 system, with encouraging results.