Information Retrieval
ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution
Bahmani, Zeinab, Bertossi, Leopoldo, Vasiloglou, Nikolaos
Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) Classifiers for duplicate/non-duplicate record pairs built using machine learning (ML) techniques, (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) The use of the declarative language LogiQL -an extended form of Datalog supported by the LogicBlox platform- for data processing, and the specification and enforcement of MDs.
Progressive EM for Latent Tree Models and Hierarchical Topic Detection
Chen, Peixian, Zhang, Nevin L., Poon, Leonard K. M., Chen, Zhourong
Hierarchical latent tree analysis (HLTA) is recently proposed as a new method for topic detection. It differs fundamentally from the LDA-based methods in terms of topic definition, topic-document relationship, and learning method. It has been shown to discover significantly more coherent topics and better topic hierarchies. However, HLTA relies on the Expectation-Maximization (EM) algorithm for parameter estimation and hence is not efficient enough to deal with large datasets. In this paper, we propose a method to drastically speed up HLTA using a technique inspired by recent advances in the moments method. Empirical experiments show that our method greatly improves the efficiency of HLTA. It is as efficient as the state-of-the-art LDA-based method for hierarchical topic detection and finds substantially better topics and topic hierarchies.
The RatioLog Project: Rational Extensions of Logical Reasoning
Furbach, Ulrich, Schon, Claudia, Stolzenburg, Frieder, Weis, Karl-Heinz, Wirth, Claus-Peter
Higher-level cognition includes logical reasoning and the ability of question answering with common sense. The RatioLog project addresses the problem of rational reasoning in deep question answering by methods from automated deduction and cognitive computing. In a first phase, we combine techniques from information retrieval and machine learning to find appropriate answer candidates from the huge amount of text in the German version of the free encyclopedia "Wikipedia". In a second phase, an automated theorem prover tries to verify the answer candidates on the basis of their logical representations. In a third phase - because the knowledge may be incomplete and inconsistent -, we consider extensions of logical reasoning to improve the results. In this context, we work toward the application of techniques from human reasoning: We employ defeasible reasoning to compare the answers w.r.t. specificity, deontic logic, normative reasoning, and model construction. Moreover, we use integrated case-based reasoning and machine learning techniques on the basis of the semantic structure of the questions and answer candidates to learn giving the right answers.
On the Static Analysis for SPARQL Queries Using Modal Logic
Guido, Nicola (Universite de Grenoble)
Static analysis is a core task in query optimization and knowledge base verification. We study static analysis techniques for SPARQL, the standard language for querying Semantic Web data. Specifically, we investigate the query containment problem and query-update independence analysis. We are interested in developing techniques through reductions to the validity problem in logic.
How to Define Certain Answers
Libkin, Leonid (University of Edinburgh)
The standard way of answering queries over incomplete databases is to compute certain answers, defined as the intersection of query answers on all complete databases that the incomplete database represents. But is this universally accepted definition correct? We argue that this ``one-size-fits-all'' definition can often lead to counterintuitive or just plain wrong results, and propose an alternative framework for defining certain answers. We combine three previously used approaches, based on the semantics and representation systems, on ordering incomplete databases in terms of their informativeness, and on viewing databases as knowledge expressed in a logical language, to come up with a well justified and principled notion of certain answers. Using it, we show that for queries satisfying some natural conditions (like not losing information if a more informative input is given), computing certain answers is surprisingly easy, and avoids the complexity issues that have been associated with the classical definition.
Mobile Query Recommendation via Tensor Function Learning
Zhao, Zhou (Zhejiang University) | Song, Ruihua (Microsoft Research, Beijing) | Xie, Xing (Microsoft Research, Beijing) | He, Xiaofei (Zhejiang University) | Zhuang, Yueting (Zhejiang University)
With the prevalence of mobile search nowadays, the benefits of mobile query recommendation are well recognized, which provide formulated queries sticking to users’ search intent. In this paper, we introduce the problem of query recommendation on mobile devices and model the user-location-query relations with a tensor representation. Unlike previous studies based on tensor decomposition, we study this problem via tensor function learning. That is, we learn the tensor function from the side information of users, locations and queries, and then predict users’ search intent. We develop an efficient alternating direction method of multipliers (ADMM) scheme to solve the introduced problem. We empirically evaluate our approach based on the mobile query dataset from Bing search engine in the city of Beijing, China, and show that our method can outperform several state-of-the-art approaches.
Modeling Quantum Entanglements in Quantum Language Models
Xie, Mengjiao (Tianjin University) | Hou, Yuexian (Tianjin University) | Zhang, Peng (Tianjin University) | Li, Jingfei (Tianjin University) | Li, Wenjie (The Hong Kong Polytechnic University) | Song, Dawei (Tianjin University)
Recently, a Quantum Language Model (QLM) was proposed to model term dependencies upon Quantum Theory (QT) framework and successively applied in Information Retrieval (IR). Nevertheless, QLM's dependency is based on co-occurrences of terms and has not yet taken into account the Quantum Entanglement (QE), which is a key quantum concept and has a significant cognitive implication. In QT, an entangled state can provide a more complete description for the nature of realities, and determine intrinsic correlations of considered objects globally, rather than those co-occurrences on the surface. It is, however, a real challenge to decide and measure QE using the classical statistics of texts in a post-measurement configuration. In order to circumvent this problem, we theoretically prove the connection between QE and statistically Unconditional Pure Dependence (UPD). Since UPD has an implementable deciding algorithm, we can in turn characterize QE by extracting the UPD patterns from texts. This leads to a measurable QE, based on which we further advance the existing QLM framework. We empirically compare our model with related models, and the results demonstrate the effectiveness of our model.
The Right to Obscure: A Mechanism and Initial Evaluation
Huang, Eric Hsin-Chun (Stanford University) | Lanier, Jaron (Microsoft Research) | Shoham, Yoav (Stanford University)
The recent landmark "right to be forgotten" ruling by the EU Court gives EU citizens the right to remove certain links that are "inaccurate, inadequate, irrelevant or excessive" from search results under their names. While we agree with the spirit of the ruling — to empower individuals to manage their personal data while keeping a balance between such right and the freedom of expression, we believe that the ruling is impractical as it provides neither precise criteria for evaluating removal requests nor concrete guidelines for implementation. Consequently, Google's current implementation has several problems concerning scalability, objectivity, and responsiveness. Instead of the right to be forgotten, we propose the right to obscure certain facts about oneself on search engines, and a simple mechanism which respects the spirit of the ruling by giving people more power to influence search results for queries on their names. Specifically, under our proposed mechanism, data subjects will be able to register minus terms, and search results for their name queries that contain such terms would be filtered out. We implement a proof-of-concept search engine following the proposed mechanism, and conduct experiments to explore the influences it might have on users' impressions on different data subjects.
Coactive Learning
Shivaswamy, Pannaga, Joachims, Thorsten
We propose Coactive Learning as a model of interaction between a learning system and a human user, where both have the common goal of providing results of maximum utility to the user. Interactions in the Coactive Learning model take the following form: at each step, the system (e.g. search engine) receives a context (e.g. query) and predicts an object (e.g. ranking); the user responds by correcting the system if necessary, providing a slightly improved but not necessarily optimal object as feedback. We argue that such preference feedback can be inferred in large quantity from observable user behavior (e.g., clicks in web search), unlike the optimal feedback required in the expert model or the cardinal valuations required for bandit learning. Despite the relaxed requirements for the feedback, we show that it is possible to adapt many existing online learning algorithms to the coactive framework. In particular, we provide algorithms that achieve square root regret in terms of cardinal utility, even though the learning algorithm never observes cardinal utility values directly. We also provide an algorithm with logarithmic regret in the case of strongly convex loss functions. An extensive empirical study demonstrates the applicability of our model and algorithms on a movie recommendation task, as well as ranking for web search.
Topic Extraction and Bundling of Related Scientific Articles
Automatic classification of scientific articles based on common characteristics is an interesting problem with many applications in digital library and information retrieval systems. Properly organized articles can be useful for automatic generation of taxonomies in scientific writings, textual summarization, efficient information retrieval etc. Generating article bundles from a large number of input articles, based on the associated features of the articles is tedious and computationally expensive task. In this report we propose an automatic two-step approach for topic extraction and bundling of related articles from a set of scientific articles in real-time. For topic extraction, we make use of Latent Dirichlet Allocation (LDA) topic modeling techniques and for bundling, we make use of hierarchical agglomerative clustering techniques. We run experiments to validate our bundling semantics and compare it with existing models in use. We make use of an online crowdsourcing marketplace provided by Amazon called Amazon Mechanical Turk to carry out experiments. We explain our experimental setup and empirical results in detail and show that our method is advantageous over existing ones.