Goto

Collaborating Authors

 Information Retrieval


Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS)

arXiv.org Machine Learning

Recently it was shown that the problem of Maximum Inner Product Search (MIPS) is efficient and it admits provably sub-linear hashing algorithms. Asymmetric transformations before hashing were the key in solving MIPS which was otherwise hard. In the prior work, the authors use asymmetric transformations which convert the problem of approximate MIPS into the problem of approximate near neighbor search which can be efficiently solved using hashing. In this work, we provide a different transformation which converts the problem of approximate MIPS into the problem of approximate cosine similarity search which can be efficiently solved using signed random projections. Theoretical analysis show that the new scheme is significantly better than the original scheme for MIPS. Experimental evaluations strongly support the theoretical findings.


Mining Large-Scale Knowledge Graphs to Discover Inference Paths for Query Expansion in NLIDB

AAAI Conferences

In this paper, we present an approach to mine large-scale knowledge graphs to discover inference paths for query expansion in NLIDB (Natural Language Interface to Databases). Addressing this problem is important in order for NLIDB applications to effectively handle relevant concepts in the domain of interest that do not correspond to any structured fields in the target database. We also present preliminary observations on the performance of our approach applied to Freebase, and conclude with discussions on next steps to further evaluate and extend our approach.


Ontology-Based Translation of Natural Language Queries to SPARQL

AAAI Conferences

We present an implemented approach to transform natural language sentences into SPARQL, using background knowledge from ontologies and lexicons. Therefore, eligible technologies and data storage possibilities are analyzed and evaluated. The contributions of this paper are twofold. Firstly, we describe the motivation and current needs for a natural language access to industry data. We describe several scenarios where the proposed solution is required. Resulting in an architectural approach based on automatic SPARQL query construction for effective natural language queries. Secondly, we analyze the performance of RDBMS, RDF and Triple Stores for the knowledge representation. The proposed approach will be evaluated on the basis of a query catalog by means of query efficiency, accuracy, and data storage performance. The results show, that natural language access to industry data using ontologies and lexicons, is a simple but effective approach to improve the diagnosis process and the data search for a broad range of users. Furthermore, virtual RDF graphs do support the DB-driven knowledge graph representation process, but do not perform efficient under industry conditions in terms of performance and scalability.


Algorithm Selection for Combinatorial Search Problems: A Survey

AI Magazine

The algorithm selection problem is concerned with selecting the best algorithm to solve a given problem instance on a case-by-case basis. It has become especially relevant in the last decade, with researchers increasingly investigating how to identify the most suitable existing algorithm for solving a problem instance instead of developing new algorithms. This survey presents an overview of this work focusing on the contributions made in the area of combinatorial search problems, where algorithm selection techniques have achieved significant performance improvements. We unify and organise the vast literature according to criteria that determine algorithm selection systems in practice.


Topic Similarity Networks: Visual Analytics for Large Document Sets

arXiv.org Machine Learning

We investigate ways in which to improve the interpretability of LDA topic models by better analyzing and visualizing their outputs. We focus on examining what we refer to as topic similarity networks: graphs in which nodes represent latent topics in text collections and links represent similarity among topics. We describe efficient and effective approaches to both building and labeling such networks. Visualizations of topic models based on these networks are shown to be a powerful means of exploring, characterizing, and summarizing large collections of unstructured text documents. They help to "tease out" non-obvious connections among different sets of documents and provide insights into how topics form larger themes. We demonstrate the efficacy and practicality of these approaches through two case studies: 1) NSF grants for basic research spanning a 14 year period and 2) the entire English portion of Wikipedia.


Solving the Target-Value Search Problem

AAAI Conferences

This paper addresses the Target-Value Search (TVS) problem, which is the problem of finding a path between two nodes in a graph whose cost is as close as possible to a given target value, T. This problem has been previously addressed: first, for directed acyclic graphs; second, for general graphs under the assumption that nodes can be revisited given that the same edge can not be traversed twice. In this work we focus on a more restrictive variant of the same problem where nodes can not be revisited. We prove that this variant is NP-complete and discuss novel theoretical properties and provide empirical results to solve this problem optimally.


CiteSeerX: AI in a Digital Library Search Engine

AAAI Conferences

CiteSeerX is a digital library search engine that provides access to more than 4 million academic documents with nearly a million users and millions of hits per day. Artificial intelligence (AI) technologies are used in many components of CiteSeerX, e.g. to accurately extract metadata, intelligently crawl the web, and ingest documents. We present key AI technologies used in the following components: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation. These AI technologies have been developed by CiteSeerX group members over the past 5–6 years. We also show the usage status, payoff, development challenges, main design concepts, and deployment and maintenance requirements. While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and/or search engines.


Cost-Based Query Optimization via AI Planning

AAAI Conferences

In this paper we revisit the problem of generating query plans using AI automated planning with a view to leveraging significant recent advances in state-of-the-art planning techniques. Our efforts focus on the specific problem of cost-based join-order optimization for conjunctive relational queries, a critical component of production-quality query optimizers. We characterize the general query-planning problem as a delete-free planning problem, and query plan optimization as a context-sensitive cost-optimal planning problem. We propose algorithms that generate high-quality query plans, guaranteeing optimality under certain conditions. Our approach is general, supporting the use of a broad suite of domain-independent and domain-specific optimization criteria. Experimental results demonstrate the effectiveness of AI planning techniques for query plan generation and optimization.


Extracting Keyphrases from Research Papers Using Citation Networks

AAAI Conferences

Keyphrases for a document concisely describe the document using a small set of phrases. Keyphrases were previously shown to improve several document processing and retrieval tasks. In this work, we study keyphrase extraction from research papers by leveraging citation networks. We propose CiteTextRank for keyphrase extraction from research articles, a graph-based algorithm that incorporates evidence from both a document's content as well as the contexts in which the document is referenced within a citation network. Our model obtains significant improvements over the state-of-the-art models for this task. Specifically, on several datasets of research papers, CiteTextRank improves precision at rank 1 by as much as 9-20% over state-of-the-art baselines.


Learning Concept Embeddings for Query Expansion by Quantum Entropy Minimization

AAAI Conferences

In web search, users queries are formulated using only few terms and term-matching retrieval functions could fail at retrieving relevant documents. Given a user query, the technique of query expansion (QE) consists in selecting related terms that could enhance the likelihood of retrieving relevant documents. Selecting such expansion terms is challenging and requires a computational framework capable of encoding complex semantic relationships. In this paper, we propose a novel method for learning, in a supervised way, semantic representations for words and phrases. By embedding queries and documents in special matrices, our model disposes of an increased representational power with respect to existing approaches adopting a vector representation. We show that our model produces high-quality query expansion terms. Our expansion increase IR measures beyond expansion from current word-embeddings models and well-established traditional QE methods.