Information Retrieval
Optimizing SPARQL Query Answering over OWL Ontologies
The SPARQL query language is currently being extended by the World Wide Web Consortium (W3C) with so-called entailment regimes. An entailment regime defines how queries are evaluated under more expressive semantics than SPARQL's standard simple entailment, which is based on subgraph matching. The queries are very expressive since variables can occur within complex concepts and can also bind to concept or role names. In this paper, we describe a sound and complete algorithm for the OWL Direct Semantics entailment regime. We further propose several novel optimizations such as strategies for determining a good query execution order, query rewriting techniques, and show how specialized OWL reasoning tasks and the concept and role hierarchy can be used to reduce the query execution time. For determining a good execution order, we propose a cost-based model, where the costs are based on information about the instances of concepts and roles that are extracted from a model abstraction built by an OWL reasoner. We present two ordering strategies: a static and a dynamic one. For the dynamic case, we improve the performance by exploiting an individual clustering approach that allows for computing the cost functions based on one individual sample from a cluster. We provide a prototypical implementation and evaluate the efficiency of the proposed optimizations. Our experimental study shows that the static ordering usually outperforms the dynamic one when accurate statistics are available. This changes, however, when the statistics are less accurate, e.g., due to nondeterministic reasoning decisions. For queries that go beyond conjunctive instance queries we observe an improvement of up to three orders of magnitude due to the proposed optimizations.
Active Evaluation of Ranking Functions Based on Graded Relevance (Extended Abstract)
Sawade, Christoph (University of Potsdam) | Bickel, Steffen (Nokia gate5 GmbH) | Oertzen, Timo von (University of Virginia) | Scheffer, Tobias (University of Potsdam) | Landwehr, Niels (University of Potsdam)
Evaluating the quality of ranking functions is a core task in web search and other information retrieval domains. Because query distributions and item relevance change over time, ranking models often cannot be evaluated accurately on held-out training data. Instead, considerable effort is spent on manually labeling the relevance of query results for test queries in order to track ranking performance. We address the problem of estimating ranking performance as accurately as possible on a fixed labeling budget. Estimates are based on a set of most informative test queries selected by an active sampling distribution. Query labeling costs depend on the number of result items and item-specific attributes such as document length. We derive cost-optimal sampling distributions for commonly used ranking performance measures. Experiments on web search engine data illustrate significant reductions in labeling costs.
Integrating Semantic Relatedness and Words' Intrinsic Features for Keyword Extraction
Zhang, Wei (Tsinghua University) | Feng, Wei (Tsinghua University) | Wang, Jianyong (Tsinghua University)
Keyword extraction attracts much attention for its significant role in various natural language processing tasks. While some existing methods for keyword extraction have considered using single type of semantic relatedness between words or inherent attributes of words, almost all of them ignore two important issues: 1) how to fuse multiple types of semantic relations between words into a uniform semantic measurement and automatically learn the weights of the edges between the words in the word graph of each document, and 2) how to integrate the relations between words and words' intrinsic features into a unified model. In this work, we tackle the two issues based on the supervised random walk model. We propose a supervised ranking based method for keyword extraction, which is called SEAFARER. It can not only automatically learn the weights of the edges in the unified graph of each document which includes multiple semantic relations but also combine the merits of semantic relations of edges and intrinsic attributes of nodes together. We conducted extensive experimental study on an established benchmark and the experimental results demonstrate that SEAFARER outperforms the state-of-the-art supervised and unsupervised methods.
Harmonious Hashing
Xu, Bin (Zhejiang University) | Bu, Jiajun (Zhejiang University) | Lin, Yue (Zhejiang University) | Chen, Chun (Zhejiang University) | He, Xiaofei (Zhejiang University) | Cai, Deng (Zhejiang University)
Hashing-based fast nearest neighbor search technique has attracted great attention in both research and industry areas recently.Many existing hashing approaches encode data with projection-based hash functions and represent each projected dimension by 1-bit.However, the dimensions with high variance hold large energy or information of data but treated equivalently as dimensions with low variance,which leads to a serious information loss.In this paper, we introduce a novel hashing algorithm called Harmonious Hashing which aims at learning hash functions with low information loss.Specifically, we learn a set of optimized projections to preserve the maximum cumulative energy and meet the constraint of equivalent variance on each dimension as much as possible.In this way, we could minimize the information loss after binarization.Despite the extreme simplicity, our method outperforms superiorly to many state-of-the-art hashing methods in large-scale and high-dimensional nearest neighbor search experiments.
A Unified Approximate Nearest Neighbor Search Scheme by Combining Data Structure and Hashing
Zhang, Debing (Zhejiang University) | Yang, Genmao (Zhejiang University) | Hu, Yao (Zhejiang University) | Jin, Zhongming (Zhejiang University) | Cai, Deng (Zhejiang University) | He, Xiaofei (Zhejiang University)
Nowadays, Nearest Neighbor Search becomes more and more important when facing the challenge of big data. Traditionally, to solve this problem, researchers mainly focus on building effective data structures such as hierarchical k-means tree or using hashing methods to accelerate the query process. In this paper, we propose a novel unified approximate nearest neighbor search scheme to combine the advantages of both the effective data structure and the fast Hamming distance computation in hashing methods. In this way, the searching procedure can be further accelerated. Computational complexity analysis and extensive experiments have demonstrated the effectiveness of our proposed scheme.
Search Strategies for Optimal Multi-Way Number Partitioning
Moffitt, Michael D. (IBM Corp.)
The number partitioning problem seeks to divide a set of n numbers across k distinct subsets so as to minimize the sum of the largest partition. In this work, we develop a new optimal algorithm for multi-way number partitioning. A critical observation motivating our methodology is that a globally optimal k -way partition may be recursively constructed by obtaining suboptimal solutions to subproblems of size k – 1. We introduce a new principle of optimality that provides necessary and sufficient conditions for this construction, and use it to strengthen the relationship between sequential decompositions by enforcing upper and lower bounds on intermediate solutions. We also demonstrate how to further prune unpromising partial assignments by detecting and eliminating dominated solutions. Our approach outperforms the previous state-of-the-art by up to four orders of magnitude , reducing average runtime on the largest benchmarks from several hours to less than a second.
Topic Segmentation and Labeling in Asynchronous Conversations
Joty, S., Carenini, G., Ng, R. T.
Topic segmentation and labeling is often considered a prerequisite for higher-level conversation analysis and has been shown to be useful in many Natural Language Processing (NLP) applications. We present two new corpora of email and blog conversations annotated with topics, and evaluate annotator reliability for the segmentation and labeling tasks in these asynchronous conversations. We propose a complete computational framework for topic segmentation and labeling in asynchronous conversations. Our approach extends state-of-the-art methods by considering a fine-grained structure of an asynchronous conversation, along with other conversational features by applying recent graph-based methods for NLP. For topic segmentation, we propose two novel unsupervised models that exploit the fine-grained conversational structure, and a novel graph-theoretic supervised model that combines lexical, conversational and topic features. For topic labeling, we propose two novel (unsupervised) random walk models that respectively capture conversation specific clues from two different sources: the leading sentences and the fine-grained conversational structure. Empirical evaluation shows that the segmentation and the labeling performed by our best models beat the state-of-the-art, and are highly correlated with human annotations.
Data-Parallel Computing Meets STRIPS
Karpas, Erez (Technion - Israel Institute of Technology) | Sagi, Tomer (Technion - Israel Institute of Technology) | Domshlak, Carmel (Technion - Israel Institute of Technology) | Gal, Avigdor ( Technion - Israel Institute of Technology ) | Mendelson, Avi (Technion - Israel Institute of Technology ) | Tennenholtz, Moshe (Microsoft Research and Technion)
The increased demand for distributed computations on “big data” has led to solutions such as SCOPE, DryadLINQ, Pig, and Hive, which allow the user to specify queries in an SQL-like language, enriched with sets of user-defined operators. The lack of exact semantics for user-defined operators interferes with the query optimization process, thus putting the burden of suggesting, at least partial, query plans on the user. In an attempt to ease this burden, we propose a formal model that allows for data-parallel program synthesis (DPPS) in a semantically well-defined manner. We show that this model generalizes existing frameworks for data-parallel computation, while providing the flexibility of query plan generation that is currently absent from these frameworks. In particular, we show how existing, off-the-shelf, AI planning tools can be used for solving DPPS tasks.
RockIt: Exploiting Parallelism and Symmetry for MAP Inference in Statistical Relational Models
Noessner, Jan (University of Mannheim) | Niepert, Mathias (University of Washington) | Stuckenschmidt, Heiner (University of Mannheim)
RockIt is a maximum a-posteriori (MAP) query engine for statistical relational models. MAP inference in graphical models is an optimization problem which can be compiled to integer linear programs (ILPs).We describe several advances in translating MAP queries to ILP instances and present the novel meta-algorithm cutting plane aggregation (CPA). CPA exploits local context-specific symmetries and bundles up sets of linear constraints. The resulting counting constraints lead to more compact ILPs and make the symmetry of the ground model more explicit to state-of-the-art ILP solvers. Moreover, RockIt parallelizes most parts of the MAP inference pipeline taking advantage of ubiquitous shared-memory multi-core architectures. We report on extensive experiments with Markov logic network (MLN) benchmarks showing that RockIt outperforms the state-of-the-art systems Alchemy, Markov TheBeast, and Tuffy both in terms of efficiency and quality of results.
Reciprocal Hash Tables for Nearest Neighbor Search
Liu, Xianglong (Beihang University) | He, Junfeng (Columbia University and Facebook) | Lang, Bo (Beihang University)
Recent years have witnessed the success of hashingtechniques in approximate nearest neighbor search. Inpractice, multiple hash tables are usually employed toretrieve more desired results from all hit buckets ofeach table. However, there are rare works studying theunified approach to constructing multiple informativehash tables except the widely used random way. In thispaper, we regard the table construction as a selectionproblem over a set of candidate hash functions. Withthe graph representation of the function set, we proposean efficient solution that sequentially applies normal-ized dominant set to finding the most informative andindependent hash functions for each table. To furtherreduce the redundancy between tables, we explore thereciprocal hash tables in a boosting manner, where thehash function graph is updated with high weights em-phasized on the misclassified neighbor pairs of previoushash tables. The construction method is general andcompatible with different types of hashing algorithmsusing different feature spaces and/or parameter settings.Extensive experiments on two large-scale benchmarksdemonstrate that the proposed method outperforms bothnaive construction method and state-of-the-art hashingalgorithms, with up to 65.93% accuracy gains.