Query Processing


700 SQL Queries per Second in Apache Spark with FiloDB

#artificialintelligence

Apache Spark is increasingly seen as the new jack-of-all-trades distributed platform for big data crunching, packing everything from traditional MapReduce-like workloads to streaming, graph computation, statistics, and machine learning into one package. Except for Spark Streaming, with its micro-batches, Spark is focused for the most part on higher-latency, rich/complex analytics workloads. What about using Spark as an embedded, web-speed / low-latency query engine? This post dives into using Apache Spark for low-latency, high-concurrency reporting / dashboard / SQL-like applications - up to hundreds of queries a second! Launching Spark applications on a cluster, or even on localhost, carries a pretty high overhead.
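
The launch overhead is the crux, so here is a minimal sketch of the pattern, assuming plain PySpark with an in-memory table as a stand-in for a FiloDB data source (the table name, schema, and data are invented): one long-lived SparkSession serves many concurrent SQL queries, so only planning and execution sit on the critical path.

```python
# Sketch: reuse one long-lived SparkSession as an embedded SQL engine,
# amortizing the expensive application launch across many queries.
# The table name and rows are hypothetical stand-ins; FiloDB would be
# wired in as a Spark data source instead of this in-memory frame.
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("embedded-sql")
         .getOrCreate())

# Register a small demo table; a real deployment would point at a
# low-latency columnar source instead.
spark.createDataFrame(
    [(1, "gdelt", 42), (2, "gdelt", 7)],
    ["id", "dataset", "events"],
).createOrReplaceTempView("events")

def run_query(i):
    # spark.sql on an already-running session skips app launch overhead;
    # only query planning and execution remain per request.
    return spark.sql(
        "SELECT dataset, SUM(events) FROM events GROUP BY dataset"
    ).collect()

# Fire many concurrent queries against the single shared session.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_query, range(100)))
print(len(results), "queries answered")
```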


An Active Learning Framework using Sparse-Graph Codes for Sparse Polynomials and Graph Sketching

Neural Information Processing Systems

Let $f: \{-1,1\}^n \rightarrow \mathbb{R}$ be an $n$-variate polynomial consisting of $2^n$ monomials, in which only $s\ll 2^n$ coefficients are non-zero. The goal is to learn the polynomial by querying the values of $f$. We introduce an active learning framework that is associated with a low query cost and computational runtime. The significant savings are enabled by leveraging sampling strategies based on modern coding theory, specifically, the design and analysis of {\it sparse-graph codes}, such as Low-Density-Parity-Check (LDPC) codes, which represent the state of the art in modern packet communications. More significantly, we show how this design perspective leads to exciting, and to the best of our knowledge, largely unexplored intellectual connections between learning and coding. The key is to relax the worst-case assumption with an ensemble-average setting, where the polynomial is assumed to be drawn uniformly at random from the ensemble of all polynomials (of a given size $n$ and sparsity $s$). Our framework succeeds with high probability with respect to the polynomial ensemble with sparsity up to $s={O}(2^{\delta n})$ for any $\delta\in(0,1)$, where $f$ is exactly learned using ${O}(ns)$ queries in time ${O}(n s \log s)$, even if the queries are perturbed by Gaussian noise. We further apply the proposed framework to graph sketching, which is the problem of inferring sparse graphs by querying graph cuts. By writing the cut function as a polynomial and exploiting the graph structure, we propose a sketching algorithm to learn an arbitrary $n$-node unknown graph using only a few cut queries, which scales {\it almost linearly} in the number of edges and {\it sub-linearly} in the graph size $n$. Experiments on real datasets show significant reductions in the runtime and query complexity compared with competitive schemes.
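
To make the graph-sketching reduction concrete, here is a toy Python sketch (not the paper's algorithm) of the cut function written as a sparse polynomial over $\{-1,1\}^n$: for edge weights $w_{ij}$, $\mathrm{cut}(x) = \sum_{(i,j)\in E} w_{ij}(1 - x_i x_j)/2$, i.e., one constant monomial plus one degree-2 monomial per edge, so at most $|E|+1$ of the $2^n$ coefficients are non-zero.

```python
# Sketch: the graph-cut function as a sparse polynomial over {-1, 1}^n,
# the representation the paper's sketching algorithm exploits.
# The graph and weights below are made-up toy data.
import itertools

edges = {(0, 1): 2.0, (1, 2): 1.0, (0, 3): 3.0}  # (i, j) -> weight
n = 4

# cut(x) = sum_{(i,j)} w_ij * (1 - x_i * x_j) / 2, giving one constant
# monomial plus one degree-2 monomial per edge.
coeffs = {(): sum(edges.values()) / 2.0}
for (i, j), w in edges.items():
    coeffs[(i, j)] = -w / 2.0

def cut_query(x):
    """Evaluate the cut polynomial at x in {-1, 1}^n (one 'cut query')."""
    return sum(c * (1 if not m else x[m[0]] * x[m[1]])
               for m, c in coeffs.items())

# Sanity check against a direct cut computation over all bipartitions.
for x in itertools.product((-1, 1), repeat=n):
    direct = sum(w for (i, j), w in edges.items() if x[i] != x[j])
    assert abs(cut_query(x) - direct) < 1e-9
```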


An End-to-End Conversational Second Screen Application for TV Program Discovery

AI Magazine

In this article, we report on a multiphase R&D effort to develop a conversational second screen application for TV program discovery. Our goal is to share with the community the breadth of artificial intelligence (AI) and natural language (NL) technologies required to develop such an application, along with lessons learned from target end-users. We first give an overview of our application from the perspective of the end-user. We then present the architecture of our application along with the main AI and NL components, which were developed over multiple phases. The first phase focuses on enabling core functionality, such as effectively finding programs matching the user’s intent. The second phase focuses on enabling dialog with the user. Finally, we present two user studies, corresponding to these two phases. The results from both studies demonstrate the effectiveness of our application in the target domain.


Lower and Upper Bounds for SPARQL Queries over OWL Ontologies

AAAI Conferences

The paper presents an approach for optimizing the evaluation of SPARQL queries over OWL ontologies using SPARQL's OWL Direct Semantics entailment regime. The approach is based on the computation of lower and upper bounds, but we allow for much more expressive queries than related approaches. In order to optimize the evaluation of possible query answers (those in the upper but not in the lower bound), we present a query extension approach that uses schema knowledge from the queried ontology to extend the query with additional parts. We show that the resulting query is equivalent to the original one, and we use the additional parts, which are simple to evaluate, for restricting the bounds of subqueries of the initial query. In an empirical evaluation we show that the proposed query extension approach can reduce query execution time by up to four orders of magnitude.
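
The pruning logic behind the bounds can be sketched in a few lines of Python; the entailment check below is a hypothetical placeholder for a full OWL reasoner call, not the paper's implementation.

```python
# Sketch of the bounds idea: answers in the lower bound are certain and
# need no reasoning; only candidates in the upper bound but not in the
# lower bound are verified with the (expensive) reasoner.
def entailment_check(candidate):
    # Hypothetical placeholder for a full OWL Direct Semantics
    # entailment test; always rejecting keeps the sketch runnable.
    return False

def evaluate(lower_bound, upper_bound):
    certain = set(lower_bound)              # sound answers, reasoning-free
    possible = set(upper_bound) - certain   # only these need the reasoner
    certain.update(c for c in possible if entailment_check(c))
    return certain

print(evaluate({"a", "b"}, {"a", "b", "c"}))  # only 'c' gets checked
```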


Managing large-scale scientific hypotheses as uncertain and probabilistic data

arXiv.org Artificial Intelligence

In view of the paradigm shift that makes science ever more data-driven, in this thesis we propose a synthesis method for encoding and managing large-scale deterministic scientific hypotheses as uncertain and probabilistic data. In the form of mathematical equations, hypotheses symmetrically relate aspects of the studied phenomena. For computing predictions, however, deterministic hypotheses can be abstracted as functions. We build upon Simon's notion of structural equations in order to efficiently extract the (so-called) causal ordering between variables implicit in a hypothesis structure (set of mathematical equations). We show how to process the hypothesis predictive structure effectively through original algorithms for encoding it into a set of functional dependencies (fd's) and then performing causal reasoning in terms of acyclic pseudo-transitive reasoning over fd's. Such reasoning reveals important causal dependencies implicit in the hypothesis predictive data and guides our synthesis of a probabilistic database. As in the field of graphical models in AI, such a probabilistic database should be normalized so that the uncertainty arising from competing hypotheses is decomposed into factors and propagated properly onto predictive data by recovering its joint probability distribution through a lossless join. This is motivated as a design-theoretic principle for data-driven hypothesis management and predictive analytics. The method is applicable to both quantitative and qualitative deterministic hypotheses and is demonstrated in realistic use cases from computational science.
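
A deliberately simplified sketch of the causal-ordering step, assuming each structural equation has already been oriented as a functional dependency from its inputs to the single variable it determines (the fd's below are invented toy examples, not the thesis' full algorithm): the causal ordering then falls out as a topological order of the dependency graph.

```python
# Simplified sketch: with each equation oriented as an fd
# inputs -> output, the causal ordering is a topological order
# of the induced dependency graph.
from graphlib import TopologicalSorter  # Python 3.9+

# fd's extracted from a hypothesis: determined variable -> determinants
fds = {
    "velocity": {"distance", "time"},
    "energy": {"mass", "velocity"},
}

# TopologicalSorter maps node -> predecessors, matching fd orientation.
order = list(TopologicalSorter(fds).static_order())
print(order)  # e.g. ['distance', 'time', 'mass', 'velocity', 'energy']
```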


Report 79-30.pdf

AI Classics

An approach to query optimization is described that draws on two sources of knowledge: real-world constraints on the values for the application domain served by the database; and knowledge about the current structure of the database and the cost of available retrieval processes. Real-world knowledge is embodied in rules that are much like semantic integrity rules. The approach, called "query rephrasing", is to generate semantic equivalents of user queries that cost less to process than the original queries. The operation of a prototype system based on this approach is discussed in the context of simple queries which restrict a single file. The need for heuristics to limit the generation of equivalent queries is also discussed, and a method utilizing "constraint thresholds" derived from
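
A minimal sketch of the rephrasing idea, with invented rules, costs, and threshold (the report's actual rule language and cost model are not reproduced here):

```python
# Sketch of "query rephrasing": semantic rules, much like integrity
# rules, license equivalent predicates, and the cheaper form is chosen.
# Rules, costs, and the threshold are invented for illustration.
rules = {
    # predicate -> semantically equivalent predicate on the same file
    "type = 'tanker'": "capacity > 100000",
}
cost = {
    "type = 'tanker'": 120.0,     # unindexed attribute: full scan
    "capacity > 100000": 8.0,     # indexed attribute: cheap retrieval
}
THRESHOLD = 50.0  # "constraint threshold": don't rephrase cheap queries

def rephrase(predicate):
    # The threshold heuristic limits generation of equivalent queries.
    if cost[predicate] <= THRESHOLD:
        return predicate
    alternative = rules.get(predicate)
    if alternative is not None and cost[alternative] < cost[predicate]:
        return alternative
    return predicate

print(rephrase("type = 'tanker'"))  # -> "capacity > 100000"
```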


Ontology-Based Translation of Natural Language Queries to SPARQL

AAAI Conferences

We present an implemented approach to transform natural language sentences into SPARQL, using background knowledge from ontologies and lexicons. To this end, eligible technologies and data storage possibilities are analyzed and evaluated. The contributions of this paper are twofold. Firstly, we describe the motivation and current needs for natural language access to industry data, along with several scenarios where the proposed solution is required, resulting in an architectural approach based on automatic SPARQL query construction for effective natural language queries. Secondly, we analyze the performance of RDBMS, RDF and triple stores for the knowledge representation. The proposed approach is evaluated on the basis of a query catalog by means of query efficiency, accuracy, and data storage performance. The results show that natural language access to industry data using ontologies and lexicons is a simple but effective approach to improve the diagnosis process and the data search for a broad range of users. Furthermore, virtual RDF graphs do support the DB-driven knowledge graph representation process, but do not perform efficiently under industry conditions in terms of performance and scalability.
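
A minimal sketch of lexicon-driven SPARQL construction, with invented lexicon entries and ontology IRIs rather than the paper's actual resources:

```python
# Sketch: lexicon-driven construction of a SPARQL query from a natural
# language question. Lexicon entries and ontology IRIs are invented
# placeholders, not the paper's actual resources.
lexicon = {
    "pump": ("class", "<http://example.org/onto#Pump>"),
    "pressure": ("property", "<http://example.org/onto#hasPressure>"),
}

def to_sparql(question):
    tokens = question.lower().rstrip("?").split()
    hits = [lexicon[t] for t in tokens if t in lexicon]
    cls = next(iri for kind, iri in hits if kind == "class")
    prop = next(iri for kind, iri in hits if kind == "property")
    # Fill a fixed query template with the mapped ontology terms.
    return f"SELECT ?x ?v WHERE {{ ?x a {cls} . ?x {prop} ?v }}"

print(to_sparql("Which pump has which pressure?"))
```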


Mining Large-Scale Knowledge Graphs to Discover Inference Paths for Query Expansion in NLIDB

AAAI Conferences

In this paper, we present an approach to mine large-scale knowledge graphs to discover inference paths for query expansion in NLIDB (Natural Language Interface to Databases). Addressing this problem is important in order for NLIDB applications to effectively handle relevant concepts in the domain of interest that do not correspond to any structured fields in the target database. We also present preliminary observations on the performance of our approach applied to Freebase, and conclude with a discussion of next steps to further evaluate and extend our approach.
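
A hedged sketch of the path-mining step: a bounded breadth-first search over a toy knowledge graph (the nodes and relation names are invented, not Freebase data) that returns short inference paths an NLIDB could use for expansion.

```python
# Sketch: breadth-first search for short inference paths connecting a
# query concept to a concept that maps to a database field.
from collections import deque

kg = {  # node -> list of (relation, neighbor); invented toy graph
    "academy_award": [("honors", "film")],
    "film": [("directed_by", "director"), ("released_in", "year")],
}

def find_paths(start, goal, max_hops=3):
    paths, queue = [], deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == goal:
            paths.append(path)
            continue
        if len(path) < max_hops:
            for rel, nxt in kg.get(node, ()):
                queue.append((nxt, path + [(node, rel, nxt)]))
    return paths

# Expand a query about "academy award movies" via the mined path(s).
print(find_paths("academy_award", "director"))
```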


Compact Aspect Embedding for Diversified Query Expansions

AAAI Conferences

Diversified query expansion (DQE) approaches aim to select a set of expansion terms with little redundancy among them while covering as many query aspects as possible. Recently, they have experimentally demonstrated their effectiveness for the task of search result diversification. One challenge faced by existing DQE approaches is how to ensure aspect coverage. In this paper, we propose a novel method for DQE, called compact aspect embedding, which exploits trace-norm regularization to learn a low-rank vector space for the query, with each eigenvector of the learnt vector space representing an aspect and the absolute value of its corresponding eigenvalue representing the association strength of that aspect to the query. Meanwhile, each expansion term is mapped into the vector space as well. Based on this novel representation of the query aspects and expansion terms, we design a greedy selection strategy to choose a set of expansion terms that explicitly covers all possible aspects of the query. We test our method on several TREC diversification data sets and show that it significantly outperforms state-of-the-art search result diversification approaches.
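
A rough sketch of the idea, substituting a plain truncated SVD for trace-norm regularization and using synthetic data: singular vectors stand in for aspects, singular values for association strengths, and a greedy loop picks the terms that add the most uncovered, weighted aspect mass.

```python
# Sketch of compact aspect embedding with SVD in place of trace-norm
# regularization; all data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
# Rows: candidate expansion terms; columns: features (e.g. co-occurrence).
X = rng.random((20, 50))

# Low-rank decomposition: keep the top-k aspect directions.
k = 3
U, S, Vt = np.linalg.svd(X, full_matrices=False)
term_emb = U[:, :k] * S[:k]      # each term mapped into the aspect space
weights = S[:k]                  # association strength of each aspect

def greedy_select(term_emb, weights, budget=5):
    covered = np.zeros(len(weights))
    chosen = []
    for _ in range(budget):
        # Marginal gain: weighted aspect mass a term adds beyond coverage.
        gains = (np.maximum(np.abs(term_emb), covered) - covered) @ weights
        gains[chosen] = -np.inf  # don't pick a term twice
        best = int(np.argmax(gains))
        chosen.append(best)
        covered = np.maximum(np.abs(term_emb[best]), covered)
    return chosen

print(greedy_select(term_emb, weights))
```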


Cost-Based Query Optimization via AI Planning

AAAI Conferences

In this paper we revisit the problem of generating query plans using AI automated planning with a view to leveraging significant recent advances in state-of-the-art planning techniques. Our efforts focus on the specific problem of cost-based join-order optimization for conjunctive relational queries, a critical component of production-quality query optimizers. We characterize the general query-planning problem as a delete-free planning problem, and query plan optimization as a context-sensitive cost-optimal planning problem. We propose algorithms that generate high-quality query plans, guaranteeing optimality under certain conditions. Our approach is general, supporting the use of a broad suite of domain-independent and domain-specific optimization criteria. Experimental results demonstrate the effectiveness of AI planning techniques for query plan generation and optimization.
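
For reference, the underlying join-order search that the paper recasts as delete-free, cost-optimal planning can be sketched as textbook dynamic programming over relation subsets; the statistics and cost model below are toy stand-ins, and only left-deep plans are enumerated.

```python
# Sketch: dynamic programming over join orders (the search space the
# paper recasts as planning). Cardinalities, selectivities, and the
# cost model are deliberately crude, made-up stand-ins.
from itertools import combinations

card = {"R": 1000, "S": 100, "T": 10}                 # relation sizes
sel = {frozenset("RS"): 0.01, frozenset("ST"): 0.1}   # join selectivities

def join_card(rel_set):
    # Estimated cardinality: product of sizes times applicable selectivities.
    out = 1.0
    for r in rel_set:
        out *= card[r]
    for pair, s in sel.items():
        if pair <= rel_set:
            out *= s
    return out

best = {frozenset({r}): (card[r], r) for r in card}   # subset -> (cost, plan)
for size in range(2, len(card) + 1):
    for sub in map(frozenset, combinations(card, size)):
        for left in map(frozenset, combinations(sub, 1)):
            rest = sub - left
            cost = best[rest][0] + best[left][0] + join_card(sub)
            if sub not in best or cost < best[sub][0]:
                best[sub] = (cost, (best[rest][1], best[left][1]))

print(best[frozenset(card)])  # cheapest left-deep plan under the toy model
```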