Query Processing
Query Complexity of k-NN based Mode Estimation
Singhal, Anirudh, Pirojiwala, Subham, Karamchandani, Nikhil
Motivated by the mode estimation problem of an unknown multivariate probability density function, we study the problem of identifying the point with the minimum k-th nearest neighbor distance for a given dataset of n points. We study the case where the pairwise distances are a priori unknown, but we have access to an oracle which we can query to get noisy information about the distance between any pair of points. For two natural oracle models, we design a sequential learning algorithm, based on the idea of confidence intervals, which adaptively decides which queries to send to the oracle and is able to correctly solve the problem with high probability. We derive instance-dependent upper bounds on the query complexity of our proposed scheme and also demonstrate significant improvement over the performance of other baselines via extensive numerical evaluations.
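The abstract does not spell out the algorithm, so the following is a minimal sketch of the confidence-interval idea under one plausible oracle model: each query (i, j) returns the true distance plus independent noise supported on an interval of length 1, so repeated queries shrink a Hoeffding interval around each pairwise distance. All identifiers (noisy_oracle, delta, max_rounds) and the refinement rule are illustrative assumptions, not the paper's scheme.

```python
# Sketch: adaptively query noisy pairwise distances until one point's upper
# confidence bound on its k-th NN distance beats every other point's lower bound.
import math

def find_min_knn_point(n, k, noisy_oracle, delta=0.05, max_rounds=10000):
    sums = {(i, j): 0.0 for i in range(n) for j in range(i + 1, n)}
    cnts = {p: 0 for p in sums}

    def half_width(p):
        # Hoeffding half-width for noise supported on an interval of length 1
        return math.sqrt(math.log(4 * n * n * max_rounds / delta) / (2 * cnts[p]))

    for p in sums:                                  # one query per pair to start
        sums[p] += noisy_oracle(*p); cnts[p] += 1

    for _ in range(max_rounds):
        lo, hi = {}, {}
        for (i, j), s in sums.items():
            m, w = s / cnts[(i, j)], half_width((i, j))
            lo[(i, j)] = lo[(j, i)] = max(m - w, 0.0)
            hi[(i, j)] = hi[(j, i)] = m + w
        # bounds on each point's k-th nearest neighbor distance
        klo = [sorted(lo[(i, j)] for j in range(n) if j != i)[k - 1] for i in range(n)]
        khi = [sorted(hi[(i, j)] for j in range(n) if j != i)[k - 1] for i in range(n)]
        best = min(range(n), key=lambda i: khi[i])
        if all(khi[best] <= klo[i] for i in range(n) if i != best):
            return best                             # confidently separated: stop
        # refine the most ambiguous point by re-querying all its pairs
        amb = max(range(n), key=lambda i: khi[i] - klo[i])
        for j in range(n):
            if j != amb:
                p = (min(amb, j), max(amb, j))
                sums[p] += noisy_oracle(*p); cnts[p] += 1
    return best
```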
Knowledge Graph-based Question Answering with Electronic Health Records
Park, Junwoo, Cho, Youngwoo, Lee, Haneol, Choo, Jaegul, Choi, Edward
Question Answering (QA) on Electronic Health Records (EHR), namely EHR QA, can serve as a crucial milestone towards developing an intelligent agent in healthcare. EHR data are typically stored in a relational database, which can also be converted to a Directed Acyclic Graph (DAG), allowing two approaches for EHR QA: Table-based QA and Knowledge Graph-based QA. We hypothesize that the graph-based approach is more suitable for EHR QA, as graphs can represent relations between entities and values more naturally than tables, which essentially require JOIN operations. To validate our hypothesis, we first construct EHR QA datasets based on MIMIC-III, where the same question-answer pairs are represented in SQL (table-based) and SPARQL (graph-based), respectively. We then test a state-of-the-art EHR QA model on both datasets, where the model demonstrates superior QA performance on the SPARQL version. Finally, we open-source both MIMICSQL* and MIMIC-SPARQL* to encourage further EHR QA research in both directions.
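To make the table-vs-graph contrast concrete, here is a purely illustrative example of one hypothetical question ("what is the diagnosis of patient 123?") phrased both ways; the simplified schema and predicate names are made up and are not the actual MIMICSQL*/MIMIC-SPARQL* schemas.

```python
# Illustration only: the same hypothetical question against a simplified
# relational schema (SQL) and against a graph view of the same records (SPARQL).

sql_query = """
SELECT d.short_title
FROM diagnoses AS d
JOIN admissions AS a ON d.hadm_id = a.hadm_id   -- JOIN needed to relate tables
WHERE a.subject_id = 123;
"""

sparql_query = """
SELECT ?title
WHERE {
  ?adm  <has_subject_id>  123 .      # edges traverse relations directly,
  ?adm  <has_diagnosis>   ?diag .    # no explicit JOIN required
  ?diag <has_short_title> ?title .
}
"""
```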
Mastering Presto: Hands-On Learning
Learn Presto - distributed SQL Query Engine for Big Data! Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organisations like Facebook. In the first part of the course I will cover Presto's theory, including its architecture and components: coordinator, worker, connector, query execution model, etc. Additionally, I will explain how Kafka, Cassandra, Hive, PostgreSQL and Redshift work before covering the specifics of their connectors.
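As a flavor of what the connector model buys you, here is a hedged sketch of running a federated Presto query from Python with the `prestodb` (presto-python-client) DB-API package; the coordinator host, catalogs, and table names are placeholders. The point is that a single SQL statement can join data served by different connectors.

```python
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # placeholder coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT u.name, count(*) AS events
    FROM hive.web.click_events e          -- served by the Hive connector
    JOIN postgresql.public.users u        -- served by the PostgreSQL connector
      ON e.user_id = u.id
    GROUP BY u.name
    ORDER BY events DESC
    LIMIT 10
""")
print(cur.fetchall())
```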
Query complexity of adversarial attacks
Głuch, Grzegorz, Urbanke, Rüdiger
The decision boundary of a learning algorithm applied to a given task can be viewed as the outcome of a random process: (i) generate a training set and (ii) apply the, potentially randomized, learning algorithm to it. Recall (see Definitions 4 and 5) that a query-bounded adversary knows neither the sample on which the model was trained nor the randomness used by the learner. This means that if the decision boundary has high entropy, then the adversary needs to ask many questions to recover the boundary to a high degree of precision. This suggests that high-entropy decision boundaries are robust against query-bounded adversaries, since intuitively an approximate knowledge of the decision boundary is a prerequisite for a successful attack. Following this reasoning, we present two instances where high entropy of the decision boundary leads to security.
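A toy example (not the paper's construction) makes the entropy-vs-queries tradeoff tangible: for a 1-D threshold classifier with an unknown threshold t, a label-query adversary recovers t by binary search, needing about log2(1/eps) queries to reach precision eps, which is roughly the entropy of t discretized at scale eps when t is uniform.

```python
import random

def recover_threshold(f, eps):
    # binary search using only label queries f(x) to the deployed model
    lo, hi, queries = 0.0, 1.0, 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if f(mid):       # positive label: threshold lies to the left
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2, queries

t = random.random()                      # high-entropy boundary: uniform on [0, 1]
est, q = recover_threshold(lambda x: x >= t, eps=1e-6)
print(f"recovered {est:.6f} vs true {t:.6f} using {q} label queries")
```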
Revealing Secrets in SPARQL Session Level
Zhang, Xinyue, Wang, Meng, Saleem, Muhammad, Ngomo, Axel-Cyrille Ngonga, Qi, Guilin, Wang, Haofen
Based on Semantic Web technologies, knowledge graphs help users to discover information of interest by using live SPARQL services. Answer-seekers often examine intermediate results iteratively and modify SPARQL queries repeatedly in a search session. In this context, understanding user behaviors is critical for effective intention prediction and query optimization. However, these behaviors have not yet been researched systematically at the SPARQL session level. This paper reveals the secrets of session-level user search behaviors by conducting a comprehensive investigation over massive real-world SPARQL query logs. In particular, we thoroughly assess query changes made by users w.r.t. structural and data-driven features of SPARQL queries. To illustrate the potential of our findings, we employ a proof-of-concept model to predict user intentions, i.e., future directions of the given session, and give reformulation suggestions based on the predicted intention. We hope the results presented here will help to devise efficient SPARQL caching, auto-completion, query suggestion, approximation, and relaxation techniques in the future.
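As a rough illustration of what "structural features of SPARQL queries" might look like in code, here is a minimal sketch that counts triple patterns and keyword usage per query and diffs consecutive queries in a session; the crude string-based counting and the feature set are stand-ins, not the paper's actual methodology.

```python
import re

KEYWORDS = ("OPTIONAL", "FILTER", "UNION", "LIMIT", "ORDER BY", "GROUP BY")

def features(query: str) -> dict:
    body = re.search(r"\{(.*)\}", query, re.S)
    triples = body.group(1).count(".") if body else 0  # crude triple-pattern count
    return {"triples": triples,
            **{kw: query.upper().count(kw) for kw in KEYWORDS}}

def session_deltas(session: list[str]) -> list[dict]:
    # per-feature change between consecutive queries in one session
    feats = [features(q) for q in session]
    return [{k: b[k] - a[k] for k in a} for a, b in zip(feats, feats[1:])]
```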
HyperBench: A Benchmark and Tool for Hypergraphs and Empirical Findings
Fischl, Wolfgang, Gottlob, Georg, Longo, Davide Mario, Pichler, Reinhard
To cope with the intractability of answering Conjunctive Queries (CQs) and solving Constraint Satisfaction Problems (CSPs), several notions of hypergraph decompositions have been proposed -- giving rise to different notions of width, notably, plain, generalized, and fractional hypertree width (hw, ghw, and fhw). Given the increasing interest in using such decomposition methods in practice, a publicly accessible repository of decomposition software, as well as a large set of benchmarks, and a web-accessible workbench for inserting, analyzing, and retrieving hypergraphs are called for. We address this need by providing (i) concrete implementations of hypergraph decompositions (including new practical algorithms), (ii) a new, comprehensive benchmark of hypergraphs stemming from disparate CQ and CSP collections, and (iii) HyperBench, our new web-interface for accessing the benchmark and the results of our analyses. In addition, we describe a number of actual experiments we carried out with this new infrastructure.
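For readers unfamiliar with the CQ-to-hypergraph correspondence behind these width notions, here is a small sketch: the vertices are the query's variables and each atom contributes the hyperedge of variables it mentions. The query and relation names are made up.

```python
# CQ:  ans(x, z) :- R(x, y), S(y, z), T(z, x)
atoms = {
    "R": ("x", "y"),
    "S": ("y", "z"),
    "T": ("z", "x"),
}

vertices = sorted({v for vs in atoms.values() for v in vs})
hyperedges = [frozenset(vs) for vs in atoms.values()]

print("V =", vertices)                    # ['x', 'y', 'z']
print("E =", [set(e) for e in hyperedges])
# Decomposition methods then bound how "tree-like" (V, E) is; hw, ghw, and fhw
# are different ways of charging hyperedges to the bags of a tree decomposition.
```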
Hasura introduces a GraphQL-based data virtualization cloud
GraphQL, which has nothing to do with graph databases but has everything to do with JSON queries, is a technology that originated at Facebook to simplify access to data. Now Hasura, a two-year-old company, is building a data virtualization tool around it and last week introduced a cloud service. At the core of Hasura's offerings is its own implementation of a GraphQL query engine, which it has open sourced. Currently designed to work against PostgreSQL-compatible databases, the engine scans the metadata of specified tables on the target, builds a GraphQL endpoint, represents the data in JSON format, and automatically generates a selection of possible queries. Atop its GraphQL query engine, the original Hasura enterprise on-premises offering includes adapters for PostgreSQL databases; other relational targets are planned for the future.
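To show what querying such an auto-generated endpoint looks like, here is a hedged sketch using Python's `requests` library against a Hasura-style GraphQL API; the endpoint URL, the `users` table, and the credential are hypothetical placeholders.

```python
import requests

query = """
query {
  users(where: {active: {_eq: true}}, limit: 10) {
    id
    name
  }
}
"""

resp = requests.post(
    "https://my-app.example.com/v1/graphql",        # hypothetical endpoint
    json={"query": query},
    headers={"x-hasura-admin-secret": "<secret>"},  # placeholder credential
)
print(resp.json())   # rows come back as JSON under data.users
```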
NeuralQA: A Usable Library for Question Answering (Contextual Query Expansion + BERT) on Large Datasets
Existing tools for Question Answering (QA) have challenges that limit their use in practice. They can be complex to set up or integrate with existing infrastructure, do not offer configurable interactive interfaces, and do not cover the full set of subtasks that frequently comprise the QA pipeline (query expansion, retrieval, reading, and explanation/sensemaking). To help address these issues, we introduce NeuralQA - a usable library for QA on large datasets. NeuralQA integrates well with existing infrastructure (e.g., ElasticSearch instances and reader models trained with the HuggingFace Transformers API) and offers helpful defaults for QA subtasks. It introduces and implements contextual query expansion (CQE) using a masked language model (MLM) as well as relevant snippets (RelSnip) - a method for condensing large documents into smaller passages that can be speedily processed by a document reader model. Finally, it offers a flexible user interface to support workflows for research explorations (e.g., visualization of gradient-based explanations to support qualitative inspection of model behaviour) and large scale search deployment. Code and documentation for NeuralQA are available as open source on Github.
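Below is a minimal sketch of contextual query expansion with a masked language model in the spirit of CQE, not NeuralQA's actual implementation: append a mask token to the query and take the MLM's top completions as candidate expansion terms, using the HuggingFace fill-mask pipeline.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def expand_query(query: str, top_k: int = 5) -> list[str]:
    # ask the MLM which terms plausibly continue the query in context
    candidates = fill(f"{query} {fill.tokenizer.mask_token}", top_k=top_k)
    return [c["token_str"].strip() for c in candidates]

print(expand_query("what causes high blood"))   # e.g. ['pressure', ...]
```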
Query complexity of heavy hitter estimation
Sarmasarkar, Sahasrajit, Reddy, Kota Srinivas, Karamchandani, Nikhil
We consider the problem of identifying the subset $\mathcal{S}^{\gamma}_{\mathcal{P}}$ of elements in the support of an underlying distribution $\mathcal{P}$ whose probability value is larger than a given threshold $\gamma$, by actively querying an oracle to gain information about a sequence $X_1, X_2, \ldots$ of i.i.d. samples drawn from $\mathcal{P}$. We consider two query models: $(a)$ each query is an index $i$ and the oracle returns the value $X_i$ and $(b)$ each query is a pair $(i,j)$ and the oracle gives a binary answer confirming if $X_i = X_j$ or not. For each of these query models, we design sequential estimation algorithms which, at each round, either decide what query to send to the oracle depending on the entire history of responses or decide to stop and output an estimate of $\mathcal{S}^{\gamma}_{\mathcal{P}}$, which is required to be correct with some pre-specified large probability. We provide upper bounds on the query complexity of the algorithms for any distribution $\mathcal{P}$ and also derive lower bounds on the optimal query complexity under the two query models. We also consider noisy versions of the two query models and propose robust estimators which can effectively counter the noise in the oracle responses.
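Here is a minimal sketch of a sequential estimator under query model $(a)$: query indices $i = 1, 2, \ldots$ to observe samples, keep Hoeffding-style confidence intervals on each element's empirical probability, and stop once no element's interval straddles the threshold $\gamma$. The names and the stopping rule are illustrative, not the paper's exact scheme.

```python
import math
from collections import Counter

def estimate_heavy_hitters(sample_oracle, gamma, delta=0.05, max_queries=10**6):
    counts, t = Counter(), 0
    while t < max_queries:
        counts[sample_oracle(t)] += 1   # query model (a): ask for X_t
        t += 1
        w = math.sqrt(math.log(2 * t * t / delta) / (2 * t))  # anytime CI half-width
        undecided = [x for x, c in counts.items() if abs(c / t - gamma) <= w]
        # unseen elements have empirical mass 0, so they are settled once
        # w < gamma; stop when nothing straddles the threshold
        if w < gamma and not undecided:
            break
    return {x for x, c in counts.items() if c / t > gamma}
```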
Query Reformulation using Query History for Passage Retrieval in Conversational Search
Lin, Sheng-Chieh, Yang, Jheng-Hong, Nogueira, Rodrigo, Tsai, Ming-Feng, Wang, Chuan-Ju, Lin, Jimmy
Passage retrieval in a conversational context is essential for many downstream applications; it is however extremely challenging due to limited data resources. To address this problem, we present an effective multi-stage pipeline for passage ranking in conversational search that integrates a widely-used IR system with a conversational query reformulation module. Along these lines, we propose two simple yet effective query reformulation approaches: historical query expansion (HQE) and neural transfer reformulation (NTR). Whereas HQE applies query expansion, a traditional IR query reformulation technique, NTR transfers human knowledge of conversational query understanding to a neural query reformulation model. The proposed HQE method was the top-performing submission of automatic systems in CAsT Track at TREC 2019. Building on this, our NTR approach improves by an additional 18% over that best entry in terms of NDCG@3. We further analyze the distinct behaviors of the two approaches, and show that fusing their outputs reduces the performance gap (measured in NDCG@3) between manually rewritten and automatically generated queries from 22 points to 4 points when compared with the best CAsT submission.
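To give a feel for the HQE idea, here is a rough sketch: score terms from earlier turns in the session and append the highest-scoring ones to the current query before retrieval. The scoring used here (plain term frequency with a stopword filter) is a stand-in for the paper's actual keyword-selection heuristics.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "is", "what", "how", "about", "it", "are"}

def hqe_reformulate(history: list[str], current: str, num_terms: int = 3) -> str:
    # count candidate expansion terms from the session history that are
    # not stopwords and not already present in the current turn
    tf = Counter(w for q in history for w in q.lower().split()
                 if w not in STOPWORDS and w not in current.lower())
    expansion = [w for w, _ in tf.most_common(num_terms)]
    return current + " " + " ".join(expansion)

session = ["what is throat cancer", "is it treatable"]
print(hqe_reformulate(session, "what are the common symptoms"))
# -> "what are the common symptoms throat cancer treatable" (tie order may vary)
```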