Query Processing
Sequence to Sequence Learning for Query Expansion
Using sequence to sequence algorithms for query expansion has not been explored yet in Information Retrieval literature nor in Question-Answering's. We tried to fill this gap in the literature with a custom Query Expansion engine trained and tested on open datasets. Starting from open datasets, we built a Query Expansion training set using sentence-embeddings-based Keyword Extraction. We therefore assessed the ability of the Sequence to Sequence neural networks to capture expanding relations in the words embeddings' space.
SWRL2SPIN: A tool for transforming SWRL rule bases in OWL ontologies to object-oriented SPIN rules
Semantic Web Rule Language (SWRL) combines OWL (Web Ontology Language) ontologies with Horn Logic rules of the Rule Markup Language (RuleML) family. Being supported by ontology editors, rule engines and ontology reasoners, it has become a very popular choice for developing rule-based applications on top of ontologies. However, SWRL is probably not go-ing to become a WWW Consortium standard, prohibiting industrial acceptance. On the other hand, SPIN (SPARQL Inferencing Notation) has become a de-facto industry standard to rep-resent SPARQL rules and constraints on Semantic Web models, building on the widespread acceptance of SPARQL (SPARQL Protocol and RDF Query Language). In this paper, we ar-gue that the life of existing SWRL rule-based ontology applications can be prolonged by con-verting them to SPIN. To this end, we have developed the SWRL2SPIN tool in Prolog that transforms SWRL rules into SPIN rules, considering the object-orientation of SPIN, i.e. linking rules to the appropriate ontology classes and optimizing them, as derived by analysing the rule conditions.
Learning to Rank Query Graphs for Complex Question Answering over Knowledge Graphs
Maheshwari, Gaurav, Trivedi, Priyansh, Lukovnikov, Denis, Chakraborty, Nilesh, Fischer, Asja, Lehmann, Jens
In this paper, we conduct an empirical investigation of neural query graph ranking approaches for the task of complex question answering over knowledge graphs. We experiment with six different ranking models and propose a novel self-attention based slot matching model which exploits the inherent structure of query graphs, our logical form of choice. Our proposed model generally outperforms the other models on two QA datasets over the DBpedia knowledge graph, evaluated in different settings. In addition, we show that transfer learning from the larger of those QA datasets to the smaller dataset yields substantial improvements, effectively offsetting the general lack of training data.
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
Lee, Yunseong, Scolari, Alberto, Chun, Byung-Gon, Santambrogio, Marco Domenico, Weimer, Markus, Interlandi, Matteo
Machine Learning models are often composed of pipelines of transformations. While this design allows to efficiently execute single model components at training time, prediction serving has different requirements such as low latency, high throughput and graceful performance degradation under heavy load. Current prediction serving systems consider models as black boxes, whereby prediction-time-specific optimizations are ignored in favor of ease of deployment. In this paper, we present PRETZEL, a prediction serving system introducing a novel white box architecture enabling both end-to-end and multi-model optimizations. Using production-like model pipelines, our experiments show that PRETZEL is able to introduce performance improvements over different dimensions; compared to state-of-the-art approaches PRETZEL is on average able to reduce 99th percentile latency by 5.5x while reducing memory footprint by 25x, and increasing throughput by 4.7x.
Datalog: Bag Semantics via Set Semantics
Bertossi, Leopoldo, Gottlob, Georg, Pichler, Reinhard
Duplicates in data management are common and problematic. In this work, we present a translation of Datalog under bag semantics into a well-behaved extension of Datalog (the so-called warded Datalog+-) under set semantics. From a theoretical point of view, this allows us to reason on bag semantics by making use of the well-established theoretical foundations of set semantics. From a practical point of view, this allows us to handle the bag semantics of Datalog by powerful, existing query engines for the required extension of Datalog. Moreover, this translation has the potential for further extensions -- above all to capture the bag semantics of the semantic web query language SPARQL.
MemSQL 6.5: NewSQL with autonomous workload optimization, improved data ingestion and query execution speed
Even though it's just naming conventions, MemSQL is usually thought of as part of the NewSQL lot. That is, databases grounded on the relational SQL model yet adopting some of the "knobs" the NoSQL lot has introduced for people to turn in order to achieve horizontal scalability. It's a balancing act between consistency and performance, which, in the end, applies to any distributed database. NewSQL folks argue you should not have to ditch schema, or SQL, to get the scale-out benefits that NoSQL brings. If you like your schema and SQL, and also like your database to have analytics as well as transactional capabilities, then MemSQL is probably in your radar.
Ontology-Based Query Expansion with Latently Related Named Entities for Semantic Text Search
Traditional information retrieval systems represent documents and queries by keyword sets. However, the content of a document or a query is mainly defined by both keywords and named entities occurring in it. Named entities have ontological features, namely, their aliases, classes, and identifiers, which are hidden from their textual appearance. Besides, the meaning of a query may imply latent named entities that are related to the apparent ones in the query. We propose an ontology-based generalized vector space model to semantic text search. It exploits ontological features of named entities and their latently related ones to reveal the semantics of documents and queries. We also propose a framework to combine different ontologies to take their complementary advantages for semantic annotation and searching. Experiments on a benchmark dataset show better search quality of our model to other ones.
TrQuery: An Embedding-based Framework for Recommanding SPARQL Queries
Zhang, Lijing, Zhang, Xiaowang, Feng, Zhiyong
In this paper, we present an embedding-based framework (TrQuery) for recommending solutions of a SPARQL query, including approximate solutions when exact querying solutions are not available due to incompleteness or inconsistencies of real-world RDF data. Within this framework, embedding is applied to score solutions together with edit distance so that we could obtain more fine-grained recommendations than those recommendations via edit distance. For instance, graphs of two querying solutions with a similar structure can be distinguished in our proposed framework while the edit distance depending on structural difference becomes unable. To this end, we propose a novel score model built on vector space generated in embedding system to compute the similarity between an approximate subgraph matching and a whole graph matching. Finally, we evaluate our approach on large RDF datasets DBpedia and YAGO, and experimental results show that TrQuery exhibits an excellent behavior in terms of both effectiveness and efficiency.
Presto for Data Scientists – SQL on anything
Initially developed by Facebook, Presto is an open source, distributed ANSI SQL query engine that delivers fast analytic queries against various data sources ranging in size from gigabytes to petabytes. For data scientists, this is ideal for returning Big Data query results in seconds, accelerating the iterative nature of data science discoveries by powering dashboards, reporting and ad-hoc analysis. Presto was designed and built from scratch to be a fast SQL query engine. It follows the classic MPP SQL engine design in which query processing is parallelized over a cluster of machines. As a result, highly concurrent queries execute at interactive speeds.
Tight Query Complexity Lower Bounds for PCA via Finite Sample Deformed Wigner Law
Simchowitz, Max, Alaoui, Ahmed El, Recht, Benjamin
We prove a \emph{query complexity} lower bound for approximating the top $r$ dimensional eigenspace of a matrix. We consider an oracle model where, given a symmetric matrix $\mathbf{M} \in \mathbb{R}^{d \times d}$, an algorithm $\mathsf{Alg}$ is allowed to make $\mathsf{T}$ exact queries of the form $\mathsf{w}^{(i)} = \mathbf{M} \mathsf{v}^{(i)}$ for $i$ in $\{1,...,\mathsf{T}\}$, where $\mathsf{v}^{(i)}$ is drawn from a distribution which depends arbitrarily on the past queries and measurements $\{\mathsf{v}^{(j)},\mathsf{w}^{(i)}\}_{1 \le j \le i-1}$. We show that for every $\mathtt{gap} \in (0,1/2]$, there exists a distribution over matrices $\mathbf{M}$ for which 1) $\mathrm{gap}_r(\mathbf{M}) = \Omega(\mathtt{gap})$ (where $\mathrm{gap}_r(\mathbf{M})$ is the normalized gap between the $r$ and $r+1$-st largest-magnitude eigenvector of $\mathbf{M}$), and 2) any algorithm $\mathsf{Alg}$ which takes fewer than $\mathrm{const} \times \frac{r \log d}{\sqrt{\mathtt{gap}}}$ queries fails (with overwhelming probability) to identity a matrix $\widehat{\mathsf{V}} \in \mathbb{R}^{d \times r}$ with orthonormal columns for which $\langle \widehat{\mathsf{V}}, \mathbf{M} \widehat{\mathsf{V}}\rangle \ge (1 - \mathrm{const} \times \mathtt{gap})\sum_{i=1}^r \lambda_i(\mathbf{M})$. Our bound requires only that $d$ is a small polynomial in $1/\mathtt{gap}$ and $r$, and matches the upper bounds of Musco and Musco '15. Moreover, it establishes a strict separation between convex optimization and \emph{randomized}, "strict-saddle" non-convex optimization of which PCA is a canonical example: in the former, first-order methods can have dimension-free iteration complexity, whereas in PCA, the iteration complexity of gradient-based methods must necessarily grow with the dimension.