 Query Processing


Goal-Aware Neural SAT Solver

arXiv.org Artificial Intelligence

Modern neural networks obtain information about the problem and calculate the output solely from the input values. We argue that this is not always optimal, and that the network's performance can be significantly improved by augmenting it with a query mechanism that allows the network to make several solution trials at run time and get feedback on the loss value of each trial. To demonstrate the capabilities of the query mechanism, we formulate an unsupervised (not dependent on labels) loss function for the Boolean Satisfiability Problem (SAT) and theoretically show that it allows the network to extract rich information about the problem. We then propose a neural SAT solver with a query mechanism called QuerySAT and show that it outperforms the neural baseline on a wide range of SAT tasks and the classical baselines on the SHA-1 preimage attack and 3-SAT tasks.
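The abstract does not spell out the loss, so the following is only a minimal sketch of one common differentiable relaxation of SAT, not necessarily the exact QuerySAT formulation. It assumes each variable carries a probability of being true and that clauses are given as lists of signed integers (DIMACS-style literals). The function names `clause_value` and `sat_loss` are hypothetical.

```python
import math

def clause_value(clause, x):
    """Soft truth value of one clause.

    clause: list of nonzero ints; literal k means variable k, -k its negation.
    x: dict mapping variable index -> probability in [0, 1] of being true.
    """
    prob_all_false = 1.0
    for lit in clause:
        p = x[abs(lit)]
        # A positive literal is false with probability 1 - p, a negated one with p.
        prob_all_false *= (1.0 - p) if lit > 0 else p
    return 1.0 - prob_all_false

def sat_loss(clauses, x, eps=1e-9):
    """Label-free loss: negative log of the product of soft clause values.

    eps is an illustrative floor that keeps log() finite for violated clauses.
    """
    return -sum(math.log(max(clause_value(c, x), eps)) for c in clauses)

# (x1 OR NOT x2) AND (x2 OR x3)
clauses = [[1, -2], [2, 3]]
loss_satisfying = sat_loss(clauses, {1: 1.0, 2: 1.0, 3: 0.0})
loss_violating = sat_loss(clauses, {1: 0.0, 2: 1.0, 3: 0.0})
loss_uniform = sat_loss(clauses, {1: 0.5, 2: 0.5, 3: 0.5})
```

Under this relaxation the loss is zero for a satisfying assignment and large when a clause is violated, so a solver can evaluate it on a trial assignment without any labels. That is the kind of feedback a query mechanism could return at run time.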


On Margin-Based Cluster Recovery with Oracle Queries

arXiv.org Machine Learning

We study an active cluster recovery problem where, given a set of $n$ points and an oracle answering queries like "are these two points in the same cluster?", the task is to recover exactly all clusters using as few queries as possible. We begin by introducing a simple but general notion of margin between clusters that captures, as special cases, the margins used in previous work, the classic SVM margin, and standard notions of stability for center-based clusterings. Then, under our margin assumptions we design algorithms that, in a variety of settings, recover all clusters exactly using only $O(\log n)$ queries. For the Euclidean case, $\mathbb{R}^m$, we give an algorithm that recovers arbitrary convex clusters, in polynomial time, and with a number of queries that is lower than the best existing algorithm by $\Theta(m^m)$ factors. For general pseudometric spaces, where clusters might not be convex or might not have any notion of shape, we give an algorithm that achieves the $O(\log n)$ query bound, and is provably near-optimal as a function of the packing number of the space. Finally, for clusterings realized by binary concept classes, we give a combinatorial characterization of recoverability with $O(\log n)$ queries, and we show that, for many concept classes in Euclidean spaces, this characterization is equivalent to our margin condition. Our results show a deep connection between cluster margins and active cluster recoverability.
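To make the query model concrete, here is a minimal sketch of a same-cluster oracle together with a naive baseline that compares each point against one representative per discovered cluster. This is not the paper's algorithm: it uses up to n·k queries for k clusters, whereas the abstract reports O(log n) queries under margin assumptions. The names `recover_clusters`, `same_cluster` and `labels` are hypothetical.

```python
def recover_clusters(points, same_cluster):
    """Naive exact recovery using only pairwise same-cluster queries.

    points: iterable of hashable point identifiers.
    same_cluster: oracle taking two points and returning True or False.
    Returns (clusters, number_of_queries).
    """
    clusters = []
    queries = 0
    for p in points:
        for cluster in clusters:
            queries += 1
            # One query against the cluster's first member decides membership.
            if same_cluster(cluster[0], p):
                cluster.append(p)
                break
        else:
            # No existing cluster matched, so p starts a new one.
            clusters.append([p])
    return clusters, queries

# Hidden ground truth, visible only to the oracle (illustrative data).
labels = {0: "a", 1: "b", 2: "a", 3: "c", 4: "b"}
oracle = lambda u, v: labels[u] == labels[v]

clusters, queries = recover_clusters([0, 1, 2, 3, 4], oracle)
```

Counting the queries this baseline spends shows what the margin-based algorithms aim to reduce.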


Database Reasoning Over Text

arXiv.org Artificial Intelligence

Neural models have shown impressive performance gains in answering queries from natural language text. However, existing works are unable to support database queries, such as "List/Count all female athletes who were born in 20th century", which require reasoning over sets of relevant facts with operations such as join, filtering and aggregation. We show that while state-of-the-art transformer models perform very well for small databases, they exhibit limitations in processing noisy data, numerical operations, and queries that aggregate facts. We propose a modular architecture that answers these database-style queries over multiple spans of text and aggregates the results at scale. We evaluate the architecture using WikiNLDB, a novel dataset for exploring such queries. Our architecture scales to databases containing thousands of facts, whereas contemporary models are limited by how many facts can be encoded. In direct comparison on small databases, our approach increases overall answer accuracy from 85% to 90%. On larger databases, our approach retains its accuracy, whereas transformer baselines could not encode the context.
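As a rough illustration of what the example query requires once facts are already structured, the sketch below applies a filter and then an aggregation (list and count) over a handful of made-up facts. It does not model the paper's architecture, which works from free text. The `Fact` type, the sample records and the function name are all hypothetical, and the 20th century is taken to span 1901 to 2000.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    name: str
    gender: str
    occupation: str
    birth_year: int

# Made-up facts standing in for information extracted from text.
FACTS = [
    Fact("Person A", "female", "athlete", 1985),
    Fact("Person B", "male", "athlete", 1990),
    Fact("Person C", "female", "athlete", 2003),
    Fact("Person D", "female", "physicist", 1950),
    Fact("Person E", "female", "athlete", 1921),
]

def female_athletes_born_in_20th_century(facts):
    """Filter step: keep facts matching every predicate of the query."""
    return [
        f.name
        for f in facts
        if f.gender == "female"
        and f.occupation == "athlete"
        and 1901 <= f.birth_year <= 2000
    ]

names = female_athletes_born_in_20th_century(FACTS)  # "List" variant
count = len(names)                                   # "Count" variant
```

The answer depends on every matching fact rather than on a single text span, which is why such queries are hard for models that can only encode a limited number of facts.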


Database Workload Characterization with Query Plan Encoders

arXiv.org Artificial Intelligence

Smart databases are adopting artificial intelligence (AI) technologies to achieve instance optimality, and in the future, databases will come with prepackaged AI models within their core components. The reason is that every database runs on different workloads and demands specific resources and settings to achieve optimal performance. This prompts the need to comprehensively understand the workloads running in the system along with their features, which we dub workload characterization. To address this workload characterization problem, we propose query plan encoders that learn essential features and their correlations from query plans. Our pretrained encoders capture the structural and the computational performance of queries independently. We show that our pretrained encoders are adaptable to workloads, which expedites the transfer learning process. We performed independent assessments of the structural encoder and the performance encoders with multiple downstream tasks. For the overall evaluation of our query plan encoders, we architect two downstream tasks: (i) query latency prediction and (ii) query classification. These tasks show the importance of feature-based workload characterization. We also performed extensive experiments on individual encoders to verify the effectiveness of representation learning and domain adaptability.


A Unified Transferable Model for ML-Enhanced DBMS

arXiv.org Artificial Intelligence

Recently, the database management system (DBMS) community has witnessed the power of machine learning (ML) solutions for DBMS tasks. Despite their promising performance, these existing solutions can hardly be considered satisfactory. First, these ML-based methods in DBMS are not effective enough because they are optimized on each specific task, and cannot explore or understand the intrinsic connections between tasks. Second, the training process has serious limitations that hinder their practicality, because they need to retrain the entire model from scratch for a new DB. Moreover, for each retraining, they require an excessive amount of training data, which is very expensive to acquire and unavailable for a new DB. We propose to explore the transferability of the ML methods both across tasks and across DBs to tackle these fundamental drawbacks. In this paper, we propose a unified model, MTMLF, that uses a multi-task training procedure to capture the transferable knowledge across tasks and a pretrain-finetune procedure to distill the transferable meta knowledge across DBs. We believe this paradigm is more suitable for cloud DB services and has the potential to revolutionize the way ML is used in DBMS. Furthermore, to demonstrate the predictive power and viability of MTMLF, we provide a concrete and very promising case study on query optimization tasks. Last but not least, we discuss several concrete research opportunities along this line of work.


One Model to Rule them All: Towards Zero-Shot Learning for Databases

arXiv.org Artificial Intelligence

In this paper, we present our vision of so called zero-shot learning for databases which is a new learning approach for database components. Zero-shot learning for databases is inspired by recent advances in transfer learning of models such as GPT-3 and can support a new database out-of-the box without the need to train a new model. As a first concrete contribution in this paper, we show the feasibility of zero-shot learning for the task of physical cost estimation and present very promising initial results. Moreover, as a second contribution we discuss the core challenges related to zero-shot learning for databases and present a roadmap to extend zero-shot learning towards many other tasks beyond cost estimation or even beyond classical database systems and workloads.

And unfortunately, the training data collection needs to be repeated for every new database that needs to be supported. To reduce the high cost of training data collection, reinforcement learning (RL) has been used to execute training queries [10, 17, 18, 34] in a more targeted manner (i.e., letting the RL agent decide which queries to execute next). However, even with reinforcement learning still a large amount of training queries needs to be executed for learning a model. Moreover, training the model is not a onetime effort since similar to workload-driven approaches the learning procedure needs to be repeated for every new database at hand. A different direction that has thus been proposed to avoid the expensive training data collection by running queries on a new database are so called data-driven approaches [11, 31, 32] that learn


Ahana Cloud for Presto review: Fast SQL queries against data lakes

#artificialintelligence

Hope springs eternal in the database business. While we're still hearing about data warehouses (fast analysis databases, typically featuring in-memory columnar storage) and tools that improve the ETL step (extract, transform, and load), we're also hearing about improvements in data lakes (which store data in its native format) and data federation (on-demand data integration of heterogeneous data stores). Presto keeps coming up as a fast way to perform SQL queries on big data that resides in data lake files. Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes. Presto allows querying data where it lives, including Hive, Cassandra, relational databases, and proprietary data stores.


INODE: Building an End-to-End Data Exploration System in Practice [Extended Vision]

arXiv.org Artificial Intelligence

A full-fledged data exploration system must combine different access modalities with a powerful concept of guiding the user in the exploration process, by being reactive and anticipative both for data discovery and for data linking. Such systems are a real opportunity for our community to cater to users with different domain and data science expertise. We introduce INODE -- an end-to-end data exploration system -- that leverages, on the one hand, Machine Learning and, on the other hand, semantics for the purpose of Data Management (DM). Our vision is to develop a classic unified, comprehensive platform that provides extensive access to open datasets, and we demonstrate it in three significant use cases in the fields of Cancer Biomarker Research, Research and Innovation Policy Making, and Astrophysics. INODE offers sustainable services in (a) data modeling and linking, (b) integrated query processing using natural language, (c) guidance, and (d) data exploration through visualization, thus facilitating the user in discovering new insights. We demonstrate that our system is uniquely accessible to a wide range of users from larger scientific communities to the public. Finally, we briefly illustrate how this work paves the way for new research opportunities in DM.


The RLR-Tree: A Reinforcement Learning Based R-Tree for Spatial Data

arXiv.org Artificial Intelligence

Learned indices have been proposed to replace classic index structures like B-Tree with machine learning (ML) models. They require to replace both the indices and query processing algorithms currently deployed by the databases, and such a radical departure is likely to encounter challenges and obstacles. In contrast, we propose a fundamentally different way of using ML techniques to improve on the query performance of the classic R-Tree without the need of changing its structure or query processing algorithms.

Despite the success of these learned indices in improving the performance of some types of queries, they still have various limitations, e.g., they can only handle spatial point objects and limited types of spatial queries, some only return approximate query results, and they either cannot handle updates or need a periodic rebuild to retain high query efficiency (Detailed discussions are in Section 2). These limitations, together with the requirement that the learned indices need a replacement of the index structures and