 database engine


Reinforcing Code Generation: Improving Text-to-SQL with Execution-Based Learning

arXiv.org Artificial Intelligence

In this work, we study the problem of code generation with a large language model (LLM), with a focus on generating SQL queries from natural language questions. We ask: Instead of using supervised fine-tuning with text-code pairs, can we tune a model by having it interact with a database engine? We frame this problem as a reinforcement learning problem where the model receives execution-based feedback from the environment in the form of scalar rewards. These rewards penalize execution failures and assign positive values when a query returns a correct answer. We use the rewards within the Group Relative Policy Optimization (GRPO) framework. We use a tabular reasoning benchmark to test and evaluate our findings. We find that with only weak supervision in the form of question-answer pairs, RL-tuning improves the accuracy of model-generated SQL code from 31.49 to 49.83 while reducing error percentage from 25.43% to 14.71%. This improvement allowed the model to nearly match the performance of the larger SQLCoder-70B model. Our work demonstrates the potential of using execution-based feedback to improve symbolic reasoning capabilities of LLMs.
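The reward scheme described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the reward values (-1, 0, +1), the helper names `execution_reward` and `group_relative_advantages`, the toy `city` table, and the use of SQLite as the database engine are all assumptions made for illustration. The sketch scores a group of candidate queries by executing them, then standardizes the rewards within the group, which is the usual way GRPO-style methods derive per-sample advantages.

```python
import sqlite3
from collections import Counter
from statistics import mean, pstdev


def execution_reward(conn, sql, expected_rows,
                     fail_penalty=-1.0, wrong_reward=0.0, correct_reward=1.0):
    """Scalar reward from executing a candidate query (values are assumptions)."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        # The query failed to execute: penalize.
        return fail_penalty
    # Compare results as multisets so row order does not matter.
    if Counter(rows) == Counter(expected_rows):
        return correct_reward
    return wrong_reward


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical toy database standing in for the environment.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [("Oslo", 700000), ("Bergen", 290000)])

# A group of sampled queries for "Which cities have more than 500,000 people?"
candidates = [
    "SELECT name FROM city WHERE population > 500000",         # correct
    "SELECT name FROM city",                                   # runs, wrong answer
    "SELEC name FROM city",                                    # syntax error
    "SELECT name FROM city ORDER BY population DESC LIMIT 1",  # correct
]
expected = [("Oslo",)]  # weak supervision: only the answer is known

rewards = [execution_reward(conn, q, expected) for q in candidates]
advantages = group_relative_advantages(rewards)
```

Only the question-answer pair is needed to compute these rewards; no reference SQL is involved, which matches the weak-supervision setting the abstract describes. In a full training loop the advantages would weight the policy-gradient update for each sampled query.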


Is Large Language Model Good at Database Knob Tuning? A Comprehensive Experimental Evaluation

arXiv.org Artificial Intelligence

Knob tuning plays a crucial role in optimizing databases by adjusting knobs to enhance database performance. However, traditional tuning methods often follow a Try-Collect-Adjust approach, proving inefficient and database-specific. Moreover, these methods are often opaque, making it challenging for DBAs to grasp the underlying decision-making process. Large language models (LLMs) like GPT-4 and Claude-3 have excelled in complex natural language tasks, yet their potential in database knob tuning remains largely unexplored. This study harnesses LLMs as experienced DBAs for knob-tuning tasks with carefully designed prompts. We identify three key subtasks in the tuning system: knob pruning, model initialization, and knob recommendation, proposing LLM-driven solutions to replace conventional methods for each subtask. We conduct extensive experiments to compare LLM-driven approaches against traditional methods across the subtasks to evaluate LLMs' efficacy in the knob tuning domain. Furthermore, we explore the adaptability of LLM-based solutions in diverse evaluation settings, encompassing new benchmarks, database engines, and hardware environments. Our findings reveal that LLMs not only match or surpass traditional methods but also exhibit notable interpretability by generating responses in a coherent "chain-of-thought" manner. We further observe that LLMs exhibit remarkable generalizability through simple adjustments in prompts, eliminating the necessity for additional training or extensive code modifications. Drawing insights from our experimental findings, we identify several opportunities for future research aimed at advancing the utilization of LLMs in the realm of database management.


LLMTune: Accelerate Database Knob Tuning with Large Language Models

arXiv.org Artificial Intelligence

Database knob tuning is a critical challenge in the database community, aiming to optimize knob values to enhance database performance for specific workloads. DBMS often feature hundreds of tunable knobs, posing a significant challenge for DBAs to recommend optimal configurations. Consequently, many machine learning-based tuning methods have been developed to automate this process. Despite the introduction of various optimizers, practical applications have unveiled a new problem: they typically require numerous workload runs to achieve satisfactory performance, a process that is both time-consuming and resource-intensive. This inefficiency largely stems from the optimal configuration often being substantially different from the default setting, necessitating multiple iterations during tuning. Recognizing this, we argue that an effective starting point could significantly reduce redundant exploration in less efficient areas, thereby potentially speeding up the tuning process for the optimizers. Based on this assumption, we introduce LLMTune, a large language model-based configuration generator designed to produce an initial, high-quality configuration for new workloads. These generated configurations can then serve as starting points for various base optimizers, accelerating their tuning processes. To obtain training data for LLMTune's supervised fine-tuning, we have devised a new automatic data generation framework capable of efficiently creating a large number of pairs. We have conducted thorough experiments to evaluate LLMTune's effectiveness with different workloads, such as TPC-H and JOB. In comparison to leading methods, LLMTune demonstrates a quicker ability to identify superior configurations. For instance, with the challenging TPC-H workload, our LLMTune achieves a significant 15.6x speed-up ratio in finding the best-performing configurations.


The Seattle Report on Database Research

Communications of the ACM

From the inception of the field, academic database research has strongly influenced the database industry and vice versa. The database community, both research and industry, has grown substantially over the years. The relational database market alone has revenue upwards of $50B. On the academic front, database researchers continue to be recognized with significant awards. Over the last decade, our research community pioneered the use of columnar storage, which is used in all commercial data analytic platforms. Database systems offered as cloud services have witnessed explosive growth. Hybrid transactional/analytical processing (HTAP) systems are now an important segment of the industry. Furthermore, memory-optimized data structures, modern compilation, and code-generation have significantly enhanced performance of traditional database engines. All data platforms have embraced SQL-style APIs as the predominant way to query and retrieve data. Database researchers have played an important part in influencing the evolution of streaming data platforms as well as distributed key-value stores. A new generation of data cleaning and data wrangling technology is being actively explored.


Rob Mellor, WhereScape: On data warehouse automation

#artificialintelligence

Leading analysts and organisations have begun recognising data warehouse automation as being key to running a truly data-driven business. AI News caught up with Rob Mellor, GM & VP, EMEA at WhereScape, to discuss this industry shift. AI News: Only earlier this year did Gartner really begin recognising data warehouse automation after publishing a paper on the subject. Is this indicative of a shift in how companies view automation? Rob Mellor: At WhereScape, we feel the increased recent activity from Gartner around data warehouse automation is reflective of an industry shift.


Feature Stores need an HTAP Database

#artificialintelligence

A Feature Store is a collection of organized and curated features used for training and serving Machine Learning models. Keeping them up to date, serving feature vectors, and creating training data sets requires a combination of transactional (OLTP) and analytical (OLAP) database processing. This kind of mixed workload database is called HTAP for hybrid transactional analytical processing. The most useful Feature Stores incorporate data pipelines that continuously keep their features up to date through either batch or real-time processing that matches the cadence of the source data. Since these features are always up to date, they provide an ideal source of feature vectors used for inferencing.


Do the numbers, Einstein: AI is more than maths as some know it

@machinelearnbot

Microsoft arrived on the graph-database scene last month. Already on that scene are Neo4J, MarkLogic, Oracle, SAP and Teradata - among others. Driving Microsoft, like those before, is the desire to connect - to establish connections between things and derive some kind of gain. Those "things" could be people, "likes", online sales – tech firms are almost literally trying to connect the dots or, as they like them to be called, "nodes." The new thing is Artificial Intelligence and the Machine Learning that gets us there.


Top NoSQL Database Engines

@machinelearnbot

I am not a fan of the term NoSQL. Many others are, however, and it has become a permanent part of the collective data storage nomenclature, meant to describe schema-less, non-relational data storage schemes. NoSQL is an umbrella term, one which encompasses a number of different technologies. These different technologies aren't even necessarily related in any way beyond the single defining characteristic of NoSQL: they are not relational in nature; for right or wrong, Structured Query Language (SQL) has become conflated with relational database management systems over the years. So, while I am not personally a fan of the term NoSQL, I can appreciate why others are, given that it quickly implies what it is we are talking about by explicitly stating what we are not talking about.


Real-time machine learning on globally-distributed data with Apache Spark and DocumentDB

#artificialintelligence

At the Strata Hadoop World 2017 Conference in San Jose, we announced the Spark to DocumentDB Connector. It enables real-time data science, machine learning, and exploration over globally distributed data in Azure DocumentDB. Connecting Apache Spark to Azure DocumentDB accelerates our customers' ability to solve fast-moving data science problems, where data can be quickly persisted and queried using DocumentDB. The Spark to DocumentDB connector efficiently exploits the native DocumentDB managed indexes and enables updateable columns when performing analytics and push-down predicate filtering against fast-changing, globally distributed data, in scenarios ranging from IoT to data science and analytics. The Spark to DocumentDB connector uses the Azure DocumentDB Java SDK.