database system


Fast Factorized Learning: Powered by In-Memory Database Systems

Stöckl, Bernhard, Schüle, Maximilian E.

arXiv.org Artificial Intelligence

Learning models over factorized joins avoids redundant computations by identifying and pre-computing shared cofactors. Previous work has investigated the performance gain of computing cofactors on traditional disk-based database systems, but in the absence of published code, those experiments could not be reproduced on in-memory database systems. This work describes an implementation that uses cofactors for in-database factorized learning. We benchmark our open-source implementation for learning linear regression on factorized joins with PostgreSQL -- a disk-based database system -- and HyPer -- an in-memory engine. The evaluation shows that factorized learning on in-memory database systems is 70% faster than non-factorized learning and faster by a factor of 100 than on disk-based database systems. Thus, modern database engines can contribute to the machine learning pipeline by pre-computing aggregates prior to data extraction to accelerate training.
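
The shared-cofactor idea can be illustrated with a toy aggregate over a two-table join; the tables and function names below are invented, and real cofactor computation covers the full matrix of sufficient statistics for linear regression, not a single entry:

```python
from collections import defaultdict

# Toy tables: R(key, x1) and S(key, x2, y). Contents are illustrative,
# not taken from the paper's benchmark.
R = [(1, 2.0), (1, 3.0), (2, 1.0)]
S = [(1, 5.0, 10.0), (2, 4.0, 6.0), (2, 7.0, 9.0)]

def cofactors_via_join(R, S):
    # Baseline: materialize the join, then sum x1*y (one cofactor entry).
    total = 0.0
    for k_r, x1 in R:
        for k_s, x2, y in S:
            if k_r == k_s:
                total += x1 * y
    return total

def cofactors_factorized(R, S):
    # Factorized: aggregate each side per join key first, then combine.
    # SUM over the join of x1*y equals, per key k,
    # (SUM of x1 in R with key k) * (SUM of y in S with key k).
    sum_x1 = defaultdict(float)
    for k, x1 in R:
        sum_x1[k] += x1
    sum_y = defaultdict(float)
    for k, x2, y in S:
        sum_y[k] += y
    return sum(sum_x1[k] * sum_y[k] for k in sum_x1.keys() & sum_y.keys())
```

Both functions compute the same aggregate, but the factorized variant never enumerates the join result, which is where the reported speedups come from.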


Bootstrapping Learned Cost Models with Synthetic SQL Queries

Nidd, Michael, Miksovic, Christoph, Gschwind, Thomas, Fusco, Francesco, Giovannini, Andrea, Giurgiu, Ioana

arXiv.org Artificial Intelligence

Having access to realistic workloads for a given database instance is extremely important to enable stress and vulnerability testing, as well as to optimize for cost and performance. Recent advances in learned cost models have shown that when enough diverse SQL queries are available, one can effectively and efficiently predict the cost of running a given query against a specific database engine. In this paper, we describe our experience in exploiting modern synthetic data generation techniques, inspired by the generative AI and LLM community, to create high-quality datasets enabling the effective training of such learned cost models. Initial results show that we can improve a learned cost model's predictive accuracy by training it with 45% fewer queries than when using competitive generation approaches.
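
A deliberately naive sketch of workload synthesis over a toy schema (the schema, templates, and constants are invented; the paper's generator is LLM-inspired and produces far more diverse queries):

```python
import random

# Hypothetical toy schema for illustration only.
SCHEMA = {
    "orders": ["id", "customer_id", "total", "created_at"],
    "customers": ["id", "name", "country"],
}
OPS = ["=", ">", "<"]

def random_query(rng):
    """Emit one random single-table SELECT with a WHERE predicate."""
    table = rng.choice(sorted(SCHEMA))
    cols = SCHEMA[table]
    proj = rng.sample(cols, k=rng.randint(1, len(cols)))
    col = rng.choice(cols)
    op = rng.choice(OPS)
    return f"SELECT {', '.join(proj)} FROM {table} WHERE {col} {op} 42"

rng = random.Random(0)  # seeded for reproducible workloads
queries = [random_query(rng) for _ in range(5)]
```

Training a cost model on such queries (paired with measured execution costs) is the downstream step; the paper's contribution is making this generated pool high-quality enough that fewer queries suffice.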


DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs

Zhou, Wei, Sun, Peng, Zhou, Xuanhe, Zang, Qianglei, Xu, Ji, Zhang, Tieying, Li, Guoliang, Wu, Fan

arXiv.org Artificial Intelligence

The operation and maintenance (O&M) of database systems is critical to ensuring system availability and performance, typically requiring expert experience (e.g., identifying metric-to-anomaly relations) for effective diagnosis and recovery. However, existing automatic database O&M methods, including commercial products, cannot effectively utilize expert experience. On the one hand, rule-based methods only support basic O&M tasks (e.g., metric-based anomaly detection), which are mostly numerical equations and cannot effectively incorporate literal O&M experience (e.g., troubleshooting guidance in manuals). On the other hand, LLM-based methods, which retrieve fragmented information (e.g., standard documents + RAG), often generate inaccurate or generic results. To address these limitations, we present DBAIOps, a novel hybrid database O&M system that combines reasoning LLMs with knowledge graphs to achieve DBA-style diagnosis. First, DBAIOps introduces a heterogeneous graph model for representing the diagnosis experience, and proposes a semi-automatic graph construction algorithm to build the graph from thousands of documents. Second, DBAIOps develops a collection of (800+) reusable anomaly models that identify both directly alerted metrics and implicitly correlated experience and metrics. Third, for each anomaly, DBAIOps proposes a two-stage graph evolution mechanism to explore relevant diagnosis paths and identify missing relations automatically. It then leverages a reasoning LLM (e.g., DeepSeek-R1) to infer root causes and generate clear diagnosis reports for both DBAs and common users. Our evaluation over four mainstream database systems (Oracle, MySQL, PostgreSQL, and DM8) demonstrates that DBAIOps outperforms state-of-the-art baselines, achieving 34.85% and 47.22% higher accuracy in root cause identification and human evaluation, respectively.
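
A miniature of the diagnosis-path exploration can be sketched as follows; the node names and edges are invented, and DBAIOps' real anomaly models and two-stage graph evolution are far richer than a plain traversal:

```python
from collections import deque

# Hypothetical slice of a heterogeneous diagnosis graph: edges link
# alerted metrics to correlated metrics and candidate root causes.
GRAPH = {
    "metric:cpu_util_high": ["anomaly:runaway_query"],
    "anomaly:runaway_query": ["cause:missing_index", "metric:buffer_miss"],
    "metric:buffer_miss": ["cause:undersized_buffer_pool"],
}

def diagnosis_paths(start, graph):
    """BFS from an alerted metric, collecting paths that end at a root cause."""
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node.startswith("cause:"):
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # guard against cycles
                queue.append(path + [nxt])
    return paths
```

In DBAIOps the paths found this way are handed to a reasoning LLM, which ranks candidate root causes and writes the diagnosis report.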


An advanced AI driven database system

Tedeschi, M., Rizwan, S., Shringi, C., Chandgir, V. Devram, Belich, S.

arXiv.org Artificial Intelligence

Contemporary database systems, while effective, suffer from severe complexity and usability issues, especially for individuals who lack technical expertise or are unfamiliar with query languages such as Structured Query Language (SQL). This paper presents a new database system supported by Artificial Intelligence (AI), which is intended to improve the management of data using intuitive interfaces based on natural language processing (NLP), together with the automatic creation of structured queries and semi-structured data formats such as YAML Ain't Markup Language (YAML), JavaScript Object Notation (JSON), and application programming interface (API) documentation. The system is intended to strengthen the potential of databases through the integration of Large Language Models (LLMs) and advanced machine learning algorithms. The integration is intended to enable the automation of fundamental tasks such as data modeling, schema creation, query comprehension, and performance optimization. We present in this paper a system that aims to alleviate the main problems with current database technologies: it is meant to reduce the need for technical skills, the manual tuning required for good performance, and the potential for human error. The AI database employs generative schema inference and format selection to build its schema models and execution formats.
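
The schema-inference step can be illustrated, at a much lower level of sophistication, by a rule-based sketch that derives a column-type map from sample records (the function and type names are invented; the paper's system uses LLMs rather than fixed rules):

```python
import json

def infer_schema(records):
    """Map each key seen in the records to a SQL-ish column type."""
    schema = {}
    for rec in records:
        for key, value in rec.items():
            if isinstance(value, bool):   # bool before int: bool is an int subclass
                t = "BOOLEAN"
            elif isinstance(value, int):
                t = "INTEGER"
            elif isinstance(value, float):
                t = "REAL"
            else:
                t = "TEXT"
            schema.setdefault(key, t)
    return schema

records = json.loads('[{"name": "Ada", "age": 36}, {"name": "Alan", "active": true}]')
```

A generative system would additionally choose between output formats (relational schema, YAML, JSON) based on the data's shape, which this sketch does not attempt.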


In-Context Adaptation to Concept Drift for Learned Database Operations

Zhu, Jiaqi, Cai, Shaofeng, Shen, Yanyan, Chen, Gang, Deng, Fang, Ooi, Beng Chin

arXiv.org Artificial Intelligence

Machine learning has demonstrated transformative potential for database operations, such as query optimization and in-database data analytics. However, dynamic database environments, characterized by frequent updates and evolving data distributions, introduce concept drift, which leads to performance degradation for learned models and limits their practical applicability. Addressing this challenge requires efficient frameworks capable of adapting to shifting concepts while minimizing the overhead of retraining or fine-tuning. In this paper, we propose FLAIR, an online adaptation framework that introduces a new paradigm called in-context adaptation for learned database operations. FLAIR leverages the inherent property of data systems, i.e., the immediate availability of execution results for predictions, to enable dynamic context construction. By formalizing adaptation as f: (x | C_t) → y, with C_t representing a dynamic context memory, FLAIR delivers predictions aligned with the current concept, eliminating the need for runtime parameter optimization. To achieve this, FLAIR integrates two key modules: a Task Featurization Module for encoding task-specific features into standardized representations, and a Dynamic Decision Engine, pre-trained via Bayesian meta-training, to adapt seamlessly using contextual information at runtime. Extensive experiments across key database tasks demonstrate that FLAIR outperforms state-of-the-art baselines, achieving up to 5.2x faster adaptation and reducing error by 22.5% for cardinality estimation.
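
The formulation f: (x | C_t) → y can be illustrated with a deliberately tiny sketch: a nearest-neighbor predictor whose only "adaptation" is a sliding context memory of recent execution feedback (FLAIR's actual decision engine is a meta-trained neural model; everything below is invented for illustration):

```python
from collections import deque

class ContextPredictor:
    """Predict y for x by averaging the k nearest examples in context memory C_t.
    No parameters are ever updated; adaptation happens purely through C_t."""

    def __init__(self, capacity=4):
        self.context = deque(maxlen=capacity)  # C_t: oldest feedback is evicted

    def observe(self, x, y):
        # Execution results are immediately available in a data system,
        # so each (input, true outcome) pair can enter the context.
        self.context.append((x, y))

    def predict(self, x, k=2):
        if not self.context:
            return 0.0
        nearest = sorted(self.context, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
```

Because old feedback ages out of the bounded memory, predictions track the current concept after drift without any retraining, which is the behavior the in-context paradigm formalizes.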


Technical Perspective: Ad Hoc Transactions: What They Are and Why We Should Care

Communications of the ACM

Most database research papers are prescriptive. They identify a technical problem and show us how to solve it. Other papers follow a different path; they are descriptive rather than prescriptive. They tell us how data systems behave in practice and how they are actually used. They employ a different set of tools, such as surveys, software analyses, or user studies.


SQL4NN: Validation and expressive querying of models as data

Gerarts, Mark, Steegmans, Juno, Bussche, Jan Van den

arXiv.org Artificial Intelligence

Any serious machine learning project will quickly produce a multitude of models learned from data. These models are then validated, tested in different ways, modified, retrained, and archived or deployed. This multitude of models also constitutes valuable data in itself. We may refer to such data as intensional, in contrast to what we already know as extensional data: training data, background data, data for validation, etc. Our point of view in this paper is that models are data too and should be managed using database technology, just like the extensional data. In particular, we should be able to query models. Indeed, many tasks that we usually consider as validation, analysis, explanation, verification, pruning, etc., of models [Rud19, Alb21, LAL
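
The "models are data" viewpoint can be made concrete with a toy relational encoding of network weights in SQLite; the table layout and model names are invented and are not SQL4NN's actual encoding:

```python
import sqlite3

# Store each weight of each archived model as one relational row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weight(model TEXT, layer INT, i INT, j INT, val REAL)")
rows = [
    ("mlp-v1", 0, 0, 0, 0.5), ("mlp-v1", 0, 0, 1, -1.2),
    ("mlp-v1", 1, 0, 0, 2.0), ("mlp-v2", 0, 0, 0, 0.1),
]
conn.executemany("INSERT INTO weight VALUES (?, ?, ?, ?, ?)", rows)

# An example "model validation" query over intensional data:
# the largest absolute weight per archived model.
result = conn.execute(
    "SELECT model, MAX(ABS(val)) FROM weight GROUP BY model ORDER BY model"
).fetchall()
```

Once weights live in tables, pruning candidates, weight statistics, or cross-model comparisons become ordinary SQL instead of ad hoc scripts, which is the management benefit the abstract argues for.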


Security Threats in Agentic AI System

Khan, Raihan, Sarkar, Sayak, Mahata, Sainik Kumar, Jose, Edwin

arXiv.org Artificial Intelligence

Artificial Intelligence (AI) agents have become increasingly prevalent in various applications, from virtual assistants to complex data analysis systems. However, their direct access to databases raises significant concerns regarding privacy and security. This paper examines these critical issues, focusing on the potential risks posed by unrestricted AI access to sensitive data. The rapid advancement of AI technologies has resulted in systems capable of processing vast amounts of data and generating human-like responses. While this progress has provided numerous benefits, it has also introduced new challenges in ensuring data privacy and security. AI agents with direct access to databases may inadvertently expose confidential information, or they may be exploited by malicious actors to access or manipulate sensitive data. Additionally, AI systems' ability to analyze large datasets increases the risk of unintended privacy violations, making them prime targets for attacks aimed at extracting or misusing data. This paper explores the current landscape of AI agent interactions with databases and analyzes the associated risks. It discusses the potential threats to privacy protection and data security as AI agents become more integrated into various applications.


QirK: Question Answering via Intermediate Representation on Knowledge Graphs

Scheerer, Jan Luca, Lykov, Anton, Kayali, Moe, Fountalis, Ilias, Olteanu, Dan, Vasiloglou, Nikolaos, Suciu, Dan

arXiv.org Artificial Intelligence

We demonstrate QirK, a system for answering natural language questions on Knowledge Graphs (KG). QirK can answer structurally complex questions that are still beyond the reach of emerging Large Language Models (LLMs). It does so using a unique combination of database technology, LLMs, and semantic search over vector embeddings. The glue for these components is an intermediate representation (IR). The input question is mapped to the IR using LLMs; the result is then repaired into a valid relational database query with the aid of semantic search on vector embeddings. This allows a practical synthesis of LLM capabilities and KG reliability. A short video demonstrating QirK is available at https://youtu.be/6c81BLmOZ0U.
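
The repair step can be caricatured with string similarity standing in for semantic search over embeddings (the relation vocabulary and IR triples are invented):

```python
from difflib import get_close_matches

# The KG's actual relation vocabulary; LLM output must be repaired onto it.
KG_RELATIONS = ["directed_by", "acted_in", "released_in", "born_in"]

def repair_ir(triples, relations):
    """Snap each LLM-emitted relation to the closest valid KG relation.
    QirK uses semantic search over vector embeddings here; difflib is a
    crude stand-in to show the shape of the repair."""
    repaired = []
    for subj, rel, obj in triples:
        match = get_close_matches(rel, relations, n=1, cutoff=0.0)
        repaired.append((subj, match[0], obj))
    return repaired

# The LLM emitted "director" and "acting"; neither is a valid KG relation.
ir = [("?film", "director", "Nolan"), ("?actor", "acting", "?film")]
```

After repair, every relation in the IR is guaranteed to exist in the KG, so the subsequent translation to a relational query cannot fail on hallucinated predicates.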


In-Database Data Imputation

Perini, Massimo, Nikolic, Milos

arXiv.org Artificial Intelligence

Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making. Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates (e.g., mean), are computationally efficient but may introduce bias and disrupt variable relationships, leading to inaccurate analyses. Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time, limiting their applicability to small datasets. This work enables efficient, high-quality, and scalable data imputation within a database system using the widely used MICE method. We adapt this method to exploit computation sharing and a ring abstraction for faster model training. To impute both continuous and categorical values, we develop techniques for in-database learning of stochastic linear regression and Gaussian discriminant analysis models. Our MICE implementations in PostgreSQL and DuckDB outperform alternative MICE implementations and model-based imputation techniques by up to two orders of magnitude in terms of computation time, while maintaining high imputation quality.
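
A bare-bones sketch of MICE-style chained imputation, with the stochastic draws omitted and plain least squares in pure Python (the paper's in-database version shares computation via a ring abstraction and also handles categorical values, none of which is modeled here):

```python
def ols(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mice(rows, n_iter=20):
    """Impute None cells in two-column rows by chained regressions."""
    data = [list(r) for r in rows]
    miss = {(i, c) for i, r in enumerate(data) for c in (0, 1) if r[c] is None}
    for c in (0, 1):  # initialize missing cells with the column mean
        obs = [r[c] for i, r in enumerate(data) if (i, c) not in miss]
        mean = sum(obs) / len(obs)
        for i, cc in miss:
            if cc == c:
                data[i][c] = mean
    for _ in range(n_iter):  # chained equations: regress, re-impute, repeat
        for c in (0, 1):
            other = 1 - c
            pairs = [(r[other], r[c]) for i, r in enumerate(data) if (i, c) not in miss]
            a, b = ols([p[0] for p in pairs], [p[1] for p in pairs])
            for i, cc in miss:
                if cc == c:
                    data[i][c] = a + b * data[i][other]
    return data

# Two columns related by y = 2x, with one missing value in each column.
rows = [[1, 2], [2, 4], [3, 6], [4, None], [None, 10]]
imputed = mice(rows)
```

On this linear toy data the chained iterations converge to the values the relationship implies (8 and 5), illustrating why model-based imputation preserves variable relationships where mean imputation would not.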