Goto

Collaborating Authors

 database structure


DB-Explore: Automated Database Exploration and Instruction Synthesis for Text-to-SQL

arXiv.org Artificial Intelligence

Recent text-to-SQL systems powered by large language models (LLMs) have demonstrated remarkable performance in translating natural language queries into SQL. However, these systems often struggle with complex database structures and domain-specific queries, as they primarily focus on enhancing logical reasoning and SQL syntax while overlooking the critical need for comprehensive database understanding. To address this limitation, we propose DB-Explore, a novel framework that systematically aligns LLMs with database knowledge through automated exploration and instruction synthesis. DB-Explore constructs database graphs to capture complex relational schemas, leverages GPT-4 to systematically mine structural patterns and semantic knowledge, and synthesizes instructions to distill this knowledge for efficient fine-tuning of LLMs. Our framework enables comprehensive database understanding through diverse sampling strategies and automated instruction generation, bridging the gap between database structures and language models. Experiments conducted on the SPIDER and BIRD benchmarks validate the effectiveness of DB-Explore, achieving an execution accuracy of 52.1% on BIRD and 84.0% on SPIDER. Notably, our open-source implementation, based on the Qwen2.5-coder-7B model, outperforms multiple GPT-4-driven text-to-SQL systems in comparative evaluations, and achieves near state-of-the-art performance with minimal computational cost.


Automatic database description generation for Text-to-SQL

arXiv.org Artificial Intelligence

In the context of the Text-to-SQL task, table and column descriptions are crucial for bridging the gap between natural language and database schema. This report proposes a method for automatically generating effective database descriptions when explicit descriptions are unavailable. The proposed method employs a dual-process approach: a coarse-to-fine process, followed by a fine-to-coarse process. The coarse-to-fine approach leverages the inherent knowledge of LLM to guide the understanding process from databases to tables and finally to columns. This approach provides a holistic understanding of the database structure and ensures contextual alignment. Conversely, the fine-to-coarse approach starts at the column level, offering a more accurate and nuanced understanding when stepping back to the table level. Experimental results on the Bird benchmark indicate that using descriptions generated by the proposed improves SQL generation accuracy by 0.93\% compared to not using descriptions, and achieves 37\% of human-level performance. The source code is publicly available at https://github.com/XGenerationLab/XiYan-DBDescGen.


Natural Language Query Engine for Relational Databases using Generative AI

arXiv.org Artificial Intelligence

The growing reliance on data-driven decision-making highlights the need for more intuitive ways to access and analyze information stored in relational databases. However, the requirement of SQL knowledge has long been a significant barrier for non-technical users. This article introduces an innovative solution that leverages Generative AI to bridge this gap, enabling users to query databases using natural language. Our approach automatically translates natural language queries into SQL, ensuring both syntactic and semantic correctness, while also generating clear, natural language responses from the retrieved data. By streamlining the interaction between users and databases, this method empowers individuals without technical expertise to engage with data directly and efficiently, democratizing access to valuable insights and enhancing productivity.


Toward a Flexible Metadata Pipeline for Fish Specimen Images

arXiv.org Artificial Intelligence

Flexible metadata pipelines are crucial for supporting the FAIR data principles. Despite this need, researchers seldom report their approaches for identifying metadata standards and protocols that support optimal flexibility. This paper reports on an initiative targeting the development of a flexible metadata pipeline for a collection containing over 300,000 digital fish specimen images, harvested from multiple data repositories and fish collections. The images and their associated metadata are being used for AI-related scientific research involving automated species identification, segmentation and trait extraction. The paper provides contextual background, followed by the presentation of a four-phased approach involving: 1. Assessment of the Problem, 2. Investigation of Solutions, 3. Implementation, and 4. Refinement. The work is part of the NSF Harnessing the Data Revolution, Biology Guided Neural Networks (NSF/HDR-BGNN) project and the HDR Imageomics Institute. An RDF graph prototype pipeline is presented, followed by a discussion of research implications and conclusion summarizing the results.


Artificial Intelligence Basics for Senior Executives - Which-50

#artificialintelligence

Artificial intelligence has evolved to become one of the most overused and misunderstood terms in business while also offering the potential to be the driving force in business decision-making, automation, and scalability. Now is the time to develop strategies that set your business up for success for years to come. This article attempts to shed light on the origins, definition, types, business applications, and how senior executives can approach introducing AI to their business. Artificial intelligence (AI) is a broad term and describes technology's ability to perform intellectual tasks typically only performed by humans. Technically speaking, a spreadsheet that helps calculate insurance rates based on a range of inputs can be classified as AI.


Generating relevant explanation: Natural language responses to questions about database structure

Classics

If a generation system is to produce text in response to a given communicative goal, it must be able to determine what to include in its text and how to organize this information so that it can be easily understood. In this paper, a computational model of discourse strategies is presented that can be used to guide the generation process in its decisions about what to say next. The model is based on an analysis of naturally occurring texts and represents strategies that can be used for three communicative goals: define, compare, and describe. We show how this model has been implemented in text, a system which generates paragraph-length responses to questions about database structure.