AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark
Li, Lan; Fang, Liri; Torvik, Vetle I.
arXiv.org Artificial Intelligence
We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data cleaning workflows. To evaluate LLMs' ability to complete data cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose (expressed as a query), the pipeline generates a minimal, clean table sufficient to address the purpose, along with the data cleaning workflow used to produce that table. The planning process involves three main LLM-driven components: (1) Select Target Columns: identifies a set of target columns related to the purpose. (2) Inspect Column Quality: assesses the data quality of each target column and generates a Data Quality Report whose findings serve as operation objectives. (3) Generate Operation & Arguments: predicts the next operation and its arguments based on the Data Quality Report. Additionally, we propose a data cleaning benchmark to evaluate the capability of LLM agents to automatically generate workflows that address data cleaning purposes of varying difficulty levels. The benchmark comprises annotated datasets, each a collection of a purpose, raw table, clean table, data cleaning workflow, and answer set. In our experiments, we evaluated three LLMs that auto-generate purpose-driven data cleaning workflows. The results indicate that LLMs perform well in planning and generating data cleaning workflows without the need for fine-tuning.
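The three-component planning loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: each LLM call is replaced by a simple rule-based stand-in so the example runs on its own, and every function name, operation name (`normalize_text`, `drop_missing`, `remove_duplicates`), and the report layout is a hypothetical choice made for illustration only.

```python
from typing import Optional

Table = list[dict[str, Optional[str]]]

def select_target_columns(table: Table, purpose: str) -> list[str]:
    # Stand-in for LLM component (1): keep columns named in the purpose.
    return [c for c in table[0] if c.lower() in purpose.lower()]

def quality_report(table: Table, columns: list[str]) -> dict:
    # Stand-in for LLM component (2): flag the three issue types per column/table.
    report = {}
    for col in columns:
        values = [row[col] for row in table]
        present = [v for v in values if v]  # drops None and empty strings
        report[col] = {
            "missing": len(present) < len(values),
            "inconsistent_format": any(v != v.strip().lower() for v in present),
        }
    rows = [tuple(row[c] for c in columns) for row in table]
    report["__table__"] = {"duplicates": len(set(rows)) < len(rows)}
    return report

def next_operation(report: dict, columns: list[str]):
    # Stand-in for LLM component (3): pick the next operation and its argument.
    for col in columns:
        if report[col]["inconsistent_format"]:
            return ("normalize_text", col)
        if report[col]["missing"]:
            return ("drop_missing", col)
    if report["__table__"]["duplicates"]:
        return ("remove_duplicates", None)
    return None  # nothing left to repair

def apply_operation(table: Table, op: str, col: Optional[str]) -> Table:
    if op == "normalize_text":
        return [{**r, col: r[col].strip().lower() if r[col] else r[col]} for r in table]
    if op == "drop_missing":
        return [r for r in table if r[col]]
    if op == "remove_duplicates":
        seen, out = set(), []
        for r in table:
            key = tuple(r.items())
            if key not in seen:
                seen.add(key)
                out.append(r)
        return out
    raise ValueError(f"unknown operation: {op}")

def generate_workflow(table: Table, purpose: str):
    columns = select_target_columns(table, purpose)
    # Minimal table: keep only the columns the purpose needs.
    table = [{c: row[c] for c in columns} for row in table]
    workflow = []
    while (step := next_operation(quality_report(table, columns), columns)) is not None:
        table = apply_operation(table, *step)
        workflow.append(step)
    return table, workflow

dirty = [
    {"city": " Chicago", "year": "2020", "note": "x"},
    {"city": "chicago", "year": "2020", "note": "y"},
    {"city": None, "year": "2021", "note": "z"},
    {"city": "URBANA", "year": "2021", "note": ""},
]
clean, workflow = generate_workflow(dirty, "How many records per city and year?")
print(clean)
print(workflow)
```

The loop re-inspects the table after every operation, so each step is chosen from a fresh quality report rather than from a plan fixed in advance. In a real pipeline the three stand-in functions would presumably be replaced by prompted LLM calls.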
Dec-12-2024
- Country:
  - Africa > Ethiopia
    - Addis Ababa > Addis Ababa (0.04)
  - Asia > China (0.04)
  - North America > United States
    - Illinois
      - Champaign County > Urbana (0.04)
      - Cook County > Chicago (0.05)
    - Michigan (0.04)
    - New York (0.04)
  - South America > Chile
- Genre:
  - Research Report > New Finding (0.66)
  - Workflow (1.00)
- Industry:
- Technology: