Making Sense of Data in the Wild: Data Analysis Automation at Scale

Graziani, Mara; Molnar, Malina; Morales, Irina Espejo; Cadow-Gossweiler, Joris; Laino, Teodoro

Jan-27-2025, arXiv.org Artificial Intelligence

As the volume of publicly available data continues to grow, researchers still face limited diversity in the datasets used to benchmark machine learning tasks. Although thousands of datasets are available in public repositories, their sheer abundance often complicates the search for suitable data, leaving many valuable datasets underexplored. The problem is compounded by the fact that, despite longstanding advocacy for better data curation, current solutions remain prohibitively time-consuming and resource-intensive. In this paper, we propose a novel approach that combines intelligent agents with retrieval-augmented generation to automate data analysis, dataset curation, and indexing at scale. Our system uses multiple agents to analyze raw, unstructured data across public repositories, generating dataset reports and interactive visual indexes that can be easily explored. We demonstrate that our approach yields more detailed dataset descriptions, higher hit rates, and greater diversity in dataset retrieval tasks. Additionally, we show that the dataset reports generated by our method can be used by other machine learning models to improve performance on specific tasks, such as increasing the accuracy and realism of synthetic data generation. By streamlining the transformation of raw data into machine-learning-ready datasets, our approach enables researchers to make better use of existing data resources.
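
To make the "hit rate in dataset retrieval" notion concrete, here is a minimal sketch of how retrieval over dataset reports could be scored. It is not the authors' system: the abstract does not describe their retriever or metric definition, so the bag-of-words cosine ranking, the function names (`build_index`, `retrieve`, `hit_rate`), and the toy reports and queries below are all hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(reports):
    """Map each dataset name to the token counts of its report text."""
    return {name: Counter(tokenize(text)) for name, text in reports.items()}


def retrieve(index, query, k=1):
    """Return the k dataset names whose reports best match the query."""
    q = Counter(tokenize(query))
    ranked = sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)
    return ranked[:k]


def hit_rate(index, queries, k=1):
    """Fraction of (query, target) pairs whose target is in the top k."""
    hits = sum(1 for query, target in queries if target in retrieve(index, query, k))
    return hits / len(queries)


# Hypothetical dataset reports (assumed for illustration, not from the paper).
reports = {
    "solar_output": "Hourly solar panel power output with irradiance and temperature readings.",
    "chest_xray": "Chest x-ray images labeled for pneumonia diagnosis.",
    "traffic_counts": "Vehicle counts per road segment recorded every fifteen minutes.",
}

# Hypothetical queries paired with the dataset each one should retrieve.
queries = [
    ("solar power irradiance time series", "solar_output"),
    ("pneumonia x-ray images", "chest_xray"),
    ("road vehicle counts", "traffic_counts"),
]

index = build_index(reports)
print(retrieve(index, "pneumonia x-ray images", k=1))
print(hit_rate(index, queries, k=1))
```

In this toy setup, a more detailed report gives the query more terms to match, which is one plausible reason richer dataset descriptions could raise hit rates. A production system would more likely rank with learned embeddings than with raw token counts.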

large language model, machine learning, natural language

Country:
- North America
  - United States
    - New York (0.14)
    - Massachusetts > Middlesex County (0.14)
    - Texas > Travis County
      - Austin (0.14)
    - California
      - Santa Clara County (0.14)
      - Alameda County > Berkeley (0.14)
  - Canada > Ontario
    - Toronto (0.14)
- Europe
  - Germany (0.28)
  - Sweden (0.14)
  - Italy (0.14)

Genre:
- Research Report
  - New Finding (0.67)
  - Experimental Study (0.46)

Industry:
- Health & Medicine (1.00)
- Transportation (0.93)
- Law > Intellectual Property & Technology Law (0.93)
- Information Technology (0.68)
- Automobiles & Trucks (0.67)
- Materials > Chemicals
  - Commodity Chemicals > Petrochemicals (1.00)
  - Industrial Gases (0.68)
- Energy
  - Renewable (0.93)
  - Oil & Gas > Upstream (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Agents (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.66)
