Data Acquisition: A New Frontier in Data-centric AI

Chen, Lingjiao, Acun, Bilge, Ardalani, Newsha, Sun, Yifan, Kang, Feiyang, Lyu, Hanrui, Kwon, Yongchan, Jia, Ruoxi, Wu, Carole-Jean, Zaharia, Matei, Zou, James

Nov-22-2023–arXiv.org Artificial Intelligence

Datasets, the cornerstone of modern machine learning (ML) systems, are increasingly bought and sold for use in different ML pipelines [2]. Several data marketplaces have emerged to serve different stages of building ML-enhanced data applications. For example, NASDAQ Data Link [3] offers financial datasets cleaned and structured for model training, Amazon AWS Data Exchange [4] focuses on generic tabular datasets, and Databricks Marketplace [5] integrates raw datasets and ML pipelines to deliver insights. The data-as-a-service market size was more than 30 billion and is expected to double in the next five years [6]. Although data marketplaces are expanding, data acquisition for ML remains challenging, partly because of its ad hoc nature: based on discussions with real-world users, data acquirers often need to first negotiate varying contracts with different data providers, then purchase multiple datasets in different formats, and finally filter out unnecessary data from the purchased datasets.

Country:
- North America > United States
  - Virginia > Montgomery County
    - Blacksburg (0.04)
  - New York > New York County
    - New York City (0.04)
  - California
    - Alameda County > Berkeley (0.14)
    - San Mateo County > Menlo Park (0.04)
    - Santa Clara County
      - Palo Alto (0.04)
      - Stanford (0.04)

Genre:
- Research Report (0.82)

Industry:
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.88)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)