AITopics | data-centric ai

data-centric ai

Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark

Neural Information Processing Systems | Dec-25-2025, 21:22:53 GMT

Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper addresses this issue by exploring the potential of integrating data-centric AI techniques which profile the data to guide the synthetic data generation process. Moreover, we shed light on the often ignored consequences of neglecting these data profiles during synthetic data generation, despite seemingly high statistical fidelity. Subsequently, we propose a novel framework to evaluate the integration of data profiles to guide the creation of more representative synthetic data. In an empirical study, we evaluate the performance of five state-of-the-art models for tabular data generation on eleven distinct tabular datasets. The findings offer critical insights into the successes and limitations of current synthetic data generation techniques. Finally, we provide practical recommendations for integrating data-centric insights into the synthetic data generation process, with a specific focus on classification performance, model selection, and feature selection. This study aims to reevaluate conventional approaches to synthetic data generation and promote the application of data-centric AI techniques in improving the quality and effectiveness of synthetic data.

data-centric ai, name change, reimagining synthetic tabular data generation, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark

Neural Information Processing Systems | Jan-19-2025, 01:41:34 GMT

comprehensive benchmark, data-centric ai, reimagining synthetic tabular data generation, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Survey and Taxonomy: The Role of Data-Centric AI in Transformer-Based Time Series Forecasting

Xu, Jingjing, Wu, Caesar, Li, Yuan-Fang, Danoy, Gregoire, Bouvry, Pascal

arXiv.org Artificial Intelligence | Jul-29-2024

Alongside the continuous process of improving AI performance through the development of more sophisticated models, researchers have also turned their attention to the emerging concept of data-centric AI, which emphasizes the important role of data in a systematic machine learning training process. Nonetheless, the development of models has also continued apace. One result of this progress is the development of the Transformer architecture, which possesses a high level of capability in multiple domains such as Natural Language Processing (NLP), Computer Vision (CV) and Time Series Forecasting (TSF). Its performance is, however, heavily dependent on input data preprocessing and output data evaluation, justifying a data-centric approach to future research. We argue that data-centric AI is essential for training AI models efficiently, particularly transformer-based TSF models. However, there is a gap regarding the integration of transformer-based TSF and data-centric AI. This survey aims to pin down this gap via an extensive literature review based on the proposed taxonomy. We review previous research from a data-centric AI perspective, and we intend to lay the groundwork for the future development of transformer-based architectures and data-centric AI.

forecasting, time series forecasting, transformer, (6 more...)

arXiv.org Artificial Intelligence

2407.19784

Country:

North America > United States (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Oceania > Australia (0.04)
(4 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Health & Medicine (0.68)
Energy (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Data-Centric AI in the Age of Large Language Models

Xu, Xinyi, Wu, Zhaoxuan, Qiao, Rui, Verma, Arun, Shu, Yao, Wang, Jingtan, Niu, Xinyuan, He, Zhenfeng, Chen, Jiangwei, Zhou, Zijian, Lau, Gregory Kang Ruey, Dao, Hieu, Agussurja, Lucas, Sim, Rachael Hwee Ling, Lin, Xiaoqiang, Hu, Wenyang, Dai, Zhongxiang, Koh, Pang Wei, Low, Bryan Kian Hsiang

arXiv.org Artificial Intelligence | Jun-20-2024

This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs, and yet it receives disproportionately little attention from the research community. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization. In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the research community and, where applicable, society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and document research efforts and results, which can help promote openness and transparency in AI and LLM research.

language model, llm, proc, (14 more...)

arXiv.org Artificial Intelligence

2406.14473

Country:

Asia > Singapore (0.05)
Asia > Middle East > Jordan (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Model-Based Data-Centric AI: Bridging the Divide Between Academic Ideals and Industrial Pragmatism

Park, Chanjun, Khang, Minsoo, Kim, Dahyun

arXiv.org Artificial Intelligence | Mar-4-2024

This paper delves into the contrasting roles of data within academic and industrial spheres, highlighting the divergence between Data-Centric AI and Model-Agnostic AI approaches. We argue that while Data-Centric AI focuses on the primacy of high-quality data for model performance, Model-Agnostic AI prioritizes algorithmic flexibility, often at the expense of data quality considerations. This distinction reveals that academic standards for data quality frequently do not meet the rigorous demands of industrial applications, leading to potential pitfalls in deploying academic models in real-world settings. Through a comprehensive analysis, we address these disparities, presenting both the challenges they pose and strategies for bridging the gap. Furthermore, we propose a novel paradigm: Model-Based Data-Centric AI, which aims to reconcile these differences by integrating model considerations into data optimization processes. This approach underscores the necessity for evolving data requirements that are sensitive to the nuances of both academic research and industrial deployment. By exploring these discrepancies, we aim to foster a more nuanced understanding of data's role in AI development and encourage a convergence of academic and industrial standards to enhance AI's real-world applicability.

application, arxiv preprint arxiv, dataset, (12 more...)

arXiv.org Artificial Intelligence

2403.01832

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.73)

Data-Centric Artificial Intelligence

Jakubik, Johannes, Vössing, Michael, Kühl, Niklas, Walk, Jannis, Satzger, Gerhard

arXiv.org Artificial Intelligence | Jan-18-2024

Data-centric artificial intelligence (data-centric AI) represents an emerging paradigm emphasizing that the systematic design and engineering of data is essential for building effective and efficient AI-based systems. The objective of this article is to introduce practitioners and researchers from the field of Information Systems (IS) to data-centric AI. We define relevant terms, provide key characteristics to contrast the data-centric paradigm to the model-centric one, and introduce a framework for data-centric AI. We distinguish data-centric AI from related concepts and discuss its longer-term implications for the IS community.

ai-based system, data-centric ai, paradigm, (14 more...)

arXiv.org Artificial Intelligence

2212.11854

Country:

North America > United States > Hawaii (0.04)
Europe > Germany > Bavaria > Upper Franconia > Bayreuth (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Information Technology > Services (0.46)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark

Hansen, Lasse, Seedat, Nabeel, van der Schaar, Mihaela, Petrovic, Andrija

arXiv.org Artificial Intelligence | Oct-25-2023

Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper addresses this issue by exploring the potential of integrating data-centric AI techniques which profile the data to guide the synthetic data generation process. Moreover, we shed light on the often ignored consequences of neglecting these data profiles during synthetic data generation, despite seemingly high statistical fidelity. Subsequently, we propose a novel framework to evaluate the integration of data profiles to guide the creation of more representative synthetic data. In an empirical study, we evaluate the performance of five state-of-the-art models for tabular data generation on eleven distinct tabular datasets. The findings offer critical insights into the successes and limitations of current synthetic data generation techniques. Finally, we provide practical recommendations for integrating data-centric insights into the synthetic data generation process, with a specific focus on classification performance, model selection, and feature selection. This study aims to reevaluate conventional approaches to synthetic data generation and promote the application of data-centric AI techniques in improving the quality and effectiveness of synthetic data.

comprehensive benchmark, data-centric ai, reimagining synthetic tabular data generation

arXiv.org Artificial Intelligence

2310.16981

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

The Principles of Data-Centric AI

Communications of the ACM | Jul-31-2023, 22:21:34 GMT

The role of data and its quality in supporting AI systems is gaining prominence and giving rise to the concept of data-centric AI (DCAI), which breaks away from widespread model-centric approaches. The flurry of conversation around DCAI can be credited to a recent campaign by Andrew Ng, an AI pioneer, and his colleagues. However, DCAI is a culmination of concerns and efforts around improving data quality in AI projects. DCAI can be understood as an emerging term for a wealth of preceding practices and research work around data quality that complements structured frameworks such as human-centered data science [4,5]. As such, the nature of 'data work' itself is not necessarily new [35].

ai system, data quality, data-centric ai, (6 more...)

Communications of the ACM

Technology:

Information Technology > Artificial Intelligence (1.00)
Information Technology > Data Science > Data Quality (0.67)

DataCI: A Platform for Data-Centric AI on Streaming Data

Zhang, Huaizheng, Huang, Yizheng, Li, Yuanming

arXiv.org Artificial Intelligence | Jul-3-2023

We introduce DataCI, a comprehensive open-source platform designed specifically for data-centric AI in dynamic streaming data settings. DataCI provides 1) an infrastructure with rich APIs for seamless streaming dataset management and for data-centric pipeline development and evaluation in streaming scenarios, 2) a carefully designed version control function to track pipeline lineage, and 3) an intuitive graphical interface for a better interactive user experience. Preliminary studies and demonstrations attest to the ease of use and effectiveness of DataCI, highlighting its potential to revolutionize the practice of data-centric AI in streaming data contexts.

artificial intelligence, human computer interaction, pipeline, (14 more...)

arXiv.org Artificial Intelligence

2306.15538

Country: North America > United States > Hawaii > Honolulu County > Honolulu (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence (1.00)
Information Technology > Data Science (0.95)
Information Technology > Communications > Networks (0.91)
Information Technology > Human Computer Interaction (0.70)

Data-centric Artificial Intelligence: A Survey

Zha, Daochen, Bhat, Zaid Pervaiz, Lai, Kwei-Herng, Yang, Fan, Jiang, Zhimeng, Zhong, Shaochen, Hu, Xia

arXiv.org Artificial Intelligence | Jun-11-2023

Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant and high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and the representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. We hope it can help the readers efficiently grasp a broad picture of this field, and equip them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated on https://github.com/daochenzha/data-centric-AI

arxiv preprint arxiv, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2303.10158

Country:

North America > United States > Florida > Hillsborough County > University (0.05)
North America > United States > Texas > Brazos County > College Station (0.04)
Europe > United Kingdom > England > Leicestershire > Leicester (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (0.92)
Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(6 more...)
