data-centric ai
Survey and Taxonomy: The Role of Data-Centric AI in Transformer-Based Time Series Forecasting
Xu, Jingjing, Wu, Caesar, Li, Yuan-Fang, Danoy, Gregoire, Bouvry, Pascal
Alongside the continuous process of improving AI performance through the development of more sophisticated models, researchers have also turned their attention to the emerging concept of data-centric AI, which emphasizes the important role of data in a systematic machine learning training process. Nonetheless, the development of models has also continued apace. One result of this progress is the Transformer architecture, which is highly capable in multiple domains such as Natural Language Processing (NLP), Computer Vision (CV) and Time Series Forecasting (TSF). Its performance is, however, heavily dependent on input data preprocessing and output data evaluation, justifying a data-centric approach to future research. We argue that data-centric AI is essential for training AI models efficiently, particularly transformer-based TSF models. However, there is a gap regarding the integration of transformer-based TSF and data-centric AI. This survey aims to pin down this gap via an extensive literature review based on the proposed taxonomy. We review previous research from a data-centric AI perspective, and we intend to lay the groundwork for the future development of transformer-based architectures and data-centric AI.
- North America > United States (0.28)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Oceania > Australia (0.04)
- Research Report (1.00)
- Overview (1.00)
- Health & Medicine (0.68)
- Energy (0.68)
Data-Centric AI in the Age of Large Language Models
Xu, Xinyi, Wu, Zhaoxuan, Qiao, Rui, Verma, Arun, Shu, Yao, Wang, Jingtan, Niu, Xinyuan, He, Zhenfeng, Chen, Jiangwei, Zhou, Zijian, Lau, Gregory Kang Ruey, Dao, Hieu, Agussurja, Lucas, Sim, Rachael Hwee Ling, Lin, Xiaoqiang, Hu, Wenyang, Dai, Zhongxiang, Koh, Pang Wei, Low, Bryan Kian Hsiang
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs, and yet it receives disproportionately little attention from the research community. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization. In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the research community and, where applicable, society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and document research efforts and results, which can help promote openness and transparency in AI and LLM research.
- Asia > Singapore (0.05)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Education (0.67)
Model-Based Data-Centric AI: Bridging the Divide Between Academic Ideals and Industrial Pragmatism
Park, Chanjun, Khang, Minsoo, Kim, Dahyun
This paper delves into the contrasting roles of data within academic and industrial spheres, highlighting the divergence between Data-Centric AI and Model-Agnostic AI approaches. We argue that while Data-Centric AI focuses on the primacy of high-quality data for model performance, Model-Agnostic AI prioritizes algorithmic flexibility, often at the expense of data quality considerations. This distinction reveals that academic standards for data quality frequently do not meet the rigorous demands of industrial applications, leading to potential pitfalls in deploying academic models in real-world settings. Through a comprehensive analysis, we address these disparities, presenting both the challenges they pose and strategies for bridging the gap. Furthermore, we propose a novel paradigm: Model-Based Data-Centric AI, which aims to reconcile these differences by integrating model considerations into data optimization processes. This approach underscores the necessity for evolving data requirements that are sensitive to the nuances of both academic research and industrial deployment. By exploring these discrepancies, we aim to foster a more nuanced understanding of data's role in AI development and encourage a convergence of academic and industrial standards to enhance AI's real-world applicability.
Data-Centric Artificial Intelligence
Jakubik, Johannes, Vössing, Michael, Kühl, Niklas, Walk, Jannis, Satzger, Gerhard
Data-centric artificial intelligence (data-centric AI) represents an emerging paradigm emphasizing that the systematic design and engineering of data is essential for building effective and efficient AI-based systems. The objective of this article is to introduce practitioners and researchers from the field of Information Systems (IS) to data-centric AI. We define relevant terms, provide key characteristics to contrast the data-centric paradigm to the model-centric one, and introduce a framework for data-centric AI. We distinguish data-centric AI from related concepts and discuss its longer-term implications for the IS community.
- North America > United States > Hawaii (0.04)
- Europe > Germany > Bavaria > Upper Franconia > Bayreuth (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark
Hansen, Lasse, Seedat, Nabeel, van der Schaar, Mihaela, Petrovic, Andrija
Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper addresses this issue by exploring the potential of integrating data-centric AI techniques which profile the data to guide the synthetic data generation process. Moreover, we shed light on the often ignored consequences of neglecting these data profiles during synthetic data generation -- despite seemingly high statistical fidelity. Subsequently, we propose a novel framework to evaluate the integration of data profiles to guide the creation of more representative synthetic data. In an empirical study, we evaluate the performance of five state-of-the-art models for tabular data generation on eleven distinct tabular datasets. The findings offer critical insights into the successes and limitations of current synthetic data generation techniques. Finally, we provide practical recommendations for integrating data-centric insights into the synthetic data generation process, with a specific focus on classification performance, model selection, and feature selection. This study aims to reevaluate conventional approaches to synthetic data generation and promote the application of data-centric AI techniques in improving the quality and effectiveness of synthetic data.
The Principles of Data-Centric AI
The role of data and its quality in supporting AI systems is gaining prominence and giving rise to the concept of data-centric AI (DCAI), which breaks away from widespread model-centric approaches. The flurry of conversation around DCAI can be credited to a recent campaign by Andrew Ng, an AI pioneer, and his colleagues. However, DCAI is a culmination of concerns and efforts around improving data quality in AI projects. DCAI can be understood as an emerging term for a wealth of preceding practices and research work around data quality that complements structured frameworks such as human-centered data science [4,5]. As such, the nature of 'data work' itself is not necessarily new [35].
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Data Science > Data Quality (0.67)
DataCI: A Platform for Data-Centric AI on Streaming Data
Zhang, Huaizheng, Huang, Yizheng, Li, Yuanming
We introduce DataCI, a comprehensive open-source platform designed specifically for data-centric AI in dynamic streaming data settings. DataCI provides 1) an infrastructure with rich APIs for seamless streaming dataset management and for data-centric pipeline development and evaluation in streaming scenarios, 2) a carefully designed version control function to track pipeline lineage, and 3) an intuitive graphical interface for a better interactive user experience. Preliminary studies and demonstrations attest to the ease of use and effectiveness of DataCI, highlighting its potential to revolutionize the practice of data-centric AI in streaming data contexts.
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Data Science (0.95)
- Information Technology > Communications > Networks (0.91)
- Information Technology > Human Computer Interaction (0.70)
Data-centric Artificial Intelligence: A Survey
Zha, Daochen, Bhat, Zaid Pervaiz, Lai, Kwei-Herng, Yang, Fan, Jiang, Zhimeng, Zhong, Shaochen, Hu, Xia
Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant and high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and the representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. We hope it can help the readers efficiently grasp a broad picture of this field, and equip them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated on https://github.com/daochenzha/data-centric-AI
- North America > United States > Florida > Hillsborough County > University (0.05)
- North America > United States > Texas > Brazos County > College Station (0.04)
- Europe > United Kingdom > England > Leicestershire > Leicester (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report (1.00)
- Overview (1.00)
What Are the Data-Centric AI Concepts behind GPT Models?
Artificial Intelligence (AI) has made incredible strides in transforming the way we live, work, and interact with technology. Recently, one area that has seen significant progress is the development of Large Language Models (LLMs), such as GPT-3, ChatGPT, and GPT-4. These models are capable of performing tasks such as language translation, text summarization, and question-answering with impressive accuracy. While it's difficult to ignore the increasing model size of LLMs, it's also important to recognize that their success is due largely to the large amount of high-quality data used to train them. In this article, we will present an overview of the recent advancements in LLMs from a data-centric AI perspective, drawing upon insights from our recent survey papers [1,2] with corresponding technical resources on GitHub.