Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development

Daoyuan Chen, Haibin Wang, Yilun Huang, Ce Ge, Yaliang Li, Bolin Ding, Jingren Zhou

arXiv.org Artificial Intelligence 

The emergence of large-scale multi-modal generative models has drastically advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging due to the historically isolated paths of model-centric and data-centric development, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a novel sandbox suite tailored for integrated data-model co-development. This sandbox provides a comprehensive experimental platform, enabling rapid iteration and insight-driven refinement of both data and models. Our proposed "Probe-Analyze-Refine" workflow, validated through applications on state-of-the-art LLaVA-like and DiT-based models, yields significant performance boosts, such as topping the VBench leaderboard. We also uncover fruitful insights gleaned from exhaustive benchmarks, shedding light on the critical interplay between data quality, diversity, and model behavior. With the hope of fostering deeper understanding and future progress in multi-modal data and generative modeling, our codes, datasets, and models are maintained and accessible at https://github.

The advent of multi-modal generative models has revolutionized artificial intelligence, pushing the boundaries of functionality and creativity across various domains (OpenAI, 2024a;b; Wang et al., 2024). Recognizing the pivotal role of training data in shaping model performance, there are fast-growing efforts to curate datasets of larger scale and higher quality (Jakubik et al., 2024). However, the development trajectories of these models and datasets have historically diverged, guided more by intuition than by systematic co-development methodologies. Recent advances in enhancing multi-modal generative models tend to be either model-centric or data-centric, rarely bridging the two aspects cohesively.
For example, model-centric methods focus on algorithmic enhancements and architectural innovations under fixed data priors, while data-centric strategies usually concentrate on processing and cleaning datasets independently of specific model training contexts (Qin et al., 2024). Both approaches usually suffer from a lack of systematic guidance and cooperative synergy, relying heavily on heuristic exploration and single-perspective expertise. This fragmented landscape presents a significant barrier to achieving optimal model performance, as the interplay between data characteristics and model capabilities remains largely underexploited. Moreover, the practical implementation of multi-modal generative models is further complicated by infrastructure constraints, escalating computational costs, and the accelerating pace of development cycles (Xu et al., 2024b).
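The co-development loop motivated above can be pictured as a simple iterative skeleton. This is a conceptual sketch only: the function names `probe`, `analyze`, and `refine` and the toy metrics are illustrative placeholders, not the actual Data-Juicer API.

```python
# Conceptual sketch of a "Probe-Analyze-Refine" style co-development loop.
# All names and metrics below are illustrative placeholders, not Data-Juicer's API.

def probe(dataset):
    """Run a cheap, small-scale measurement of how the current data recipe looks.

    In practice this would involve training/evaluating a small proxy model;
    here we just compute a toy 'quality' signal from per-sample scores.
    """
    avg = sum(d["score"] for d in dataset) / len(dataset)
    return {"quality": avg}

def analyze(signals):
    """Turn probe signals into a concrete data-processing decision."""
    return "filter_low_quality" if signals["quality"] < 0.8 else "keep"

def refine(dataset, decision):
    """Apply the chosen refinement to produce an improved dataset."""
    if decision == "filter_low_quality":
        return [d for d in dataset if d["score"] >= 0.8]
    return dataset

# Toy dataset: each record carries a precomputed quality score.
dataset = [{"score": 0.9}, {"score": 0.4}]
for _ in range(3):  # iterate until the recipe stabilizes or a budget is spent
    signals = probe(dataset)
    decision = analyze(signals)
    dataset = refine(dataset, decision)
```

After the first iteration the low-scoring sample is filtered out, the probe signal rises above the threshold, and subsequent iterations leave the dataset unchanged, mirroring how insight-driven refinement converges in the sandbox.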
