 maxcompute


Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models

Chen, Daoyuan, Huang, Yilun, Pan, Xuchen, Jiang, Nana, Wang, Haibin, Ge, Ce, Chen, Yushuo, Zhang, Wenhao, Ma, Zhijian, Zhang, Yilei, Huang, Jun, Lin, Wei, Li, Yaliang, Ding, Bolin, Zhou, Jingren

arXiv.org Artificial Intelligence

The burgeoning field of foundation models necessitates advanced data processing mechanisms capable of harnessing the vast, valuable, and varied data these models use. The current landscape, however, presents unique challenges that traditional data processing frameworks cannot handle effectively, especially where multimodal data is involved. In response, we present Data-Juicer 2.0, a new system offering rich data processing capabilities backed by over a hundred operators spanning modalities such as text, image, audio, and video. With seamless compatibility with, and dedicated optimization for, popular dataset hubs like Hugging Face and computing engines like Ray, Data-Juicer 2.0 improves on its predecessor in usability, efficiency, and programmability. It features an easily accessible user interface layer that supports decoupled Python interactions, RESTful APIs, and conversational commands. Alongside this, it contains a core runtime layer optimized for adaptive execution and management across different dataset scales, processing demands, and computational environments, while shielding users from unnecessary system details. Extensive empirical evaluations demonstrate Data-Juicer 2.0's performance and scalability, highlighting its capability to efficiently process tens of billions of data samples with tens of thousands of CPU cores. The system is publicly available, actively maintained, and broadly adopted in diverse research endeavors, practical applications, and real-world products such as Alibaba Cloud PAI.


Integrating AI with SaaS-based Cloud Data Warehouses

#artificialintelligence

In this article, Meng Shuo, MaxCompute product manager of the Alibaba Cloud business unit, discusses what defines a SaaS-based cloud data warehouse integrated with AI. Artificial intelligence (AI) is a concept that emerged as early as the 1950s. For various reasons, it then lay dormant for decades. It was not until the last few years that AI became popular again. In fact, AI has enjoyed three "golden periods" of development in its history.