Data Readiness for Scientific AI at Scale
Wesley Brewer, Patrick Widener, Valentine Anantharaj, Feiyi Wang, Tom Beck, Arjun Shankar, Sarp Oral
arXiv.org Artificial Intelligence
This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains (climate, nuclear fusion, bio/health, and materials) to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high-performance computing (HPC) environments. The framework outlines key challenges in transforming scientific data for scalable AI training, with an emphasis on transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.
Aug-1-2025
- Country:
- North America > United States > Tennessee > Anderson County > Oak Ridge (0.15)
- Genre:
- Research Report (1.00)