From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience

Zhiwei Li, Carl Kesselman, Tran Huy Nguyen, Benjamin Yixing Xu, Kyle Bolo, Kimberley Yu

arXiv.org Artificial Intelligence 

Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.

As machine learning (ML) becomes increasingly central to scientific discovery, concerns about correctness and reproducibility have grown [1]. In eScience, ML development is typically a collaborative and iterative process involving domain experts, data engineers, and ML researchers. These teams refine models based on evolving hypotheses and new data, creating feedback loops across data curation, feature engineering, modeling, and evaluation [2]. This dynamic process frequently introduces data cascades, where early curation errors propagate downstream, compounding over time [3]. In practice, ML workflows remain fragmented: datasets are shared informally, experiments span personal and cloud environments, and data, code, and configurations are often loosely coupled [4].
While MLOps and data management tools address parts of this problem, such as code versioning, pipeline orchestration, or environment encapsulation, they often overlook the full scientific lifecycle and the socio-technical realities of collaborative ML projects [5]. In prior work, we introduced Deriva-ML [6], a socio-technical platform that extends the FAIR principles (Findable, Accessible, Interoperable, Reusable) [7] across the ML development lifecycle.