Machine learning algorithms designed to characterize, monitor, and intervene on human health (ML4H) are expected to perform safely and reliably when operating at scale, potentially outside strict human supervision. This requirement warrants stricter attention to reproducibility than in other fields of machine learning. In this work, we conduct a systematic evaluation of over 100 recently published ML4H research papers along several dimensions related to reproducibility. We find that the field of ML4H compares poorly to more established machine learning fields, particularly concerning data and code accessibility. Finally, drawing from success in other fields of science, we propose recommendations to data providers, academic publishers, and the ML4H research community in order to promote reproducible research moving forward.
Being able to reproduce research is a key aspect of creating knowledge. If a study can be reproduced by another lab, then the validity of its findings is confirmed. This is particularly important in AI research, with its questions around explainable and trustworthy AI. There are a number of different ways to refer to reproducibility; in this piece we are referring to replicability under the standard ACM definition, that is, research that reuses the data and/or analysis in the hope of obtaining the same results.
In this paper, we discuss the approaches we took and the trade-offs involved in making a paper on a conceptual topic in pattern recognition research fully reproducible. We cover our definition of reproducibility, the tools used, and how the analysis was set up; show some examples of alternative analyses the code enables; and give our views on reproducibility.
Although the importance of multiple studies corroborating a given result is acknowledged in virtually all of the sciences (Figure 1), the modern use of "reproducible research" was originally applied not to corroboration, but to transparency, with application in the computational sciences. Computer scientist Jon Claerbout coined the term and associated it with a software platform and set of procedures that permit the reader of a paper to see the entire processing trail from the raw data and code to figures and tables (4). This concept has been carried forward into many data-intensive domains, including epidemiology (5), computational biology (6), economics (7), and clinical trials (8). According to a U.S. National Science Foundation (NSF) subcommittee on replicability in science (9), "reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results…. Reproducibility is a minimum necessary condition for a finding to be believable and informative."
Reproducibility and replicability are cornerstones of scientific inquiry. Although there is some debate on terminology and definitions, if something is reproducible, the same result can be recreated by following a specific set of steps with a consistent dataset. If something is replicable, the same conclusions or outcomes can be found using slightly different data or processes. Without reproducibility, processes and findings cannot be verified. Without replicability, it is difficult to trust the findings of a single study.
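The distinction can be illustrated with a minimal sketch. It is a toy example built on simulated data, not an analysis from any of the studies discussed here; the function names (`draw_sample`, `analyze`), the seeds, the sample size, and the effect size are all assumptions chosen for illustration. Rerunning the same steps on the same data gives an identical number (reproducible), while running the same steps on freshly drawn data gives a different number but the same conclusion (replicable).

```python
import random
import statistics


def draw_sample(seed, n=500, treated_shift=1.0):
    """Simulate one hypothetical study: control and treated measurements.

    The seed stands in for "the dataset": the same seed yields the same data.
    """
    rng = random.Random(seed)
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(treated_shift, 1.0) for _ in range(n)]
    return control, treated


def analyze(control, treated):
    """The fixed set of analysis steps: difference in group means."""
    return statistics.mean(treated) - statistics.mean(control)


# Reproducibility: same data and same steps give the identical result.
original = analyze(*draw_sample(seed=1))
rerun = analyze(*draw_sample(seed=1))
reproducible = original == rerun

# Replicability: slightly different data, same steps, same conclusion
# (here the conclusion is simply "the treated group scores higher").
new_study = analyze(*draw_sample(seed=2))
replicable = (original > 0) == (new_study > 0)

print(f"original={original:.3f} rerun={rerun:.3f} reproducible={reproducible}")
print(f"new_study={new_study:.3f} replicable={replicable}")
```

In practice, the fixed seed corresponds to archiving the exact dataset and code version, while the second seed corresponds to an independent sample collected by another team.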