Machine learning algorithms designed to characterize, monitor, and intervene on human health (ML4H) are expected to perform safely and reliably when operating at scale, potentially outside strict human supervision. This requirement warrants stricter attention to issues of reproducibility than in other fields of machine learning. In this work, we conduct a systematic evaluation of over 100 recently published ML4H research papers along several dimensions related to reproducibility. We find that the field of ML4H compares poorly to more established machine learning fields, particularly concerning data and code accessibility. Finally, drawing from success in other fields of science, we propose recommendations to data providers, academic publishers, and the ML4H research community in order to promote reproducible research moving forward.
In this paper, we discuss the approaches we took, and the trade-offs involved, in making a paper on a conceptual topic in pattern recognition research fully reproducible. We present our definition of reproducibility, the tools used, and how the analysis was set up; show some examples of alternative analyses the code enables; and discuss our views on reproducibility.
Although the importance of multiple studies corroborating a given result is acknowledged in virtually all of the sciences (Figure 1), the modern use of "reproducible research" was originally applied not to corroboration, but to transparency, with application in the computational sciences. Computer scientist Jon Claerbout coined the term and associated it with a software platform and set of procedures that permit the reader of a paper to see the entire processing trail from the raw data and code to figures and tables (4). This concept has been carried forward into many data-intensive domains, including epidemiology (5), computational biology (6), economics (7), and clinical trials (8). According to a U.S. National Science Foundation (NSF) subcommittee on replicability in science (9), "reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results…. Reproducibility is a minimum necessary condition for a finding to be believable and informative."
Surprisingly, these two aspects are often underestimated or not even considered when setting up scientific experimental pipelines. In this context, one of the main threats to replicability is selection bias, that is, the error introduced when choosing the individuals or groups that take part in a study. Selection bias comes in different flavours: the selection of the population of samples in the dataset (sample bias); the selection of the features used by the learning models, which is particularly sensitive in high-dimensional settings; and the selection of the hyperparameters that perform best on specific dataset(s). If not properly considered, selection bias may strongly affect the validity of the derived conclusions, as well as the reliability of the learning model. In this talk I will provide a solid introduction to the topics of reproducibility and selection bias, with examples taken from biomedical research, in which reliability is paramount.
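The feature-selection flavour of selection bias can be illustrated with a small, self-contained sketch. It is not taken from the talk: the function names (`make_noise_data`, `top_k_features`, `nearest_centroid_predict`, `cv_accuracy`), the mean-difference feature ranking, and the nearest-centroid classifier are all assumptions chosen for brevity. The data are pure noise, so no feature is truly predictive. Selecting features on the full dataset before cross-validation lets the test labels leak into the pipeline and typically yields an accuracy well above chance, whereas repeating the selection inside each training fold typically stays close to 50%.

```python
import random
import statistics


def make_noise_data(n_samples=40, n_features=1000, seed=0):
    """Pure-noise features with balanced labels: no feature is truly predictive."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def top_k_features(X, y, rows, k):
    """Rank features by absolute difference of class means, using only `rows`."""
    scores = []
    for j in range(len(X[0])):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        scores.append((abs(statistics.fmean(pos) - statistics.fmean(neg)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_predict(X, y, train_rows, test_row, features):
    """Predict the class whose training centroid is closest on `features`."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [i for i in train_rows if y[i] == label]
        dist = 0.0
        for j in features:
            centroid = statistics.fmean([X[i][j] for i in members])
            dist += (X[test_row][j] - centroid) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_accuracy(X, y, k=10, n_folds=5, select_inside_folds=True):
    """Cross-validated accuracy with feature selection inside or outside the folds."""
    all_rows = list(range(len(X)))
    folds = [all_rows[f::n_folds] for f in range(n_folds)]
    # Selecting on all rows uses the labels of future test samples (biased).
    global_features = top_k_features(X, y, all_rows, k)
    correct = 0
    for test_rows in folds:
        train_rows = [i for i in all_rows if i not in test_rows]
        if select_inside_folds:
            features = top_k_features(X, y, train_rows, k)
        else:
            features = global_features
        for i in test_rows:
            pred = nearest_centroid_predict(X, y, train_rows, i, features)
            correct += int(pred == y[i])
    return correct / len(X)


if __name__ == "__main__":
    X, y = make_noise_data()
    print("selection on all data (biased):", cv_accuracy(X, y, select_inside_folds=False))
    print("selection inside each fold:    ", cv_accuracy(X, y, select_inside_folds=True))
```

The same reasoning applies to hyperparameters: any choice tuned on the data later used for evaluation should be repeated inside the training portion of each split, for example with a nested cross-validation loop.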
Artificial Intelligence is growing up fast. Although modern computers were only invented in the mid-20th century, they have already evolved into the complex machines we rely on today. Artificial Intelligence now governs a large proportion of consumer and business behaviour, from the way we use the internet and manufacture goods to the way we hire and fire our workforces. However, as with any technology, when things grow too quickly, problems can arise. Artificial Intelligence as a scientific discipline might be struggling to keep up with the pace of change.