Last year the United States Food and Drug Administration (FDA) cleared 12 AI tools that use machine learning for health (ML4H) algorithms to inform medical diagnosis and treatment for patients. The tools are now allowed to be marketed, with millions of potential users in the US alone. Because ML4H tools directly affect human health, their development from experiments in labs to deployment in hospitals progresses under heavy scrutiny. A critical component of this process is reproducibility. A team of researchers from MIT, University of Toronto, New York University, and Evidation Health has proposed a number of "recommendations to data providers, academic publishers, and the ML4H research community in order to promote reproducible research moving forward" in their new paper, Reproducibility in Machine Learning for Health. Just as boxers show their strength in the ring by getting up again after being knocked to the canvas, researchers test their strength in the arena of science by ensuring their work's reproducibility.
In this paper, we discuss the approaches we took and the trade-offs involved in making a paper on a conceptual topic in pattern recognition research fully reproducible. We describe our definition of reproducibility, the tools used, and how the analysis was set up, show some examples of alternative analyses that the code enables, and discuss our views on reproducibility.
Being able to reproduce research is a key aspect of creating knowledge. If a study can be reproduced by another lab, then the validity of the findings is confirmed. This is particularly important in AI research, with its questions around explainable and trustworthy AI. There are a number of different ways to refer to reproducibility; in this piece we are actually referring to replicability, using the standard ACM definition. It refers to research that reuses the data and/or analysis in the hope of getting the same results.
As researchers and practitioners of applied machine learning, we are given a set of requirements on the problem to be solved, the plausibly obtainable data, and the computational resources available. We aim to find (within those bounds) reliably useful combinations of problem, data, and algorithm. An emphasis on algorithmic or technical novelty in ML conference publications leads to exploration of one dimension of this space. Data collection and ML deployment at scale in industry settings offer an environment for exploring the others. Our conferences and reviewing criteria can better support empirical ML by soliciting and incentivizing experimentation and synthesis independent of algorithmic innovation.