Artificial Intelligence is growing up fast. Although modern computers were invented only in the mid-20th century, they have already evolved into the complex machines we rely on today. Artificial Intelligence now governs a large proportion of consumer and business behaviour, from the way we use the internet and manufacture goods to the way we hire and fire our workforces. However, as with any technology, problems can arise when things grow too quickly. Artificial Intelligence as a scientific discipline might be struggling to keep up with the pace of change.
Machine learning algorithms designed to characterize, monitor, and intervene on human health (ML4H) are expected to perform safely and reliably when operating at scale, potentially outside strict human supervision. This requirement warrants stricter attention to issues of reproducibility than in other fields of machine learning. In this work, we conduct a systematic evaluation of over 100 recently published ML4H research papers along several dimensions related to reproducibility. We find that the field of ML4H compares poorly to more established machine learning fields, particularly concerning data and code accessibility. Finally, drawing from success in other fields of science, we propose recommendations to data providers, academic publishers, and the ML4H research community in order to promote reproducible research moving forward.
Although the importance of multiple studies corroborating a given result is acknowledged in virtually all of the sciences (Figure 1), the modern use of "reproducible research" was originally applied not to corroboration, but to transparency, with application in the computational sciences. Computer scientist Jon Claerbout coined the term and associated it with a software platform and set of procedures that permit the reader of a paper to see the entire processing trail from the raw data and code to figures and tables (4). This concept has been carried forward into many data-intensive domains, including epidemiology (5), computational biology (6), economics (7), and clinical trials (8). According to a U.S. National Science Foundation (NSF) subcommittee on replicability in science (9), "reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results…. Reproducibility is a minimum necessary condition for a finding to be believable and informative."
According to the Big Five theory of personality, personality traits can be organized into five primary dimensions, including extraversion, agreeableness, conscientiousness, neuroticism, and openness to experience. These dimensions are associated with myriad life outcomes, such as job satisfaction or health. Soto conducted a replication study of 78 previously published personality–life outcome findings after high-profile failures by others to replicate studies in other areas of psychology. Of personality–life outcome effects, 87% replicated successfully with effect sizes that were 77% as large as those in the original studies. Replicability was predicted by features of the original studies and the replication studies.
There is a lively debate all over the world regarding AI's perceived "black box" problem. Most profoundly, if a machine can be taught to learn on its own, how does it explain its conclusions? This issue comes up most frequently in the context of how to address possible algorithmic bias. One way to address this issue is to mandate a right to a human decision per the General Data Protection Regulation's (GDPR) Article 22. Here in the United States, Senators Wyden and Booker propose in the Algorithmic Accountability Act that companies be compelled to conduct impact assessments.