
Collaborating Authors: Gundersen, Odd Erik


The Unreasonable Effectiveness of Open Science in AI: A Replication Study

arXiv.org Artificial Intelligence

A reproducibility crisis has been reported in science, but the extent to which it affects AI research is not yet fully understood. Therefore, we performed a systematic replication study including 30 highly cited AI studies, relying on original materials when available. In the end, eight articles were rejected because they required access to data or hardware that was practically impossible to acquire as part of the project. Six articles were successfully reproduced, while five were partially reproduced. In total, 50% of the articles included were reproduced to some extent. The availability of code and data correlates strongly with reproducibility: 86% of articles that shared code and data were fully or partly reproduced, while this was true for 33% of articles that shared only data. The quality of the data documentation correlates with successful replication; poorly documented or mis-specified data will probably result in unsuccessful replication. Surprisingly, the quality of the code documentation does not correlate with successful replication. Whether the code is poorly documented, partially missing, or not versioned does not matter for successful replication, as long as the code is shared. This study emphasizes the effectiveness of open science and the importance of properly documenting data work.
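The 50% figure follows directly from the counts stated in the abstract (30 studies, 8 excluded, 6 reproduced, 5 partially reproduced); the short arithmetic check below is an illustration of that derivation, not material from the paper itself:

```python
# Sanity check of the headline numbers reported in the abstract
# (30 studies surveyed, 8 excluded, 6 reproduced, 5 partially reproduced).
total_studies = 30
excluded = 8          # required inaccessible data or hardware
reproduced = 6
partially_reproduced = 5

included = total_studies - excluded              # 22 articles examined
some_extent = reproduced + partially_reproduced  # 11 articles
share = some_extent / included                   # 0.5

print(f"{some_extent}/{included} = {share:.0%} reproduced to some extent")
```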


Probing the Robustness of Time-series Forecasting Models with CounterfacTS

arXiv.org Artificial Intelligence

A common issue for machine learning models applied to time-series forecasting is the temporal evolution of the data distributions (i.e., concept drift). Because most of the training data does not reflect such changes, the models perform poorly on the new out-of-distribution scenarios, and therefore the impact of such events cannot be reliably anticipated ahead of time. We present and publicly release CounterfacTS, a tool to probe the robustness of deep learning models in time-series forecasting tasks via counterfactuals. CounterfacTS has a user-friendly interface that allows the user to visualize, compare and quantify time series data and their forecasts, for a number of datasets and deep learning models. Furthermore, the user can apply various transformations to the time series and explore the resulting changes in the forecasts in an interpretable manner. Through example cases, we illustrate how CounterfacTS can be used to i) identify the main features characterizing and differentiating sets of time series, ii) assess how the model performance depends on these characteristics, and iii) guide transformations of the original time series to create counterfactuals with desired properties for training and increasing the forecasting performance in new regions of the data distribution. We discuss the importance of visualizing and considering the location of the data in a projected feature space to transform time series and create effective counterfactuals for training the models. Overall, CounterfacTS aids in creating counterfactuals to efficiently explore the impact of hypothetical scenarios not covered by the original data in time-series forecasting tasks.
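CounterfacTS itself is released by the authors; the sketch below does not use its API. It only illustrates, with placeholder names (naive_forecast, probe_with_counterfactual) and a trivial last-value forecaster standing in for a trained deep learning model, the general idea of probing robustness by transforming a series and comparing forecast errors:

```python
import numpy as np

def naive_forecast(history, horizon):
    """Placeholder forecaster: repeat the last observed value.
    Stands in for any trained deep-learning forecasting model."""
    return np.full(horizon, history[-1])

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def probe_with_counterfactual(series, horizon, transform):
    """Compare forecast error on the original series vs. a transformed
    (counterfactual) copy to see how a distribution shift affects the model."""
    history, target = series[:-horizon], series[-horizon:]
    base_err = mae(target, naive_forecast(history, horizon))

    cf_series = transform(series)
    cf_history, cf_target = cf_series[:-horizon], cf_series[-horizon:]
    cf_err = mae(cf_target, naive_forecast(cf_history, horizon))
    return base_err, cf_err

# Example: a counterfactual that adds a linear trend (a simple concept drift).
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.1 * rng.standard_normal(200)
add_trend = lambda s: s + np.linspace(0, 2.0, len(s))

print(probe_with_counterfactual(series, horizon=24, transform=add_trend))
```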


Examining the Effect of Implementation Factors on Deep Learning Reproducibility

arXiv.org Artificial Intelligence

Reproducing published deep learning papers to validate their conclusions can be difficult due to sources of irreproducibility. We investigate the impact that implementation factors have on the results and how they affect the reproducibility of deep learning studies. Three deep learning experiments were run five times each on 13 different hardware environments and four different software environments. The analysis of the 780 combined results showed that hardware or software environment variations alone introduced an accuracy range of more than 6% on the same deterministic examples. To account for these implementation factors, researchers should run their experiments multiple times in different hardware and software environments to verify that their conclusions are not affected.
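A minimal sketch of the bookkeeping implied by the abstract: the 780 results are consistent with the product of experiments, repeats, and hardware and software environments, and the quantity of interest per deterministic experiment is the accuracy range across environments. The accuracy values below are made up for illustration:

```python
# The abstract's experiment count: 3 experiments, each run 5 times,
# across 13 hardware and 4 software environments.
experiments, repeats, hardware, software = 3, 5, 13, 4
print(experiments * repeats * hardware * software)  # 780 results

# Illustrative (made-up) accuracies for one deterministic experiment across
# environments: the quantity of interest is the spread, not the mean.
accuracies = [0.912, 0.931, 0.948, 0.906, 0.955, 0.970]
acc_range = max(accuracies) - min(accuracies)
print(f"accuracy range: {acc_range:.1%}")  # > 6% from environment variation alone
```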


REFORMS: Reporting Standards for Machine Learning Based Science

arXiv.org Artificial Intelligence

Machine learning (ML) methods are proliferating in scientific research. However, the adoption of these methods has been accompanied by failures of validity, reproducibility, and generalizability. These failures can hinder scientific progress, lead to false consensus around invalid claims, and undermine the credibility of ML-based science. ML methods are often applied and fail in similar ways across disciplines. Motivated by this observation, our goal is to provide clear reporting standards for ML-based science. Drawing from an extensive review of past literature, we present the REFORMS checklist (Reporting Standards For Machine Learning Based Science). It consists of 32 questions and a paired set of guidelines. REFORMS was developed based on a consensus of 19 researchers across computer science, data science, mathematics, social sciences, and biomedical sciences. REFORMS can serve as a resource for researchers when designing and implementing a study, for referees when reviewing papers, and for journals when enforcing standards for transparency and reproducibility.


Sources of Irreproducibility in Machine Learning: A Review

arXiv.org Artificial Intelligence

Background: Many published machine learning studies are irreproducible. Issues with methodology and failures to properly account for variation introduced by the algorithms themselves or their implementations are cited as the main contributors to the irreproducibility. Problem: There exists no theoretical framework that relates experiment design choices to their potential effects on the conclusions. Without such a framework, it is much harder for practitioners and researchers to evaluate experiment results and describe the limitations of experiments. The lack of such a framework also makes it harder for independent researchers to systematically attribute the causes of failed reproducibility experiments. Objective: The objective of this paper is to develop a framework that enables applied data science practitioners and researchers to understand which experiment design choices can lead to false findings, and thereby to help analyze the conclusions of reproducibility experiments. Method: We have compiled an extensive list of factors reported in the literature that can lead to machine learning studies being irreproducible. These factors are organized and categorized in a reproducibility framework motivated by the stages of the scientific method. The factors are analyzed for how they can affect the conclusions drawn from experiments. A model comparison study is used as an example. Conclusion: We provide a framework that describes machine learning methodology from experiment design decisions to the conclusions inferred from them.


The Fundamental Principles of Reproducibility

arXiv.org Artificial Intelligence

The terminology around reproducibility is confused. In this paper, I take a fundamental view on reproducibility rooted in the scientific method. The scientific method is analysed and characterised in order to develop the terminology required to define reproducibility. Further, the literature on reproducibility and replication is surveyed, and experiments are modeled as tasks and problem-solving methods. Machine learning is used to exemplify the described approach. Based on the analysis, reproducibility is defined, and three different types of reproducibility as well as four degrees of reproducibility are specified.


On Reproducible AI: Towards Reproducible Research, Open Science, and Digital Scholarship in AI Publications

AI Magazine

Background: Science is experiencing a reproducibility crisis. Artificial intelligence research is not an exception. Objective: To give practical and pragmatic recommendations for how to document AI research so that the results are reproducible. Method: Our analysis of the literature shows that AI publications fall short of providing enough documentation to facilitate reproducibility. Our suggested best practices are based on a framework for reproducibility and recommendations given for other disciplines. Results: We have made an author checklist based on our investigation and provided examples for how every item in the checklist can be documented. Conclusion: We encourage reviewers to use the suggested best practices and author checklist when reviewing submissions for AAAI publications and future AAAI conferences.


The 25th International Conference on Case-Based Reasoning

AI Magazine

Usually, a CBR process is composed of four steps, namely: retrieve (selection of one or several case(s) from the base); reuse (adaptation of the retrieved case(s) to solve the new problem); revise (presentation of the newly formed case to application domain experts and, as appropriate, corrections to it); and retain (addition of the revised case to the case base, if this addition is judged useful). CBR is an active field of research that is application- and theory-driven, and it relates to both machine learning and knowledge representation. ICCBR is not only an important venue for presenting CBR-related research; it is also an important [...]. Generous funding from NTNU, the Norwegian Research Council, and our other sponsors allowed the conference to cover all the meals for the attendees during the conference. Each day of the conference began with an invited talk. On the first day, Henri Prade presented an introduction to analogical proportions and analogical reasoning.
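A minimal sketch of the four-step CBR cycle described above, assuming a toy case representation and a trivial copy-based adaptation (the function names and the similarity measure are placeholders, not taken from any ICCBR system):

```python
def retrieve(case_base, problem, similarity):
    """Retrieve: select the most similar stored case."""
    return max(case_base, key=lambda case: similarity(case["problem"], problem))

def reuse(retrieved, problem):
    """Reuse: adapt the retrieved solution to the new problem.
    Here the adaptation is a trivial copy; real systems adapt the solution."""
    return {"problem": problem, "solution": retrieved["solution"]}

def revise(case, evaluate):
    """Revise: have a domain expert (or an evaluation function) confirm or correct the case."""
    case["confirmed"] = evaluate(case)
    return case

def retain(case_base, case):
    """Retain: keep the revised case only if it is judged useful."""
    if case["confirmed"]:
        case_base.append(case)
    return case_base

# Toy run with numeric problems and a distance-based similarity.
case_base = [{"problem": 2.0, "solution": "A"}, {"problem": 9.0, "solution": "B"}]
similarity = lambda a, b: -abs(a - b)
new_problem = 8.5

nearest = retrieve(case_base, new_problem, similarity)
candidate = reuse(nearest, new_problem)
revised = revise(candidate, evaluate=lambda c: True)
case_base = retain(case_base, revised)
print(revised["solution"], len(case_base))  # B 3
```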


State of the Art: Reproducibility in Artificial Intelligence

AAAI Conferences

Background: Research results in artificial intelligence (AI) are criticized for not being reproducible. Objective: To quantify the state of reproducibility of empirical AI research using six reproducibility metrics measuring three different degrees of reproducibility. Hypotheses: 1) AI research is not documented well enough to reproduce the reported results. 2) Documentation practices have improved over time. Method: The literature is reviewed, and a set of variables that should be documented to enable reproducibility is grouped into three factors: Experiment, Data and Method. The metrics describe how well the factors have been documented for a paper. A total of 400 research papers from the conference series IJCAI and AAAI have been surveyed using the metrics. Findings: None of the papers document all of the variables. The metrics show that between 20% and 30% of the variables for each factor are documented. One of the metrics shows a statistically significant increase over time, while the others show no change. Interpretation: The reproducibility scores decrease with increased documentation requirements. Improvement over time is found. Conclusion: Both hypotheses are supported.
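The exact variable lists and metric definitions are given in the paper; the sketch below only illustrates, with assumed variable names and a simple fraction-based score, what "between 20% and 30% of the variables for each factor are documented" means in practice:

```python
# Simplified illustration of factor-level documentation scores for one paper.
# The variable names and the scoring rule are assumptions made for illustration;
# the survey's actual variables and metrics are defined in the paper.
documented = {
    "Experiment": {"hypothesis": True, "prediction": False, "setup": False, "analysis": False},
    "Data":       {"training set": True, "validation set": False, "test set": False, "results": False},
    "Method":     {"pseudocode": False, "source code": True, "dependencies": False, "hyperparameters": False},
}

for factor, variables in documented.items():
    score = sum(variables.values()) / len(variables)
    print(f"{factor}: {score:.0%} of variables documented")
```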


A Real-Time Decision Support System for High Cost Oil-Well Drilling Operations

AI Magazine

In this article, we present DrillEdge, a commercial and award-winning software system that monitors oil-well drilling operations in order to reduce non-productive time (NPT). DrillEdge uses case-based reasoning with temporal representations on streaming real-time data, pattern matching, and agent systems to predict problems and give advice on how to mitigate them. The methods utilized, the architecture, the GUI, and the development cost are documented, along with two case studies.
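DrillEdge is a commercial system and its internals are not described here; the sketch below only illustrates, with placeholder names and a toy signal, the general flavor of matching a known temporal problem pattern against a sliding window of streaming sensor data:

```python
from collections import deque
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def monitor_stream(stream, pattern, threshold):
    """Slide a window over streaming sensor values and raise an alert
    whenever the recent window is close to a known problem pattern."""
    window = deque(maxlen=len(pattern))
    for t, value in enumerate(stream):
        window.append(value)
        if len(window) == len(pattern) and euclidean(window, pattern) < threshold:
            yield t  # time step at which the pattern was matched

# Toy example: a ramp-up pattern embedded in an otherwise flat signal.
pattern = [0.0, 0.5, 1.0, 1.5, 2.0]
stream = [0.0] * 20 + pattern + [0.0] * 20
print(list(monitor_stream(stream, pattern, threshold=0.3)))  # [24]
```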