On the calibration of Just-in-time Defect Prediction

Shahini, Xhulja, Bartel, Jone, Pohl, Klaus

Apr-17-2025–arXiv.org Artificial Intelligence

Just-in-time defect prediction (JIT DP) leverages machine learning to identify defect-prone code commits, enabling quality assurance (QA) teams to allocate resources more efficiently by focusing on the commits most likely to contain defects. Although JIT defect prediction techniques have notably improved predictive accuracy, they remain susceptible to misclassification errors, i.e., false positives and false negatives. False positives waste resources and false negatives leave defects undetected, a particularly critical concern when QA resources are limited. To mitigate these challenges and preserve the practical utility of JIT defect prediction tools, it becomes essential to estimate the reliability of the predictions, i.e., to compute confidence scores. Such scores can help practitioners determine the trustworthiness of predictions and thus prioritize them efficiently. A simple approach to computing confidence scores is to extract, alongside each prediction, the corresponding prediction probabilities and use them as indicators of confidence. However, for these probabilities to reliably serve as confidence scores, the predictive model must be well-calibrated. This means that the prediction probabilities must accurately represent the true likelihood of each prediction being correct. Miscalibration, common in modern machine learning models, distorts probability scores such that the model's prediction probabilities do not align with the actual probability of those predictions being correct. Despite its importance, model calibration has been largely overlooked in JIT defect prediction. In this study, we evaluate the calibration of several state-of-the-art JIT defect prediction techniques to determine whether and to what extent they exhibit poor calibration. Furthermore, we assess whether post-calibration methods can improve the calibration of existing JIT defect prediction models.
Our experimental analysis reveals that all evaluated JIT DP models exhibit some level of miscalibration, with Expected Calibration Error (ECE) ranging from 2% to 35%. Furthermore, post-calibration methods do not consistently improve the calibration of these JIT DP models.

In recent years, just-in-time defect prediction (JIT DP) has emerged as a valuable machine learning (ML)-based technique, designed to predict whether a code commit is defect-prone or clean. By identifying code commits that are more likely to contain defects, JIT defect prediction helps quality assurance (QA) practitioners decide whether to perform targeted inspections and code reviews, as well as where and how to allocate testing efforts and resources [3], [4]. By supporting the prioritization of the code commits for further investigation and testing, JIT defect prediction models enable the timely identification of defects in the codebase. JIT defect prediction thus provides a means to optimize QA workflows.
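
To make the Expected Calibration Error (ECE) figures quoted in the abstract concrete, the following is a minimal sketch of the commonly used equal-width binned ECE. It is not taken from the study: the function name `expected_calibration_error`, the default of 10 bins, and the half-open bin boundaries are assumptions made for illustration, and the evaluation behind the reported numbers may be configured differently. Each confidence is assumed to be the probability the model assigns to the class it predicted.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE (illustrative sketch, not the paper's code).

    confidences: probability assigned to the predicted class, each in [0, 1].
    correct:     booleans, True where the prediction matched the label.
    n_bins:      number of equal-width bins over [0, 1] (assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Group predictions by confidence; bins are [k/n_bins, (k+1)/n_bins),
    # with a confidence of exactly 1.0 placed in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    # ECE is the sample-weighted mean of |accuracy - mean confidence| per bin.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

Under this definition, a model whose predictions at 75% confidence are correct 75% of the time contributes nothing to the ECE, whereas a model that reports 90% confidence but is correct only 75% of the time contributes a gap of 0.15, weighted by the share of predictions falling in that bin.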

artificial intelligence, machine learning, prediction

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.89)

Industry:
- Health & Medicine (0.67)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
