Accuracy
Two-step counterfactual generation for OOD examples
Keshtmand, Nawid, Santos-Rodriguez, Raul, Lawry, Jonathan
However, they still make erroneous predictions when exposed to inputs from an unfamiliar distribution. This poses a significant obstacle to the deployment of ML models in safety-critical applications such as healthcare and autonomous vehicles. Consequently, for applications in these domains, two fundamental requirements for the deployment of ML models are: 1) being able to identify data that is from a different distribution from the data on which the model was trained, which is referred to as out-of-distribution (OOD) detection, outlier detection, or anomaly detection [30]; 2) being able to explain the prediction of the model [24]. There has been significant work on improving the accuracy of OOD detectors, although there has not been much work on explaining why a data point is OOD [20]. As OOD detection algorithms are increasingly used in safety-critical domains, providing explanations for high-stakes decisions has become an ethical and regulatory requirement [26]. Therefore, it is important to develop methods that provide both accurate OOD scores and an explanation of why specific data points are detected as OOD. OOD detection can be considered a binary classification problem, where a data point can belong either to the in-distribution (ID) class or to the OOD class [4]. Additionally, there are different versions of the OOD detection problem, which are referred to as near-OOD and far-OOD detection [23, 29]. OOD data points that have neither non-discriminative (class-irrelevant) nor discriminative (class-relevant) features are referred to as far-OOD data and are therefore very dissimilar to the ID data.
AIROGS: Artificial Intelligence for RObust Glaucoma Screening Challenge
de Vente, Coen, Vermeer, Koenraad A., Jaccard, Nicolas, Wang, He, Sun, Hongyi, Khader, Firas, Truhn, Daniel, Aimyshev, Temirgali, Zhanibekuly, Yerkebulan, Le, Tien-Dung, Galdran, Adrian, Ballester, Miguel Ángel González, Carneiro, Gustavo, G, Devika R, S, Hrishikesh P, Puthussery, Densen, Liu, Hong, Yang, Zekang, Kondo, Satoshi, Kasai, Satoshi, Wang, Edward, Durvasula, Ashritha, Heras, Jónathan, Zapata, Miguel Ángel, Araújo, Teresa, Aresta, Guilherme, Bogunović, Hrvoje, Arikan, Mustafa, Lee, Yeong Chan, Cho, Hyun Bin, Choi, Yoon Ho, Qayyum, Abdul, Razzak, Imran, van Ginneken, Bram, Lemij, Hans G., Sánchez, Clara I.
The early detection of glaucoma is essential in preventing visual impairment. Artificial intelligence (AI) can be used to analyze color fundus photographs (CFPs) in a cost-effective manner, making glaucoma screening more accessible. While AI models for glaucoma screening from CFPs have shown promising results in laboratory settings, their performance decreases significantly in real-world scenarios due to the presence of out-of-distribution and low-quality images. To address this issue, we propose the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) challenge. This challenge includes a large dataset of around 113,000 images from about 60,000 patients and 500 different screening centers, and encourages the development of algorithms that are robust to ungradable and unexpected input data. We evaluated solutions from 14 teams in this paper, and found that the best teams performed similarly to a set of 20 expert ophthalmologists and optometrists. The highest-scoring team achieved an area under the receiver operating characteristic curve of 0.99 (95% CI: 0.98-0.99) for detecting ungradable images on-the-fly. Additionally, many of the algorithms showed robust performance when tested on three other publicly available datasets. These results demonstrate the feasibility of robust AI-enabled glaucoma screening.
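As a rough illustration of how a figure such as "0.99 (95% CI: 0.98-0.99)" can be produced, the sketch below computes the area under the ROC curve from pairwise score comparisons and attaches a percentile bootstrap interval. The function names, the resampling scheme, and the toy data are assumptions made for illustration; the abstract does not say how the challenge organizers computed their interval.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Resamples containing a single class have no defined AUROC; redraw.
        if all(ys) or not any(ys):
            continue
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Toy data: 1 marks an ungradable image, scores are detector outputs.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)
toy_ci = bootstrap_ci(toy_labels, toy_scores, n_boot=200)
```

The pairwise formulation is quadratic in the number of examples, so it suits small evaluation sets; a rank-based computation would be the usual choice at the scale of the challenge dataset.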
DIWIFT: Discovering Instance-wise Influential Features for Tabular Data
Liu, Dugang, Cheng, Pengxiang, Zhu, Hong, Tang, Xing, Chen, Yanyu, Wang, Xiaoting, Pan, Weike, Ming, Zhong, He, Xiuqiang
Tabular data is one of the most common data storage formats behind many real-world web applications such as retail, banking, and e-commerce. The success of these web applications largely depends on the ability of the employed machine learning model to accurately distinguish influential features from all the predetermined features in tabular data. Intuitively, in practical business scenarios, different instances should correspond to different sets of influential features, and the set of influential features of the same instance may vary in different scenarios. However, most existing methods focus on global feature selection assuming that all instances have the same set of influential features, and few methods considering instance-wise feature selection ignore the variability of influential features in different scenarios. In this paper, we first introduce a new perspective based on the influence function for instance-wise feature selection, and give some corresponding theoretical insights, the core of which is to use the influence function as an indicator to measure the importance of an instance-wise feature. We then propose a new solution for discovering instance-wise influential features in tabular data (DIWIFT), where a self-attention network is used as a feature selection model and the value of the corresponding influence function is used as an optimization objective to guide the model. Benefiting from the advantage of the influence function, i.e., its computation does not depend on a specific architecture and can also take into account the data distribution in different scenarios, our DIWIFT has better flexibility and robustness. Finally, we conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness of our DIWIFT.
A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning Classifiers
Chen, Zhenpeng, Zhang, Jie M., Sarro, Federica, Harman, Mark
Software bias is an increasingly important operational concern for software engineers. We present a large-scale, comprehensive empirical study of 17 representative bias mitigation methods for Machine Learning (ML) classifiers, evaluated with 11 ML performance metrics (e.g., accuracy), 4 fairness metrics, and 20 types of fairness-performance trade-off assessment, applied to 8 widely-adopted software decision tasks. The empirical coverage is much more comprehensive, covering the largest numbers of bias mitigation methods, evaluation metrics, and fairness-performance trade-off measures compared to previous work on this important software property. We find that (1) the bias mitigation methods significantly decrease ML performance in 53% of the studied scenarios (ranging between 42% and 66% according to different ML performance metrics); (2) the bias mitigation methods significantly improve fairness measured by the 4 used metrics in 46% of all the scenarios (ranging between 24% and 59% according to different fairness metrics); (3) the bias mitigation methods even lead to a decrease in both fairness and ML performance in 25% of the scenarios; (4) the effectiveness of the bias mitigation methods depends on tasks, models, the choice of protected attributes, and the set of metrics used to assess fairness and ML performance; (5) there is no bias mitigation method that can achieve the best trade-off in all the scenarios. The best method that we find outperforms other methods in 30% of the scenarios. Researchers and practitioners need to choose the bias mitigation method best suited to their intended application scenario(s).
Hierarchical classification at multiple operating points
Many classification problems consider classes that form a hierarchy. Classifiers that are aware of this hierarchy may be able to make confident predictions at a coarse level despite being uncertain at the fine-grained level. While it is generally possible to vary the granularity of predictions using a threshold at inference time, most contemporary work considers only leaf-node prediction, and almost no prior work has compared methods at multiple operating points. We present an efficient algorithm to produce operating characteristic curves for any method that assigns a score to every class in the hierarchy. Applying this technique to evaluate existing methods reveals that top-down classifiers are dominated by a naive flat softmax classifier across the entire operating range. We further propose two novel loss functions and show that a soft variant of the structured hinge loss is able to significantly outperform the flat baseline. Finally, we investigate the poor accuracy of top-down classifiers and demonstrate that they perform relatively well on unseen classes. Code is available online at https://github.com/jvlmdr/hiercls.
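To make the idea of varying prediction granularity with a threshold concrete, the sketch below takes leaf probabilities from a flat softmax, sums them up a small hierarchy, and returns the deepest node whose probability meets the threshold. This is only a minimal illustration under assumed names and a made-up four-leaf hierarchy; it is not the paper's algorithm for producing operating characteristic curves, and the repository linked above should be consulted for the actual method.

```python
def node_probabilities(parent, leaf_probs):
    """Give every node the total probability of the leaves beneath it."""
    probs = {}
    for leaf, p in leaf_probs.items():
        node = leaf
        while node is not None:
            probs[node] = probs.get(node, 0.0) + p
            node = parent.get(node)
    return probs


def depth(parent, node):
    """Number of edges between a node and the root."""
    d = 0
    while parent.get(node) is not None:
        node = parent[node]
        d += 1
    return d


def predict(parent, leaf_probs, threshold):
    """Return the deepest node whose probability is at least `threshold`."""
    probs = node_probabilities(parent, leaf_probs)
    candidates = [n for n, p in probs.items() if p >= threshold]
    if not candidates:
        # Fall back to the most probable node (the root) if nothing qualifies.
        return max(probs, key=probs.get)
    # Prefer deeper nodes; break depth ties by probability.
    return max(candidates, key=lambda n: (depth(parent, n), probs[n]))


# Hypothetical hierarchy: each node maps to its parent, the root maps to None.
toy_parent = {
    "entity": None,
    "animal": "entity", "vehicle": "entity",
    "dog": "animal", "cat": "animal",
    "car": "vehicle", "truck": "vehicle",
}
toy_leaf_probs = {"dog": 0.5, "cat": 0.3, "car": 0.15, "truck": 0.05}

fine = predict(toy_parent, toy_leaf_probs, threshold=0.4)
coarse = predict(toy_parent, toy_leaf_probs, threshold=0.7)
root_only = predict(toy_parent, toy_leaf_probs, threshold=0.9)
```

Sweeping the threshold from 0 to 1 moves the prediction from specific leaves toward the root, which is the trade-off between specificity and correctness that an operating curve summarizes.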
Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models
Polak, Maciej P., Modi, Shrey, Latosinska, Anna, Zhang, Jinming, Wang, Ching-Wen, Wang, Shanonan, Hazra, Ayan Deep, Morgan, Dane
Accurate and comprehensive material databases extracted from research papers are critical for materials science and engineering but require significant human effort to develop. In this paper we present a simple method of extracting materials data from full texts of research papers suitable for quickly developing modest-sized databases. The method requires minimal to no coding, prior knowledge about the extracted property, or model training, and provides high recall and almost perfect precision in the resultant database. The method is fully automated except for one human-assisted step, which typically requires just a few hours of human labor. The method builds on top of natural language processing and large general language models but can work with almost any such model. The language models GPT-3/3.5, BART, and DeBERTaV3 are evaluated here for comparison. We provide a detailed analysis of the method's performance in extracting bulk modulus data, obtaining up to 90% precision at 96% recall, depending on the amount of human effort involved. We then demonstrate the method's broader effectiveness by developing a database of critical cooling rates for metallic glasses.
Liver Segmentation using Turbolift Learning for CT and Cone-beam C-arm Perfusion Imaging
Haseljić, Hana, Chatterjee, Soumick, Frysch, Robert, Kulvait, Vojtěch, Semshchikov, Vladimir, Hensen, Bennet, Wacker, Frank, Brüsch, Inga, Werncke, Thomas, Speck, Oliver, Nürnberger, Andreas, Rose, Georg
Computed Tomography (CT) perfusion or CTp imaging is a method that can be used for the diagnosis and treatment planning of liver tumours. C-arm cone-beam CT, referred to here in short as CBCT, on the other hand, can be advantageous during interventions as the acquisitions can be done without moving the patient due to the availability of CBCT as a part of the interventional suites (Orth et al., 2008). It has been shown that CBCT perfusion maps of the brain would not be inferior to the CT perfusion maps (Niu et al., 2016), and when CT perfusion scans are acquired soon enough, it could save the patient's life (Powers et al., 2019). C-arm CBCT perfusion (CBCTp) imaging of the liver could allow inspection and evaluation of [...] Potentially it might also serve for diagnosing liver diseases. The experimental C-arm CBCTp scanning protocol of the liver consists of multiple bidirectional rotations with pauses in between (Datta et al., 2017), which, combined with slow rotation, results in a very limited number of projections. A simplified approach would be to reconstruct every rotation separately, the straightforward approach, which can result in over- or underestimation of perfusion parameters (Haseljić et al., 2021). Recent publications have shown that model-based reconstruction and time separation technique (TST) could deal with poor temporal resolution (Montes and Lauritsch, 2009; Neukirchen et al., 2010; Manhart et al., 2013; Bannasch et al., 2018; Kulvait et al., 2022; Haseljić et al., 2021, 2022) and provide highly accurate liver perfusion maps.
Data Augmentation for Robust Character Detection in Fantasy Novels
Amalvy, Arthur, Labatut, Vincent, Dufour, Richard
Named Entity Recognition (NER) is a low-level task often used as a foundation for solving higher level NLP problems. In the context of character detection in novels, NER false negatives can be an issue as they possibly imply missing certain characters or relationships completely. In this article, we demonstrate that applying a straightforward data augmentation technique allows training a model achieving higher recall, at the cost of a certain amount of precision regarding ambiguous entities. We show that this decrease in precision can be mitigated by giving the model more local context, which resolves some of the ambiguities.
On Fairness and Stability: Is Estimator Variance a Friend or a Foe?
Khan, Falaah Arif, Herasymuk, Denys, Stoyanovich, Julia
The error of an estimator can be decomposed into a (statistical) bias term, a variance term, and an irreducible noise term. When we do bias analysis, formally we are asking the question: "how good are the predictions?" The role of bias in the error decomposition is clear: if we trust the labels/targets, then we would want the estimator to have as low bias as possible, in order to minimize error. Fair machine learning is concerned with the question: "Are the predictions equally good for different demographic/social groups?" This has naturally led to a variety of fairness metrics that compare some measure of statistical bias on subsets corresponding to socially privileged and socially disadvantaged groups. In this paper we propose a new family of performance measures based on group-wise parity in variance. We demonstrate when group-wise statistical bias analysis gives an incomplete picture, and what group-wise variance analysis can tell us in settings that differ in the magnitude of statistical bias. We develop and release an open-source library that reconciles uncertainty quantification techniques with fairness analysis, and use it to conduct an extensive empirical analysis of our variance-based fairness measures on standard benchmarks.
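One way to picture group-wise variance analysis is to train several models on resampled data, measure how much their predictions for each sample disagree, and average that disagreement within each demographic group. The sketch below assumes the predictions are already available as a list of per-model prediction lists; the function name, the use of population variance, and the ratio used to compare groups are illustrative assumptions, not the measures defined in the paper or the API of its library.

```python
from statistics import mean, pvariance


def groupwise_variance(predictions, groups):
    """Average per-sample prediction variance within each group.

    predictions[m][i] is model m's prediction for sample i (for example,
    models trained on different bootstrap resamples); groups[i] is the
    group label of sample i.
    """
    # Variance across models, computed separately for every sample.
    per_sample = [pvariance(column) for column in zip(*predictions)]
    return {
        g: mean(v for v, gi in zip(per_sample, groups) if gi == g)
        for g in set(groups)
    }


# Toy example: three models, four samples, two groups.
toy_predictions = [
    [0.0, 1.0, 0.2, 0.8],
    [0.0, 1.0, 0.6, 0.4],
    [0.0, 1.0, 1.0, 0.0],
]
toy_groups = ["a", "a", "b", "b"]
toy_variance = groupwise_variance(toy_predictions, toy_groups)
```

In this toy example the models agree perfectly on group "a" and disagree on group "b", so the two groups could show the same average error while differing sharply in how stable their predictions are.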
REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines
Abdelaal, Mohamed, Hammacher, Christian, Schoening, Harald
Nowadays, machine learning (ML) plays a vital role in many aspects of our daily life. In essence, building well-performing ML applications requires the provision of high-quality data throughout the entire life-cycle of such applications. Nevertheless, most real-world tabular data suffer from different types of discrepancies, such as missing values, outliers, duplicates, pattern violations, and inconsistencies. Such discrepancies typically emerge while collecting, transferring, storing, and/or integrating the data. To deal with these discrepancies, numerous data cleaning methods have been introduced. However, the majority of such methods broadly overlook the requirements imposed by downstream ML models. As a result, the potential of utilizing these data cleaning methods in ML pipelines is predominantly unrevealed. In this work, we introduce a comprehensive benchmark, called REIN, to thoroughly investigate the impact of data cleaning methods on various ML models. Through the benchmark, we provide answers to important research questions, e.g., where and whether data cleaning is a necessary step in ML pipelines. To this end, the benchmark examines 38 simple and advanced error detection and repair methods. To evaluate these methods, we utilized a wide collection of ML models trained on 14 publicly-available datasets covering different domains and encompassing realistic as well as synthetic error profiles.