Performance Analysis
Augmented cross-selling through explainable AI -- a case from energy retailing
Haag, Felix, Hopf, Konstantin, Vasconcelos, Pedro Menelau, Staake, Thorsten
The advance of Machine Learning (ML) has led to a strong interest in this technology to support decision making. While complex ML models provide predictions that are often more accurate than those of traditional tools, such models often hide the reasoning behind the prediction from their users, which can lead to lower adoption and lack of insight. Motivated by this tension, research has put forth Explainable Artificial Intelligence (XAI) techniques that uncover patterns discovered by ML. Despite the high hopes in both ML and XAI, there is little empirical evidence of the benefits to traditional businesses. To this end, we analyze data on 220,185 customers of an energy retailer, predict cross-purchases with up to 86% correctness (AUC), and show that the XAI method SHAP provides explanations that hold for actual buyers.
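The abstract reports predictive quality as AUC. As a reminder of what that figure measures, here is a minimal sketch that computes AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; they are not taken from the paper or its data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    `labels` holds 0/1 values; `scores` holds the model's predicted scores.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = customer made a cross-purchase, 0 = did not.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.3, 0.6, 0.7, 0.2, 0.8]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise form is quadratic in the number of examples, so it suits small illustrations only; a rank-based computation would be the usual choice at the scale of the dataset described above.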
An Empirical Analysis of the Efficacy of Different Sampling Techniques for Imbalanced Classification
Newaz, Asif, Hassan, Shahriar, Haq, Farhan Shahriyar
Learning from imbalanced data is a challenging task. Standard classification algorithms tend to perform poorly when trained on imbalanced data. Special strategies need to be adopted, either by modifying the data distribution or by redesigning the underlying classification algorithm, to achieve desirable performance. The prevalence of imbalance in real-world datasets has led to the creation of a multitude of strategies for the class imbalance issue. However, not all the strategies are useful or provide good performance in different imbalance scenarios. There are numerous approaches to dealing with imbalanced data, but their efficacy has not been systematically evaluated or compared experimentally. In this study, we present a comprehensive analysis of 26 popular sampling techniques to understand their effectiveness in dealing with imbalanced data. Rigorous experiments have been conducted on 50 datasets with different degrees of imbalance to thoroughly investigate the performance of these techniques. We present a detailed discussion of the advantages and limitations of the techniques, as well as how to overcome such limitations. We identify some critical factors that affect the sampling strategies and provide recommendations on how to choose an appropriate sampling technique for a particular application.
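To make "modifying the data distribution" concrete, the sketch below shows random oversampling, one of the simplest sampling strategies: minority-class rows are duplicated at random until every class matches the majority count. The abstract does not list the 26 techniques studied, so this example is purely illustrative and is not claimed to be one of them; the function name `random_oversample` and the toy data are assumptions.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are balanced.

    X is a list of feature rows, y the matching list of class labels.
    Returns new lists; the inputs are left unchanged.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the largest class

    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Indices of the original rows belonging to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - count):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): five majority rows, one minority row.
X_toy = [[0], [1], [2], [3], [4], [5]]
y_toy = [0, 0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(Counter(y_bal))
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class distribution.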
LPF-Defense: 3D Adversarial Defense based on Frequency Analysis
Naderi, Hanieh, Noorbakhsh, Kimia, Etemadi, Arian, Kasaei, Shohreh
Although 3D point cloud classification has recently been widely deployed in different application scenarios, it is still very vulnerable to adversarial attacks. This increases the importance of robust training of 3D models in the face of adversarial attacks. Based on our analysis of the performance of existing adversarial attacks, more adversarial perturbations are found in the mid- and high-frequency components of input data. Therefore, by suppressing the high-frequency content in the training phase, the model's robustness against adversarial examples is improved. Experiments showed that the proposed defense method decreases the success rate of six attacks on PointNet, PointNet++, and DGCNN models. In particular, improvements are achieved with an average increase of classification accuracy by 3.8% on the drop100 attack and 4.26% on the drop200 attack compared to the state-of-the-art methods. The method also improves the models' accuracy on the original dataset compared to other available methods.
Next-Year Bankruptcy Prediction from Textual Data: Benchmark and Baselines
Arno, Henri, Mulier, Klaas, Baeck, Joke, Demeester, Thomas
Models for bankruptcy prediction are useful in several real-world scenarios, and multiple research contributions have been devoted to the task, based on structured (numerical) as well as unstructured (textual) data. However, the lack of a common benchmark dataset and evaluation strategy impedes the objective comparison between models. This paper introduces such a benchmark for the unstructured data scenario, based on novel and established datasets, in order to stimulate further research into the task. We describe and evaluate several classical and neural baseline models, and discuss benefits and flaws of different strategies. In particular, we find that a lightweight bag-of-words model based on static in-domain word representations obtains surprisingly good results, especially when taking textual data from several years into account. These results are critically assessed, and discussed in light of particular aspects of the data and the task. All code to replicate the data and experimental results will be released.
Automatic detection of faults in race walking from a smartphone camera: a comparison of an Olympic medalist and university athletes
Suzuki, Tomohiro, Takeda, Kazuya, Fujii, Keisuke
Automatic fault detection is a major challenge in many sports. In race walking, referees visually judge faults according to the rules. Hence, ensuring objectivity and fairness while judging is important. To address this issue, some studies have attempted to use sensors and machine learning to automatically detect faults. However, there are problems associated with sensor attachments and equipment such as a high-speed camera, which conflict with the visual judgement of referees, and the interpretability of the fault detection models. In this study, we proposed a fault detection system for non-contact measurement. We used pose estimation and machine learning models trained based on the judgements of multiple qualified referees to realize fair fault judgement. We verified them using smartphone videos of normal race walking and walking with intentional faults in several athletes including the medalist of the Tokyo Olympics. The validation results show that the proposed system detected faults with an average accuracy of over 90%. We also revealed that the machine learning model detects faults according to the rules of race walking. In addition, the intentional faulty walking movement of the medalist was different from that of university walkers. This finding informs realization of a more general fault detection model. The code and data are available at https://github.com/SZucchini/racewalk-aijudge.
Pushing the limits of fairness impossibility: Who's the fairest of them all?
Hsu, Brian, Mazumder, Rahul, Nandy, Preetam, Basu, Kinjal
The impossibility theorem of fairness is a foundational result in the algorithmic fairness literature. It states that outside of special cases, one cannot exactly and simultaneously satisfy all three common and intuitive definitions of fairness - demographic parity, equalized odds, and predictive rate parity. This result has driven most works to focus on solutions for one or two of the metrics. Rather than follow suit, in this paper we present a framework that pushes the limits of the impossibility theorem in order to satisfy all three metrics to the best extent possible. We develop an integer-programming based approach that can yield a certifiably optimal post-processing method for simultaneously satisfying multiple fairness criteria under small violations. We show experiments demonstrating that our post-processor can improve fairness across the different definitions simultaneously with minimal model performance reduction. We also discuss applications of our framework for model selection and fairness explainability, thereby attempting to answer the question: who's the fairest of them all?
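The three fairness definitions named above can each be read as a gap between groups in a simple rate. The sketch below computes those gaps for a binary classifier and exactly two groups: the positive prediction rate (demographic parity), the true and false positive rates (equalized odds), and the positive predictive value (used here as a common simplification of predictive rate parity). It only measures violations; it is not the integer-programming post-processor described in the paper, and all function names and toy data are assumptions.

```python
def _rate(num, den):
    """Safe ratio; returns NaN when the denominator is zero."""
    return num / den if den else float("nan")


def group_rates(y_true, y_pred, group, g):
    """Confusion-matrix rates for the examples that belong to group g."""
    rows = [(t, p) for t, p, a in zip(y_true, y_pred, group) if a == g]
    tp = sum(1 for t, p in rows if t == 1 and p == 1)
    fp = sum(1 for t, p in rows if t == 0 and p == 1)
    fn = sum(1 for t, p in rows if t == 1 and p == 0)
    tn = sum(1 for t, p in rows if t == 0 and p == 0)
    return {
        "positive_rate": _rate(tp + fp, len(rows)),
        "tpr": _rate(tp, tp + fn),
        "fpr": _rate(fp, fp + tn),
        "ppv": _rate(tp, tp + fp),
    }


def fairness_gaps(y_true, y_pred, group):
    """Absolute between-group gaps; assumes exactly two distinct group values."""
    a, b = sorted(set(group))
    ra = group_rates(y_true, y_pred, group, a)
    rb = group_rates(y_true, y_pred, group, b)
    return {
        "demographic_parity": abs(ra["positive_rate"] - rb["positive_rate"]),
        "equalized_odds": max(abs(ra["tpr"] - rb["tpr"]),
                              abs(ra["fpr"] - rb["fpr"])),
        "predictive_rate_parity": abs(ra["ppv"] - rb["ppv"]),
    }


# Toy data (hypothetical): both groups have the same base rate and the same
# share of positive predictions, but the classifier is perfect only on "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(fairness_gaps(y_true, y_pred, group))
```

In this toy case the demographic parity gap is zero while the other two gaps are not, a small illustration of why satisfying one criterion does not imply the others.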
Towards Data-Efficient Detection Transformers
Wang, Wen, Zhang, Jing, Cao, Yang, Shen, Yongliang, Tao, Dacheng
Detection Transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show most of them suffer from significant performance drops on small-size datasets, like Cityscapes. In other words, the detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency, through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply alternating how key and value sequences are constructed in the cross-attention layer, with minimum modifications to the original models. Besides, we introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency. Experiments show that our method can be readily applied to different detection transformers and improve their performance on both small-size and sample-rich datasets. Code will be made publicly available at https://github.com/encounter1997/DE-DETRs.
Ontology-Driven Self-Supervision for Adverse Childhood Experiences Identification Using Social Media Datasets
Wu, Jinge, Smith, Rowena, Wu, Honghan
Adverse Childhood Experiences (ACEs) are defined as a collection of highly stressful, and potentially traumatic, events or circumstances that occur throughout childhood and/or adolescence. They have been shown to be associated with increased risks of mental health diseases or other abnormal behaviours later in life. However, the identification of ACEs from textual data with Natural Language Processing (NLP) is challenging because (a) there are no NLP-ready ACE ontologies; (b) there are few resources available for machine learning, necessitating data annotation by clinical experts; and (c) annotation by domain experts is costly, while large machine learning models require a large number of documents. In this paper, we present an ontology-driven self-supervised approach (deriving concept embeddings using an auto-encoder from baseline NLP results) for producing a publicly available resource that would support large-scale machine learning (e.g., training transformer-based large language models) on social media corpora. This resource, as well as the proposed approach, aims to help the community train transferable NLP models for effectively surfacing ACEs in low-resource scenarios like NLP on clinical notes within Electronic Health Records. The resource, including a list of ACE ontology terms, ACE concept embeddings and the NLP-annotated corpus, is available at https://github.com/knowlab/ACE-NLP.
Deeply Supervised Skin Lesions Diagnosis with Stage and Branch Attention
Dai, Wei, Liu, Rui, Wu, Tianyi, Wang, Min, Yin, Jianqin, Liu, Jun
Accurate and unbiased examinations of skin lesions are critical for the early diagnosis and treatment of skin diseases. Visual features of skin lesions vary significantly because the images are collected from patients with different lesion colours and morphologies by using dissimilar imaging equipment. Recent studies have reported that ensembled convolutional neural networks (CNNs) are practical for classifying the images for early diagnosis of skin disorders. However, the practical use of these ensembled CNNs is limited as these networks are heavyweight and inadequate for processing contextual information. Although lightweight networks (e.g., MobileNetV3 and EfficientNet) were developed to achieve parameter reduction for implementing deep neural networks on mobile devices, insufficient depth of feature representation restricts their performance. To address the existing limitations, we develop a new lightweight and effective neural network, namely HierAttn. HierAttn applies a novel deep supervision strategy to learn the local and global features by using multi-stage and multi-branch attention mechanisms with only one training loss. The efficacy of HierAttn was evaluated by using the dermoscopy image dataset ISIC2019 and the smartphone photo dataset PAD-UFES-20 (PAD2020). The experimental results show that HierAttn achieves the best accuracy and area under the curve (AUC) among the state-of-the-art lightweight networks. The code is available at https://github.com/anthonyweidai/HierAttn.
QU-BraTS: MICCAI BraTS 2020 Challenge on Quantifying Uncertainty in Brain Tumor Segmentation - Analysis of Ranking Scores and Benchmarking Results
Mehta, Raghav, Filos, Angelos, Baid, Ujjwal, Sako, Chiharu, McKinley, Richard, Rebsamen, Michael, Datwyler, Katrin, Meier, Raphael, Radojewski, Piotr, Murugesan, Gowtham Krishnan, Nalawade, Sahil, Ganesh, Chandan, Wagner, Ben, Yu, Fang F., Fei, Baowei, Madhuranthakam, Ananth J., Maldjian, Joseph A., Daza, Laura, Gomez, Catalina, Arbelaez, Pablo, Dai, Chengliang, Wang, Shuo, Reynaud, Hadrien, Mo, Yuan-han, Angelini, Elsa, Guo, Yike, Bai, Wenjia, Banerjee, Subhashis, Pei, Lin-min, AK, Murat, Rosas-Gonzalez, Sarahi, Zemmoura, Ilyess, Tauber, Clovis, Vu, Minh H., Nyholm, Tufve, Lofstedt, Tommy, Ballestar, Laura Mora, Vilaplana, Veronica, McHugh, Hugh, Talou, Gonzalo Maso, Wang, Alan, Patel, Jay, Chang, Ken, Hoebel, Katharina, Gidwani, Mishka, Arun, Nishanth, Gupta, Sharut, Aggarwal, Mehak, Singh, Praveer, Gerstner, Elizabeth R., Kalpathy-Cramer, Jayashree, Boutry, Nicolas, Huard, Alexis, Vidyaratne, Lasitha, Rahman, Md Monibor, Iftekharuddin, Khan M., Chazalon, Joseph, Puybareau, Elodie, Tochon, Guillaume, Ma, Jun, Cabezas, Mariano, Llado, Xavier, Oliver, Arnau, Valencia, Liliana, Valverde, Sergi, Amian, Mehdi, Soltaninejad, Mohammadreza, Myronenko, Andriy, Hatamizadeh, Ali, Feng, Xue, Dou, Quan, Tustison, Nicholas, Meyer, Craig, Shah, Nisarg A., Talbar, Sanjay, Weber, Marc-Andre, Mahajan, Abhishek, Jakab, Andras, Wiest, Roland, Fathallah-Shaykh, Hassan M., Nazeri, Arash, Milchenko, Mikhail, Marcus, Daniel, Kotrotsou, Aikaterini, Colen, Rivka, Freymann, John, Kirby, Justin, Davatzikos, Christos, Menze, Bjoern, Bakas, Spyridon, Gal, Yarin, Arbel, Tal
Deep learning (DL) models have provided state-of-the-art performance in various medical imaging benchmarking challenges, including the Brain Tumor Segmentation (BraTS) challenges. However, the task of focal pathology multi-compartment segmentation (e.g., tumor and lesion sub-regions) is particularly challenging, and potential errors hinder translating DL models into clinical workflows. Quantifying the reliability of DL model predictions in the form of uncertainties could enable clinical review of the most uncertain regions, thereby building trust and paving the way toward clinical translation. Several uncertainty estimation methods have recently been introduced for DL medical image segmentation tasks. Developing scores to evaluate and compare the performance of uncertainty measures will assist the end-user in making more informed decisions. In this study, we explore and evaluate a score developed during the BraTS 2019 and BraTS 2020 task on uncertainty quantification (QU-BraTS) and designed to assess and rank uncertainty estimates for brain tumor multi-compartment segmentation. This score (1) rewards uncertainty estimates that produce high confidence in correct assertions and those that assign low confidence levels at incorrect assertions, and (2) penalizes uncertainty measures that lead to a higher percentage of under-confident correct assertions. We further benchmark the segmentation uncertainties generated by 14 independent participating teams of QU-BraTS 2020, all of which also participated in the main BraTS segmentation task. Overall, our findings confirm the importance and complementary value that uncertainty estimates provide to segmentation algorithms, highlighting the need for uncertainty quantification in medical image analyses.