Assessing reliability of explanations in unbalanced datasets: a use-case on the occurrence of frost events
Vascotto, Ilaria, Blasone, Valentina, Rodriguez, Alex, Bonaita, Alessandro, Bortolussi, Luca
The usage of eXplainable Artificial Intelligence (XAI) methods has become essential in practical applications, given the increasing deployment of Artificial Intelligence (AI) models and the legislative requirements put forward in recent years. A fundamental but often underestimated property of explanations is their robustness, which should be satisfied in order to trust them. In this study, we provide preliminary insights on evaluating the reliability of explanations in the specific case of unbalanced datasets, which are very frequent in high-risk use-cases but at the same time considerably challenging for both AI models and XAI methods. We propose a simple evaluation focused on the minority class (i.e. the less frequent one) that leverages on-manifold generation of neighbours, explanation aggregation, and a metric to test explanation consistency. We present a use-case based on a tabular dataset with numerical features, focusing on the occurrence of frost events.
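The consistency evaluation described in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's method: it uses Gaussian perturbations as a stand-in for on-manifold neighbour generation, and cosine similarity between attribution vectors as the consistency metric; `explain_fn` and all parameter values are illustrative assumptions.

```python
import numpy as np

def neighbour_consistency(explain_fn, x, n_neighbours=50, scale=0.05, seed=0):
    """Consistency of an attribution method around x: mean cosine
    similarity between the explanation of x and the explanations of
    perturbed neighbours. Gaussian noise is a simple stand-in for
    the on-manifold neighbour generation the paper relies on."""
    rng = np.random.default_rng(seed)
    e_x = explain_fn(x)
    sims = []
    for _ in range(n_neighbours):
        x_p = x + rng.normal(0.0, scale, size=x.shape)
        e_p = explain_fn(x_p)
        sims.append(np.dot(e_x, e_p)
                    / (np.linalg.norm(e_x) * np.linalg.norm(e_p) + 1e-12))
    return float(np.mean(sims))

# Toy linear model whose attribution is gradient-times-input (w * x):
w = np.array([2.0, -1.0, 0.5])
explain = lambda x: w * x
score = neighbour_consistency(explain, np.array([1.0, 1.0, 1.0]))
```

A score near 1 means the explanation barely changes across neighbours; for minority-class points of an unbalanced dataset one would compute this per instance and aggregate.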
DP-TabICL: In-Context Learning with Differentially Private Tabular Data
Carey, Alycia N., Bhaila, Karuna, Edemacu, Kennedy, Wu, Xintao
In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks by conditioning on demonstrations of question-answer pairs and it has been shown to have comparable performance to costly model retraining and fine-tuning. Recently, ICL has been extended to allow tabular data to be used as demonstration examples by serializing individual records into natural language formats. However, it has been shown that LLMs can leak information contained in prompts, and since tabular data often contain sensitive information, understanding how to protect the underlying tabular data used in ICL is a critical area of research. This work serves as an initial investigation into how to use differential privacy (DP) -- the long-established gold standard for data privacy and anonymization -- to protect tabular data used in ICL. Specifically, we investigate the application of DP mechanisms for private tabular ICL via data privatization prior to serialization and prompting. We formulate two private ICL frameworks with provable privacy guarantees in both the local (LDP-TabICL) and global (GDP-TabICL) DP scenarios via injecting noise into individual records or group statistics, respectively. We evaluate our DP-based frameworks on eight real-world tabular datasets and across multiple ICL and DP settings. Our evaluations show that DP-based ICL can protect the privacy of the underlying tabular data while achieving comparable performance to non-LLM baselines, especially under high privacy regimes.
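The record-privatization step described in the DP-TabICL abstract (noise injection before serialization, in the local-DP setting) might look like the following sketch. Function names, feature bounds, and the serialization template are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def laplace_privatize(record, lower, upper, epsilon, rng=None):
    """Local-DP privatization of one numeric record: clip each feature
    to [lower, upper], then add Laplace noise whose scale is calibrated
    to the per-feature sensitivity (upper - lower) / epsilon."""
    rng = rng or np.random.default_rng(0)
    x = np.clip(np.asarray(record, dtype=float), lower, upper)
    scale = (np.asarray(upper) - np.asarray(lower)) / epsilon
    return x + rng.laplace(0.0, scale, size=x.shape)

def serialize(values, names):
    """Turn a privatized record into a natural-language demonstration."""
    return ", ".join(f"{n} is {v:.2f}" for n, v in zip(names, values))

# illustrative record with two numeric features
priv = laplace_privatize([35.0, 52000.0],
                         lower=np.array([18.0, 0.0]),
                         upper=np.array([90.0, 100000.0]),
                         epsilon=1.0)
prompt = serialize(priv, ["age", "income"])
```

Only the privatized text ever reaches the LLM prompt, so the guarantee holds regardless of what the model does downstream; smaller `epsilon` means stronger privacy and noisier demonstrations.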
Performance Analysis of Support Vector Machine (SVM) on Challenging Datasets for Forest Fire Detection
Kar, Ankan, Nath, Nirjhar, Kemprai, Utpalraj, Aman, null
This article delves into the analysis of performance and utilization of Support Vector Machines (SVMs) for the critical task of forest fire detection using image datasets. With the increasing threat of forest fires to ecosystems and human settlements, the need for rapid and accurate detection systems is of utmost importance. SVMs, renowned for their strong classification capabilities, exhibit proficiency in recognizing patterns associated with fire within images. By training on labeled data, SVMs acquire the ability to identify distinctive attributes associated with fire, such as flames, smoke, or alterations in the visual characteristics of the forest area. The document thoroughly examines the use of SVMs, covering crucial elements like data preprocessing, feature extraction, and model training. It rigorously evaluates parameters such as accuracy, efficiency, and practical applicability. The knowledge gained from this study aids in the development of efficient forest fire detection systems, enabling prompt responses and improving disaster management. Moreover, the correlation between SVM accuracy and the difficulties presented by high-dimensional datasets is carefully investigated, demonstrated through a revealing case study. The relationship between accuracy scores and the different resolutions used for resizing the training datasets has also been discussed in this article. These comprehensive studies result in a definitive overview of the difficulties faced and the potential sectors requiring further improvement and focus.
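The SVM training pipeline discussed above (resize, flatten, fit, score) can be sketched with scikit-learn. Synthetic arrays stand in for the labeled fire/no-fire image patches, and the resize resolution `res` is the knob the article's resolution study varies; all data and parameter values here are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for resized RGB patches: "fire" patches are
# biased toward brighter channel values, "no fire" patches are not.
rng = np.random.default_rng(0)
res = 16                                  # resize resolution (res x res)
n = 200
X_fire = rng.normal(0.6, 0.2, (n, res * res * 3))
X_none = rng.normal(0.4, 0.2, (n, res * res * 3))
X = np.vstack([X_fire, X_none])
y = np.array([1] * n + [0] * n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Scaling matters for RBF kernels on high-dimensional pixel features.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Repeating this with different `res` values reproduces the kind of accuracy-vs-resolution curve the article discusses, since `res` directly controls the dimensionality the SVM must cope with.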
Alleviating the Effect of Data Imbalance on Adversarial Training
Li, Guanlin, Xu, Guowen, Zhang, Tianwei
In this paper, we study adversarial training on datasets that obey the long-tailed distribution, which is practical but rarely explored in previous works. Compared with conventional adversarial training on balanced datasets, this process falls into the dilemma of generating uneven adversarial examples (AEs) and an unbalanced feature embedding space, causing the resulting model to exhibit low robustness and accuracy on tail data. To address this, we theoretically analyze the lower bound of the robust risk when training a model on a long-tailed dataset, identifying the key challenges behind the aforementioned dilemma. Based on this analysis, we propose a new adversarial training framework -- Re-balancing Adversarial Training (REAT). This framework consists of two components: (1) a new training strategy inspired by the effective number of samples to guide the model to generate more balanced and informative AEs; (2) a carefully constructed penalty function to enforce a satisfactory feature space. Evaluation results on different datasets and model structures prove that REAT can effectively enhance the model's robustness and preserve the model's clean accuracy. The code can be found at https://github.com/GuanlinLee/REAT.
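The "effective number" the REAT abstract refers to is commonly computed as E_n = (1 - beta^n) / (1 - beta) per class (Cui et al., "Class-Balanced Loss"). The sketch below shows how such class weights might be derived; the normalisation choice and beta value are illustrative assumptions, not REAT's exact recipe.

```python
import numpy as np

def effective_number_weights(counts, beta=0.999):
    """Class weights from the effective number of samples,
    E_n = (1 - beta**n) / (1 - beta). Rarer (tail) classes get
    larger weights; weights are normalised so they sum to the
    number of classes."""
    counts = np.asarray(counts, dtype=float)
    eff_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    w = 1.0 / eff_num
    return w * len(counts) / w.sum()

# long-tailed toy distribution: head class 5000 samples, tail class 50
weights = effective_number_weights([5000, 500, 50])
```

Such weights could then rebalance either the loss or the per-class budget of adversarial examples, which is the role the effective number plays in REAT's first component.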
Credit Card Fraud Detection
The dataset, available on Kaggle, originates from European credit card companies. It contains financial transactions over a two-day period, in which 492 frauds were detected among nearly 290,000 transactions. As we can already notice, this is an unbalanced dataset, where fraud accounts for only 0.17% of the total. Another detail is that the features are all numerical and have been anonymized (for privacy and security reasons). The original data page reports that the variables were transformed using Principal Component Analysis (PCA).
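To make the imbalance described above concrete: the sketch below checks the ~0.17% fraud rate and shows one standard countermeasure, class reweighting, on a tiny synthetic stand-in (the real dataset's PCA features are not used here; all data and parameters are illustrative assumptions).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Figures as reported for the Kaggle dataset (approximate total):
frauds, total = 492, 290_000
rate = frauds / total            # ~0.0017, i.e. about 0.17%

# Tiny synthetic stand-in with a heavy skew; class_weight="balanced"
# reweights each class inversely to its frequency during fitting.
rng = np.random.default_rng(0)
n_pos = 20
X_neg = rng.normal(0.0, 1.0, (2000, 4))
X_pos = rng.normal(1.5, 1.0, (n_pos, 4))
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 2000 + [1] * n_pos)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
recall = float((clf.predict(X_pos) == 1).mean())  # minority-class recall
```

Without reweighting, a classifier on data this skewed can reach ~99.8% accuracy by always predicting "not fraud", which is why minority-class recall, not accuracy, is the metric to watch.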
Neural Network Classifier as Mutual Information Evaluator
Qin, Zhenyue, Kim, Dongwoo, Gedeon, Tom
Cross-entropy loss with softmax output is a standard choice to train neural network classifiers. We give a new view of neural network classifiers with softmax and cross-entropy as mutual information evaluators. We show that when the dataset is balanced, training a neural network with cross-entropy maximises the mutual information between inputs and labels through a variational form of mutual information. Thereby, we develop a new form of softmax that also converts a classifier to a mutual information evaluator when the dataset is imbalanced. Experimental results show that the new form leads to better classification accuracy, in particular for imbalanced datasets.
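The abstract above does not spell out its new softmax form, so the sketch below shows a related, widely used variant rather than the paper's: adjusting logits by the log class priors (as in balanced/logit-adjusted softmax), which counteracts the bias toward frequent classes on imbalanced data. Treat the formulation as an assumption for illustration only.

```python
import numpy as np

def prior_adjusted_softmax(logits, class_counts):
    """Softmax with log-prior adjustment: adding log p(y) to the
    logits during training counteracts the head-class bias learned
    on imbalanced data (a common variant; the paper's exact
    formulation may differ)."""
    priors = np.asarray(class_counts, dtype=float) / np.sum(class_counts)
    z = logits + np.log(priors)
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# With equal logits, the adjusted probabilities reduce to the priors:
p = prior_adjusted_softmax(np.array([2.0, 2.0]), [900, 100])
```

At test time one would drop the adjustment (plain softmax), so the model's logits themselves must compensate for the imbalance it saw during training.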
Not All Mistakes Are Created Equal: Cost-sensitive Learning
In classification problems, we often assume that every misclassification is equally bad. Consider the example of trying to classify whether or not there is a terrorist threat. There are two types of misclassifications: either we predict there is a threat but there is actually no threat (false positive), or we predict there is no threat but there actually is a threat (false negative). Clearly the false negative is much more dangerous than the false positive -- we might end up wasting time and money in the false positive case, but people might die in the false negative case. We call classification problems like this cost-sensitive.
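The threat example above can be made precise with a cost matrix and a minimum-expected-cost decision rule. The cost values here are illustrative assumptions (a false negative taken as 100x worse than a false positive).

```python
import numpy as np

# cost[i][j] = cost of predicting class j when the truth is class i
# classes: 0 = no threat, 1 = threat
cost = np.array([[0.0,   1.0],    # truth: no threat  (FP costs 1)
                 [100.0, 0.0]])   # truth: threat     (FN costs 100)

def min_cost_decision(p_threat, cost):
    """Pick the prediction minimising expected cost under the model's
    posterior [P(no threat), P(threat)]."""
    posterior = np.array([1.0 - p_threat, p_threat])
    expected = posterior @ cost       # expected cost of each prediction
    return int(np.argmin(expected))

# Even a 5% threat probability triggers an alert under these costs
# (expected cost 0.95 for alerting vs 5.0 for not alerting), while a
# plain argmax over the posterior would stay silent.
alert = min_cost_decision(0.05, cost)
plain = int(np.argmax([0.95, 0.05]))
```

The decision threshold follows directly from the cost ratio: alert whenever P(threat) exceeds 1/(1 + 100), i.e. just under 1%, rather than the symmetric 50% of cost-insensitive classification.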