imbalanced
- Europe > France (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- (3 more...)
- Information Technology > Security & Privacy (1.00)
- Banking & Finance (0.94)
Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation
Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase of publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data -- which is prevalent in many high-stakes domains -- has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available 1 privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques on an anonymized,real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress test both performance and fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.
CYCle: Choosing Your Collaborators Wisely to Enhance Collaborative Fairness in Decentralized Learning
Tastan, Nurbek, Horvath, Samuel, Nandakumar, Karthik
Collaborative learning (CL) enables multiple participants to jointly train machine learning (ML) models on decentralized data sources without raw data sharing. While the primary goal of CL is to maximize the expected accuracy gain for each participant, it is also important to ensure that the gains are fairly distributed. Specifically, no client should be negatively impacted by the collaboration, and the individual gains must ideally be commensurate with the contributions. Most existing CL algorithms require central coordination and focus on the gain maximization objective while ignoring collaborative fairness. In this work, we first show that the existing measure of collaborative fairness based on the correlation between accuracy values without and with collaboration has drawbacks because it does not account for negative collaboration gain. We argue that maximizing mean collaboration gain (MCG) while simultaneously minimizing the collaboration gain spread (CGS) is a fairer alternative. Next, we propose the CYCle protocol that enables individual participants in a private decentralized learning (PDL) framework to achieve this objective through a novel reputation scoring method based on gradient alignment between the local cross-entropy and distillation losses. Experiments on the CIFAR-10, CIFAR-100, and Fed-ISIC2019 datasets empirically demonstrate the effectiveness of the CYCle protocol to ensure positive and fair collaboration gain for all participants, even in cases where the data distributions of participants are highly skewed. For the simple mean estimation problem with two participants, we also theoretically show that CYCle performs better than standard FedAvg, especially when there is large statistical heterogeneity.
- North America > United States > Virginia (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > France (0.04)
- Asia > Singapore (0.04)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government (0.67)
- Law > Statutes (0.67)
- (2 more...)
Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation
Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase of publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data -- which is prevalent in many high-stakes domains -- has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available 1 privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques on an anonymized,real-world bank account opening fraud detection dataset.
Offline Clustering Approach to Self-supervised Learning for Class-imbalanced Image Data
Chang, Hye-min, Chang, Sungkyun
Class-imbalanced datasets are known to cause the problem of model being biased towards the majority classes. In this project, we set up two research questions: 1) when is the class-imbalance problem more prevalent in self-supervised pre-training? and 2) can offline clustering of feature representations help pre-training on class-imbalanced data? Our experiments investigate the former question by adjusting the degree of {\it class-imbalance} when training the baseline models, namely SimCLR and SimSiam on CIFAR-10 database. To answer the latter question, we train each expert model on each subset of the feature clusters. We then distill the knowledge of expert models into a single model, so that we will be able to compare the performance of this model to our baselines.
Your Dataset Is Imbalanced? Do Nothing!
"We have a problem: this dataset is imbalanced." After telling this, they usually start proposing some balancing techniques to "fix" the dataset, such as undersampling, oversampling, SMOTE, and whatnot. I think this is one of the most widespread misconceptions in the machine learning community. Class imbalance is not a problem! In this article, I will explain to you why.