Goto

Collaborating Authors

 dataset bias


DPA: AOne-stop Metric to Measure Bias Amplification in Classification Datasets

Neural Information Processing Systems

Most ML datasets today contain biases. When we train models on these datasets, they often not only learn these biases but can worsen them -- a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification in classification datasets. They measure bias amplification between a protected attribute (e.g., gender) and a task (e.g., cooking). These metrics also support fine-grained bias analysis by identifying the direction in which a model amplifies biases. However, co-occurrence-based metrics have limitations -- some fail to measure bias amplification in balanced datasets, while others fail to measure negative bias amplification.


ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts

Neural Information Processing Systems

Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping.






LearningDebiasedandDisentangledRepresentations forSemanticSegmentation

Neural Information Processing Systems

Despite such phenomenal achievement, semantic segmentation approaches still suffer from the chronic limitations caused byclass imbalance andstereotyped scene contextindatasets.


A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models

Neural Information Processing Systems

Despite the remarkable success of pre-trained language models (PLMs), they still face two challenges: First, large-scale PLMs are inefficient in terms of memory footprint and computation. Second, on the downstream tasks, PLMs tend to rely on the dataset bias and struggle to generalize to out-of-distribution (OOD) data. In response to the efficiency problem, recent studies show that dense PLMs can be replaced with sparse subnetworks without hurting the performance. Such subnetworks can be found in three scenarios: 1) the fine-tuned PLMs, 2) the raw PLMs and then fine-tuned in isolation, and even inside 3) PLMs without any parameter fine-tuning. However, these results are only obtained in the in-distribution (ID) setting.


Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias

Neural Information Processing Systems

Data-driven approaches to solving robotic tasks have gained a lot of traction in recent years. However, most existing policies are trained on large-scale datasets collected in curated lab settings. If we aim to deploy these models in unstructured visual environments like people's homes, they will be unable to cope with the mismatch in data distribution. In such light, we present the first systematic effort in collecting a large dataset for robotic grasping in homes. First, to scale and parallelize data collection, we built a low cost mobile manipulator assembled for under 3K USD.