AITopics

Technology:

Information Technology > Data Science > Data Quality (0.37)
Information Technology > Artificial Intelligence (0.37)

Neural Information Processing SystemsFeb-17-2026, 06:30:11 GMT

a7ce9b6a4db012cdaac28dd48989a17d-Paper-Datasets_and_Benchmarks_Track.pdf

data mining, machine learning, natural language, (21 more...)

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Switzerland > Basel-City > Basel (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(5 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area > Dermatology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Nuclear Medicine (0.67)
Information Technology (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
(7 more...)

Neural Information Processing SystemsFeb-15-2026, 17:19:50 GMT

886ed40d7882c9f891824e42a452c228-Paper-Conference.pdf

artificial intelligence, machine learning, natural language, (16 more...)

Country: Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Transportation (0.93)
Media > Film (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Neural Information Processing SystemsDec-26-2025, 07:03:18 GMT

Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data

Diagnosing and cleaning data is a crucial step for building robust machine learning systems. However, identifying problems within large-scale datasets with real-world distributions is challenging due to the presence of complex issues such as label errors, under-representation, and outliers. In this paper, we propose a unified approach for identifying the problematic data by utilizing a largely ignored source of information: a relational structure of data in the feature-embedded space. To this end, we present scalable and effective algorithms for detecting label errors and outlier data based on the relational graph structure of data. We further introduce a visualization tool that provides contextual information of a data point in the feature-embedded space, serving as an effective tool for interactively diagnosing data. We evaluate the label error and outlier/out-of-distribution (OOD) detection performances of our approach on the large-scale image, speech, and language domain tasks, including ImageNet, ESC-50, and SST2. Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in debugging large-scale real-world datasets across various domains.

identifying label noise, neural relation graph, unified framework, (8 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Pellegrino, Nicholas, Szczecina, David, Fieguth, Paul

Hard Samples, Bad Labels: Robust Loss Functions That Know When to Back Off

arXiv.org Artificial IntelligenceNov-27-2025

Incorrectly labelled training data are frustratingly ubiquitous in both benchmark and specially curated datasets. Such mislabelling clearly adversely affects the performance and generalizability of models trained through supervised learning on the associated datasets. Frameworks for detecting label errors typically require well-trained / well-generalized models; however, at the same time most frameworks rely on training these models on corrupt data, which clearly has the effect of reducing model generalizability and subsequent effectiveness in error detection -- unless a training scheme robust to label errors is employed. We evaluate two novel loss functions, Blurry Loss and Piecewise-zero Loss, that enhance robustness to label errors by de-weighting or disregarding difficult-to-classify samples, which are likely to be erroneous. These loss functions leverage the idea that mislabelled examples are typically more difficult to classify and should contribute less to the learning signal. Comprehensive experiments on a variety of artificially corrupted datasets demonstrate that the proposed loss functions outperform state-of-the-art robust loss functions in nearly all cases, achieving superior F1 scores for error detection. Further analyses through ablation studies offer insights to confirm these loss functions' broad applicability to cases of both uniform and non-uniform corruption, and with different label error detection frameworks. By using these robust loss functions, machine learning practitioners can more effectively identify, prune, or correct errors in their training data.

artificial intelligence, loss function, machine learning, (15 more...)

2511.16512

Country: North America > Canada (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Szczecina, David, Pellegrino, Nicholas, Fieguth, Paul

Pre-train to Gain: Robust Learning Without Clean Labels

arXiv.org Artificial IntelligenceNov-27-2025

Training deep networks with noisy labels leads to poor generalization and degraded accuracy due to overfitting to label noise. Existing approaches for learning with noisy labels often rely on the availability of a clean subset of data. By pre-training a feature extractor backbone without labels using self-supervised learning (SSL), followed by standard supervised training on the noisy dataset, we can train a more noise robust model without requiring a subset with clean labels. We evaluate the use of SimCLR and Barlow~Twins as SSL methods on CIFAR-10 and CIFAR-100 under synthetic and real world noise. Across all noise rates, self-supervised pre-training consistently improves classification accuracy and enhances downstream label-error detection (F1 and Balanced Accuracy). The performance gap widens as the noise rate increases, demonstrating improved robustness. Notably, our approach achieves comparable results to ImageNet pre-trained models at low noise levels, while substantially outperforming them under high noise conditions.

artificial intelligence, dataset, machine learning, (14 more...)

2511.20844

Country: North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)

Pellegrino, Nicholas, Szczecina, David, Fieguth, Paul W.

Effects of Initialization Biases on Deep Neural Network Training Dynamics

arXiv.org Artificial IntelligenceNov-27-2025

Untrained large neural networks, just after random initialization, tend to favour a small subset of classes, assigning high predicted probabilities to these few classes and approximately zero probability to all others. This bias, termed Initial Guessing Bias, affects the early training dynamics, when the model is fitting to the coarse structure of the data. The choice of loss function against which to train the model has a large impact on how these early dynamics play out. Two recent loss functions, Blurry and Piecewise-zero loss, were designed for robustness to label errors but can become unable to steer the direction of training when exposed to this initial bias. Results indicate that the choice of loss function has a dramatic effect on the early phase training of networks, and highlights the need for careful consideration of how Initial Guessing Bias may interact with various components of the training scheme.

artificial intelligence, loss function, machine learning, (16 more...)

2511.20826

Country: North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Neural Information Processing SystemsOct-10-2025, 12:33:48 GMT

Intrinsic Self-Supervision for Data Quality Audits

Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors, leading to inaccurate estimates of model performance.

data quality issue, dataset, duplicate, (14 more...)

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Switzerland > Basel-City > Basel (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(5 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area > Dermatology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Nuclear Medicine (0.67)
Information Technology (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
(7 more...)

Neural Information Processing SystemsOct-9-2025, 00:39:23 GMT

886ed40d7882c9f891824e42a452c228-Paper-Conference.pdf

artificial intelligence, machine learning, natural language, (16 more...)

Country: Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Transportation (0.93)
Media > Film (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Penquitt, Sarina, Riedlinger, Tobias, Heller, Timo, Reischl, Markus, Rottmann, Matthias

Learning to Detect Label Errors by Making Them: A Method for Segmentation and Object Detection Datasets

arXiv.org Artificial IntelligenceAug-26-2025

--Recently, detection of label errors and improvement of label quality in datasets for supervised learning tasks has become an increasingly important goal in both research and industry. The consequences of incorrectly annotated data include reduced model performance, biased benchmark results, and lower overall accuracy. Current state-of-the-art label error detection methods often focus on a single computer vision task and, consequently, a specific type of dataset, containing, for example, either bounding boxes or pixel-wise annotations. Furthermore, previous methods are not learning-based. In this work, we overcome this research gap. We present a unified method for detecting label errors in object detection, semantic segmentation, and instance segmentation datasets. In a nutshell, our approach - learning to detect label errors by making them - works as follows: we inject different kinds of label errors into the ground truth. Then, the detection of label errors, across all mentioned primary tasks, is framed as an instance segmentation problem based on a composite input. In our experiments, we compare the label error detection performance of our method with various baselines and state-of-the-art approaches of each task's domain on simulated label errors across multiple tasks, datasets, and base models. This is complemented by a generalization study on real-world label errors. Additionally, we release 459 real label errors identified in the Cityscapes dataset and provide a benchmark for real label error detection in Cityscapes. Deep learning thrives on data: the more complex the task, the more data is required. In computer vision, larger training datasets consistently improve model performance [1], driving demand for large-scale, high-quality annotations.

artificial intelligence, label error, machine learning, (15 more...)

2508.1793

Country: Europe > Germany (0.68)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)