 cleanlab




Towards Cross-Modal Error Detection with Tables and Images

arXiv.org Artificial Intelligence

Ensuring data quality at scale remains a persistent challenge for large organizations. Despite recent advances, maintaining accurate and consistent data is still complex, especially when dealing with multiple data modalities. Traditional error detection and correction methods tend to focus on a single modality, typically a table, and often miss cross-modal errors that are common in domains like e-Commerce and healthcare, where image, tabular, and text data co-exist. To address this gap, we take an initial step towards cross-modal error detection in tabular data, by benchmarking several methods. Our evaluation spans four datasets and five baseline approaches. Among them, Cleanlab, a label error detection framework, and DataScope, a data valuation method, perform the best when paired with a strong AutoML framework, achieving the highest F1 scores. Our findings indicate that current methods remain limited, particularly when applied to heavy-tailed real-world data, motivating further research in this area.
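As a minimal sketch of how an F1 score for error detection can be computed, assume a benchmark provides the set of row indices that truly contain errors and a detector returns the set of rows it flags. The function name and the toy indices below are hypothetical and are not taken from the paper.

```python
def detection_f1(true_errors, flagged):
    """F1 of an error detector, given true and flagged row indices."""
    true_errors, flagged = set(true_errors), set(flagged)
    true_positives = len(true_errors & flagged)
    if true_positives == 0:
        # No correct detections: precision and recall are both zero.
        return 0.0
    precision = true_positives / len(flagged)
    recall = true_positives / len(true_errors)
    return 2 * precision * recall / (precision + recall)


# Toy example: 4 rows truly contain errors, the detector flags 3 rows,
# and 2 of the flagged rows are real errors.
score = detection_f1(true_errors=[1, 4, 7, 9], flagged=[1, 4, 5])
print(round(score, 4))
```

Here precision is 2/3 and recall is 1/2, so F1 is 4/7.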



Supplement: Robustness to Label Noise Depends on the Shape of the Noise Distribution

Neural Information Processing Systems

Do the main claims made in the abstract and introduction accurately reflect the paper's ... Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Instructions are ... We will provide code after internal review for release. Did you specify all the training details (e.g., data splits, hyperparameters, how they ... Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Did you include the total amount of compute and the type of resources used (e.g., type ... Did you include any new assets either in the supplemental material or as a URL? [N/A] Did you discuss whether and how consent was obtained from people whose data you're ... If you used crowdsourcing or conducted research with human subjects... (a) Proofs provided for theoretical results of Section 3. A.1 Uniform noise. Proof. Lemma 3.2: Let c, ϵ, m ... Lemma 3.5: Let c, ϵ, m ... The proof of Theorem 3.6 is identical to that of Theorem 3.3 except using the value of ... Fig. S3 shows the same results as Fig. S3, but with the accuracy results of the vanilla (no label noise ... Fig. S4 compares the clean test accuracy on 10-class, 5-dimensional synthetic data of two label-noise ... Each of the methods is run with default parameters found in the corresponding repositories. All of our experiments utilize the ResNet-32 architecture across all mitigation methods.


Chatbot answers are all made up. This new tool helps you figure out which ones to trust.

MIT Technology Review

Cleanlab hopes that its tool will make large language models more attractive to businesses worried about how much stuff they invent. "I think people know LLMs will change the world, but they've just got hung up on the damn hallucinations," says Cleanlab CEO Curtis Northcutt. Chatbots are quickly becoming the dominant way people look up information on a computer. Search engines are being redesigned around the technology. Office software used by billions of people every day to create everything from school assignments to marketing copy to financial reports now comes with chatbots built in.


Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI

arXiv.org Artificial Intelligence

Characterizing samples that are difficult to learn from is crucial to developing highly performant ML models. This has led to numerous Hardness Characterization Methods (HCMs) that aim to identify "hard" samples. However, there is a lack of consensus regarding the definition and evaluation of "hardness". Unfortunately, current HCMs have only been evaluated on specific types of hardness and often only qualitatively or with respect to downstream performance, overlooking the fundamental quantitative identification task. We address this gap by presenting a fine-grained taxonomy of hardness types. Additionally, we propose the Hardness Characterization Analysis Toolkit (H-CAT), which supports comprehensive and quantitative benchmarking of HCMs across the hardness taxonomy and can easily be extended to new HCMs, hardness types, and datasets. We use H-CAT to evaluate 13 different HCMs across 8 hardness types. This comprehensive evaluation encompassing over 14K setups uncovers strengths and weaknesses of different HCMs, leading to practical tips to guide HCM selection and future development. Our findings highlight the need for more comprehensive HCM evaluation, while we hope our hardness taxonomy and toolkit will advance the principled evaluation and uptake of data-centric AI methods.


FPR Estimation for Fraud Detection in the Presence of Class-Conditional Label Noise

arXiv.org Artificial Intelligence

We consider the problem of estimating the false-/true-positive rate (FPR/TPR) for a binary classification model when there are incorrect labels (label noise) in the validation set. Our motivating application is fraud prevention where accurate estimates of FPR are critical to preserving the experience for good customers, and where label noise is highly asymmetric. Existing methods seek to minimize the total error in the cleaning process: to avoid cleaning examples that are not noise, and to ensure cleaning of examples that are. This is an important measure of accuracy but insufficient to guarantee good estimates of the true FPR or TPR for a model, and we show that using the model to directly clean its own validation data leads to underestimates even if total error is low. This indicates a need for researchers to pursue methods that not only reduce total error but also seek to de-correlate cleaning error with model scores.
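To illustrate why label noise distorts a measured FPR, here is a small sketch under simplifying assumptions that are not taken from the paper: the noise is one-directional (some true fraud is labeled "good"), and whether a label is wrong is independent of the model score given the true class. Under those assumptions the FPR measured on labeled negatives is a mixture of the true FPR and the true TPR. The function names and numbers are hypothetical, and this is not the estimator the paper proposes.

```python
def observed_fpr(true_fpr, true_tpr, frac_hidden_positives):
    """FPR measured on noisy labels.

    frac_hidden_positives is the share of labeled-negative examples that
    are actually positive. Those examples are flagged at the TPR, not the
    FPR, so the measured rate is a weighted mixture of the two.
    """
    a = frac_hidden_positives
    return (1.0 - a) * true_fpr + a * true_tpr


def corrected_fpr(observed, true_tpr, frac_hidden_positives):
    """Invert the mixture, assuming the TPR and noise share are known."""
    a = frac_hidden_positives
    return (observed - a * true_tpr) / (1.0 - a)


# Toy numbers: a 1% true FPR, an 80% TPR, and 2% of "good" labels hiding fraud.
measured = observed_fpr(true_fpr=0.01, true_tpr=0.80, frac_hidden_positives=0.02)
print(round(measured, 4))                             # measured FPR on noisy labels
print(round(corrected_fpr(measured, 0.80, 0.02), 4))  # recovers the true FPR
```

With these toy numbers the measured FPR is 0.0258, more than twice the true 0.01, even though only 2% of the negative labels are wrong. In practice neither the TPR nor the noise share is known exactly, which is part of what makes the estimation problem hard.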


On Evaluation of Document Classification using RVL-CDIP

arXiv.org Artificial Intelligence

The RVL-CDIP benchmark is widely used for measuring performance on the task of document classification. Despite its widespread use, we reveal several undesirable characteristics of the RVL-CDIP benchmark. These include (1) substantial amounts of label noise, which we estimate to be 8.1% (ranging from 1.6% to 16.9% per document category); (2) presence of many ambiguous or multi-label documents; (3) a large overlap between test and train splits, which can inflate model performance metrics; and (4) presence of sensitive personally-identifiable information like US Social Security numbers (SSNs). We argue that there is a risk in using RVL-CDIP for benchmarking document classifiers, as its limited scope, presence of errors (state-of-the-art models now achieve accuracy error rates that are within our estimated label error rate), and lack of diversity make it less than ideal for benchmarking. We further advocate for the creation of a new document classification benchmark, and provide recommendations for what characteristics such a resource should include.


Cleanlab: Correct your data labels automatically and quickly – Towards AI

#artificialintelligence

Originally published on Towards AI. I used an open-source library, cleanlab, to remove low-quality labels on an image dataset. The model trained on the dataset without low-quality data gained 4 percentage points of accuracy compared to the baseline model (trained on all data). Improving data quality sounds easy enough. But the workload of manually checking data quality can quickly become insurmountable as the dataset scales.
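The idea behind this kind of automated label filtering can be sketched in plain Python. What follows is a simplified, confident-learning-style heuristic, not cleanlab's actual implementation or API. It assumes each example has a given label and a vector of predicted class probabilities (ideally out-of-sample, for example from cross-validation). All function names and the toy data are hypothetical.

```python
def class_thresholds(labels, pred_probs, num_classes):
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    # A class with no labeled examples gets a threshold no example can beat
    # unless the model is fully certain.
    return [s / c if c else 1.0 for s, c in zip(sums, counts)]


def flag_suspect_labels(labels, pred_probs, num_classes):
    """Indices where the model is confident in some class but the given
    label is not among the confidently predicted classes."""
    thresholds = class_thresholds(labels, pred_probs, num_classes)
    suspects = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        confident = [k for k in range(num_classes) if probs[k] >= thresholds[k]]
        if confident and y not in confident:
            suspects.append(i)
    return suspects


# Toy data: example 2 is labeled 0, but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
    [0.2, 0.8], [0.1, 0.9], [0.3, 0.7],
]
print(flag_suspect_labels(labels, pred_probs, num_classes=2))
```

On the toy data only index 2 is flagged. Flagged examples are candidates for review or removal, and the quality of the flags depends on how well calibrated the predicted probabilities are.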