annotation error
Appendix
We limit the target languages for this augmentation process to Arabic, Finnish, Japanese, Korean, Russian, Spanish, Swedish, Hebrew, Thai, Danish, French, Italian, Dutch, Polish, and Portuguese. Interestingly, just adding this language code effectively changes the outputs, as shown in Table 7. We further subsample 50% of the synthetically generated questions. During inference, we first retrieve the top 15 passages using mDPR, and then feed the questions and concatenated passages into the mGEN model, with language tags. The gray dots concentrated in the lower right part of the first figure represent encoded Thai embeddings.
EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI
Zuo, Longfei, Plank, Barbara, Peng, Siyao
High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework, VARIERR (Weber-Genzel et al., 2024), asks multiple annotators to explain their label decisions in the first round and flag errors via validity judgments in the second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields larger improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.
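The generate-then-validate idea can be sketched as below. The `generate` and `validate` callables are hypothetical stand-ins for the LLM prompts; the decision rule (a label counts as an error if no generated explanation for it is judged valid) is a simplifying assumption, not EVADE's exact criterion.

```python
def detect_errors(items, generate, validate):
    """For each NLI item, generate explanations for every annotated label,
    then keep a label only if the validator accepts at least one of its
    explanations; rejected labels are flagged as candidate errors."""
    errors = []
    for item in items:
        for label in item["labels"]:
            explanations = generate(item["premise"], item["hypothesis"], label)
            if not any(validate(item["premise"], item["hypothesis"], label, e)
                       for e in explanations):
                errors.append((item["id"], label))
    return errors

# Toy LLM stand-ins: the validator accepts explanations mentioning "entail".
gen = lambda p, h, lab: [f"{lab} because the premise entails it"
                         if lab == "entailment" else f"{lab} guess"]
val = lambda p, h, lab, e: "entail" in e
items = [{"id": 1, "premise": "p", "hypothesis": "h",
          "labels": ["entailment", "contradiction"]}]
flagged = detect_errors(items, gen, val)
```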
Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers
Seo, Wooseok, Han, Seungju, Jung, Jaehun, Newman, Benjamin, Lim, Seungwon, Lee, Seungbeen, Lu, Ximing, Choi, Yejin, Yu, Youngjae
Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that approximately 16% of ambiguous or incorrectly labeled data substantially influences model rankings. Neglecting this issue may result in misleading conclusions during comparative evaluations, and we suggest using a systematic pipeline utilizing LLM-as-a-judge to help identify these issues at scale. Second, we discover that frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance. We therefore recommend future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities in such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers
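An LLM-as-a-judge audit of the kind suggested above might look like the sketch below. The `judge` callable and the 0.5 threshold are illustrative assumptions; the paper's actual pipeline is not specified here.

```python
def audit_labels(examples, judge, threshold=0.5):
    """Flag benchmark items where an LLM judge's confidence that the gold
    label is correct falls below a threshold, marking them as candidates
    for being ambiguous or mislabeled."""
    flagged = []
    for ex in examples:
        confidence = judge(ex["claim"], ex["evidence"], ex["gold_label"])
        if confidence < threshold:
            flagged.append(ex["id"])
    return flagged

# Toy judge: confident only when the gold label is "supported".
judge = lambda claim, evidence, gold: 0.9 if gold == "supported" else 0.2
data = [{"id": "a", "claim": "c1", "evidence": "e1", "gold_label": "supported"},
        {"id": "b", "claim": "c2", "evidence": "e2", "gold_label": "refuted"}]
suspect = audit_labels(data, judge)
```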
RePOPE: Impact of Annotation Errors on the POPE Benchmark
Neuhaus, Yannic, Hein, Matthias
Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at https://github.com/YanNeu/RePOPE .
Common Ground, Diverse Roots: The Difficulty of Classifying Common Examples in Spanish Varieties
Lopetegui, Javier A., Riabi, Arij, Seddah, Djamé
Variations in languages across geographic regions or cultures are crucial to address to avoid biases in NLP systems designed for culturally sensitive tasks, such as hate speech detection or dialog with conversational agents. In languages such as Spanish, where varieties can significantly overlap, many examples can be valid across them, which we refer to as common examples. Ignoring these examples may cause misclassifications, reducing model accuracy and fairness. Therefore, accounting for these common examples is essential to improve the robustness and representativeness of NLP systems trained on such data. In this work, we address this problem in the context of Spanish varieties. We use training dynamics to automatically detect common examples or errors in existing Spanish datasets. We demonstrate the efficacy of using predicted label confidence in our Datamaps (Swayamdipta et al., 2020) implementation for the identification of hard-to-classify examples, especially common examples, enhancing model performance in variety identification tasks. Additionally, we introduce a Cuban Spanish Variety Identification dataset with common-example annotations, developed to facilitate more accurate detection of Cuban and Caribbean Spanish varieties. To our knowledge, this is the first dataset focused on identifying the Cuban, or any other Caribbean, Spanish variety.
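The training-dynamics idea can be sketched with Datamaps-style statistics: mean gold-label probability across epochs (confidence) and its standard deviation (variability), with low-confidence examples treated as hard to classify. This is a minimal sketch assuming per-epoch probabilities are already logged; the paper's variant based on predicted-label confidence would substitute those probabilities.

```python
from statistics import mean, pstdev

def datamap_stats(epoch_probs):
    """Given, per example, the probability assigned to the gold label at
    each training epoch, compute Datamaps-style confidence (mean) and
    variability (population std) across epochs."""
    return {ex_id: {"confidence": mean(probs), "variability": pstdev(probs)}
            for ex_id, probs in epoch_probs.items()}

def hard_examples(stats, max_confidence=0.5):
    """Low-confidence examples: candidates for common examples or errors."""
    return [ex for ex, s in stats.items() if s["confidence"] < max_confidence]

# Toy run: one easy example, one hard (potentially common) example.
probs = {"easy": [0.9, 0.95, 0.99], "common": [0.3, 0.4, 0.35]}
stats = datamap_stats(probs)
hard = hard_examples(stats)
```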
SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization
Tsuji, Kohei, Hiraoka, Tatsuya, Cheng, Yuchang, Iwakura, Tomoya
Various NLP tasks exploit pairs of raw text and annotation labels for training and evaluating models. In named entity recognition (NER), for example, which is applied to various practical technologies such as location detection (Inkpen et al., 2017) and anonymization (Mamede et al., 2016), some parts of the text are annotated as named entities (e.g., location names or personal names), and a model is then trained to extract these entities from the raw text. To achieve higher performance in NLP tasks, models should be trained or fine-tuned with a sophisticated training dataset free of annotation errors. Methods to weigh annotation errors have recently been studied in the NER field. Wang et al. (2019) proposed CrossWeigh, a method for detecting annotation errors in a dataset and adjusting their learning priority by weighting loss values so that training is not affected by such errors. However, it has shortcomings in computational efficiency, especially given recent NLP trends toward pre-trained large language models. We consider that more efficient methods of annotation weighing can speed up the development of NLP. In addition, reducing the computational cost contributes to Green AI (Schwartz et al., 2020).
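The loss-weighting scheme described for CrossWeigh can be sketched as below. The multiplicative down-weighting factor `epsilon` and the function names are illustrative; Wang et al. (2019) use a similar scheme where an example flagged in more cross-checking folds receives a smaller weight.

```python
def reliability_weights(flag_counts, epsilon=0.7):
    """Down-weight an example by a factor of epsilon for each fold in
    which it was flagged as a potential annotation error (epsilon and
    the exact scheme are illustrative assumptions)."""
    return [epsilon ** count for count in flag_counts]

def weighted_loss(losses, weights):
    """Training objective where per-example losses are scaled by
    reliability weights, so suspected annotation errors contribute less."""
    assert len(losses) == len(weights)
    return sum(l * w for l, w in zip(losses, weights)) / len(losses)

# Three examples flagged in 0, 1, and 2 folds respectively.
w = reliability_weights([0, 1, 2])
loss = weighted_loss([1.0, 1.0, 1.0], w)
```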
Annotation Errors and NER: A Study with OntoNotes 5.0
Bernier-Colborne, Gabriel, Vajjala, Sowmya
Named Entity Recognition (NER) is a well-studied problem in NLP. However, there is much less focus on studying NER datasets, compared to developing new NER models. In this paper, we employed three simple techniques to detect annotation errors in the OntoNotes 5.0 corpus for English NER, which is the largest available NER corpus for English. Our techniques corrected ~10% of the sentences in train/dev/test data. In terms of entity mentions, we corrected the span and/or type of ~8% of mentions in the dataset, while adding/deleting/splitting/merging a few more. These are large numbers of changes, considering the size of OntoNotes. We used three NER libraries to train, evaluate and compare the models trained with the original and the re-annotated datasets, which showed an average improvement of 1.23% in overall F-scores, with large (>10%) improvements for some of the entity types. While our annotation error detection methods are not exhaustive and there is some manual annotation effort involved, they are largely language agnostic and can be employed with other NER datasets, and other sequence labelling tasks.
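One generic error-detection heuristic in this spirit, not necessarily one of the paper's three techniques, is to surface strings annotated with more than one entity type across the corpus for manual review. A minimal sketch:

```python
from collections import defaultdict

def type_inconsistencies(mentions):
    """Hypothetical heuristic: collect surface strings that were annotated
    with more than one entity type anywhere in the corpus; these are
    candidates for annotation-error review, since some occurrences may
    be mislabeled (many, like 'Washington', are legitimately ambiguous)."""
    types_by_string = defaultdict(set)
    for text, ent_type in mentions:
        types_by_string[text].add(ent_type)
    return {text: sorted(types) for text, types in types_by_string.items()
            if len(types) > 1}

mentions = [("Washington", "GPE"), ("Washington", "PERSON"), ("Google", "ORG")]
suspects = type_inconsistencies(mentions)
```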
VariErr NLI: Separating Annotation Error from Human Label Variation
Weber-Genzel, Leon, Peng, Siyao, de Marneffe, Marie-Catherine, Plank, Barbara
Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white. To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs. VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.
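The core decision rule of the two-round setup can be sketched as below, under the simplifying assumption that a label counts as plausible variation if at least one of its explanations was judged valid, and as an annotation error otherwise.

```python
def split_error_from_variation(judgments):
    """Given validity judgments on (label, explanation) pairs for one NLI
    item, return (plausible labels, error labels): a label is plausible if
    any of its explanations was judged valid, an error otherwise."""
    valid_labels, all_labels = set(), set()
    for label, _explanation, is_valid in judgments:
        all_labels.add(label)
        if is_valid:
            valid_labels.add(label)
    return sorted(valid_labels), sorted(all_labels - valid_labels)

# Toy item: two labels backed by valid explanations, one by an invalid one.
judgments = [("entailment", "e1", True),
             ("neutral", "e2", True),
             ("contradiction", "e3", False)]
variation, errors = split_error_from_variation(judgments)
```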
Linguistically Conditioned Semantic Textual Similarity
Tu, Jingxuan, Xu, Keer, Yue, Liulu, Ye, Bingyang, Rim, Kyeongmin, Pustejovsky, James
Semantic textual similarity (STS) is a fundamental NLP task that measures the semantic similarity between a pair of sentences. In order to reduce the inherent ambiguity posed by the sentences, a recent task called Conditional STS (C-STS) has been proposed to measure the sentences' similarity conditioned on a certain aspect. Despite the popularity of C-STS, we find that the current C-STS dataset suffers from various issues that could impede proper evaluation on this task. In this paper, we reannotate the C-STS validation set and observe annotator discrepancies on 55% of the instances, resulting from annotation errors in the original labels, ill-defined conditions, and a lack of clarity in the task definition. After a thorough dataset analysis, we improve the C-STS task by leveraging the models' capability to understand the conditions under a QA task setting. With the generated answers, we present an automatic error identification pipeline that is able to identify annotation errors from the C-STS data with over 80% F1 score. We also propose a new method that largely improves the performance over baselines on the C-STS data by training the models with the answers. Finally, we discuss the conditionality annotation based on the typed-feature structure (TFS) of entity types. We show in examples that the TFS is able to provide a linguistic foundation for constructing C-STS data with new conditions.
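The answer-based error identification idea can be sketched as follows. The `answer` and `agree` callables are hypothetical stand-ins for the QA model and an answer-matching step, and the threshold on the 1-5 similarity scale is an illustrative assumption, not the paper's exact criterion.

```python
def flag_csts_errors(examples, answer, agree):
    """Hypothetical sketch: ask a QA model what each sentence says about
    the condition; if agreement between the two answers contradicts the
    annotated similarity, flag the instance as a candidate error."""
    flagged = []
    for ex in examples:
        a1 = answer(ex["sentence1"], ex["condition"])
        a2 = answer(ex["sentence2"], ex["condition"])
        answers_agree = agree(a1, a2)
        high_similarity = ex["label"] >= 4  # 1-5 Likert similarity scale
        if answers_agree != high_similarity:
            flagged.append(ex["id"])
    return flagged

# Toy run: both sentences give the same answer about color, yet the
# instance is labeled dissimilar, so it gets flagged.
examples = [{"id": 1, "sentence1": "red car", "sentence2": "red bus",
             "condition": "the color of the vehicle", "label": 1}]
answer = lambda sentence, condition: sentence.split()[0]  # toy QA model
agree = lambda a, b: a == b
flagged = flag_csts_errors(examples, answer, agree)
```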