annotation task
- Asia > India (0.05)
- South America > Brazil (0.04)
- Africa > Ghana (0.04)
- (7 more...)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.67)
- Health & Medicine > Therapeutic Area (0.69)
- Information Technology (0.67)
- Government > Regional Government (0.67)
- Media > Photography (0.48)
Tracing How Annotators Think: Augmenting Preference Judgments with Reading Processes
de Langis, Karin, Walker, William, Le, Khanh Chi, Kang, Dongyeop
We propose an annotation approach that captures not only labels but also the reading process underlying annotators' decisions, e.g., which parts of the text they focus on, re-read, or skim. Using this framework, we conduct a case study on the preference annotation task, creating a dataset, PreferRead, that contains fine-grained annotator reading behaviors obtained from mouse tracking. PreferRead enables detailed analysis of how annotators navigate between a prompt and two candidate responses before selecting their preference. We find that annotators re-read a response in roughly half of all trials, most often revisiting the option they ultimately choose, and rarely revisit the prompt. Reading behaviors are also significantly related to annotation outcomes: re-reading is associated with higher inter-annotator agreement, whereas long reading paths and times are associated with lower agreement. These results demonstrate that reading processes provide a complementary cognitive dimension for understanding annotator reliability, decision-making, and disagreement in complex, subjective NLP tasks. Our code and data are publicly available.
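The inter-annotator agreement that the reading behaviors correlate with is commonly quantified with a chance-corrected statistic such as Cohen's kappa. The abstract does not name the agreement measure used, so the following is an illustrative sketch of kappa for two annotators' preference labels, not the paper's actual method.

```python
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # observed agreement: fraction of items both annotators label identically
    po = sum(x == y for x, y in zip(a, b)) / n
    # expected chance agreement from each annotator's marginal label frequencies
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return 1.0 if pe == 1.0 else (po - pe) / (1.0 - pe)
```

Perfect agreement yields 1.0, while agreement no better than chance yields 0.0; for example, two annotators whose preference choices coincide only as often as their marginal rates predict get a kappa of 0 despite 50% raw agreement.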
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Minnesota (0.04)
AI-Boosted Video Annotation: Assessing the Process Enhancement
Gutiérrez, Juan, Mora, Ángel, Regodón, Pablo, Rodriguez, Silvia, Blanco, José Luis
We explore the enhancement of Human-in-the-Loop video annotation by integrating automatic capabilities that ease the task for annotators and assess their performance. The research delves into the practical implications of the annotation processes, the integration of AI components, and the evaluation of its outcomes. We analyze their impact on efficiency, accuracy, and overall annotation quality. Focusing on Human-in-the-Loop video annotation tasks, we implemented a single-iteration scheme using Label Studio and AI-powered zero-shot pre-annotations. Using this framework, we designed a test based on annotating the UCF-Crime dataset to discriminate between normal and abnormal activities in video footage. Our results show how automatic AI-based pre-annotation can streamline the video annotation workflow, empowering human annotators and optimizing the overall pipeline. Using the pre-annotated data, we observed a 35% reduction in annotation time for 70% of the annotators, with annotation quality similar to that of the traditional manual annotation task. These results are consistent across asset durations and complexity levels. We also observed that annotators rapidly learned to use the tool, and that the resulting annotations are more coherent across annotators and better match the natural clustering of the video frames.
- Research Report > New Finding (0.66)
- Research Report > Experimental Study (0.46)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
MAFA: A Multi-Agent Framework for Enterprise-Scale Annotation with Configurable Task Adaptation
Hegazy, Mahmood, Rodrigues, Aaron, Naeem, Azzam
We present MAFA (Multi-Agent Framework for Annotation), a production-deployed system that transforms enterprise-scale annotation workflows through configurable multi-agent collaboration. Addressing the critical challenge of annotation backlogs in financial services, where millions of customer utterances require accurate categorization, MAFA combines specialized agents with structured reasoning and a judge-based consensus mechanism. Our framework uniquely supports dynamic task adaptation, allowing organizations to define custom annotation types (FAQs, intents, entities, or domain-specific categories) through configuration rather than code changes. Deployed at JP Morgan Chase, MAFA has eliminated a 1 million utterance backlog while achieving, on average, 86% agreement with human annotators, saving over 5,000 hours of manual annotation work annually. The system assigns each utterance an annotation confidence classification, which is typically high for 85%, medium for 10%, and low for 5% of utterances across all datasets we tested. This enables human annotators to focus exclusively on ambiguous and low-coverage cases. We demonstrate MAFA's effectiveness across multiple datasets and languages, showing consistent improvements over traditional and single-agent annotation baselines: 13.8% higher Top-1 accuracy, 15.1% improvement in Top-5 accuracy, and 16.9% better F1 on our internal intent classification dataset, with similar gains on public benchmarks.
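Confidence-based routing of the kind described, where human annotators see only the ambiguous cases, can be sketched as below. The threshold values are hypothetical assumptions: the abstract reports the resulting proportions (roughly 85/10/5) but not the cutoffs that produce them.

```python
def confidence_bucket(conf: float, hi: float = 0.9, mid: float = 0.6) -> str:
    """Map a model confidence score in [0, 1] to a triage bucket.

    Thresholds are illustrative, not taken from the paper.
    """
    if conf >= hi:
        return "high"      # auto-accept the agent consensus label
    if conf >= mid:
        return "medium"    # spot-check sample
    return "low"           # route to a human annotator

def route(annotated):
    """Split (utterance, label, conf) triples into queues per bucket."""
    queues = {"high": [], "medium": [], "low": []}
    for utterance, label, conf in annotated:
        queues[confidence_bucket(conf)].append((utterance, label))
    return queues
```

Under this scheme only the "low" (and optionally "medium") queues reach humans, matching the paper's claim that annotators focus exclusively on ambiguous and low-coverage cases.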
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
- North America > Dominican Republic (0.04)
- Asia > China > Hong Kong (0.04)
- Information Technology > Security & Privacy (1.00)
- Banking & Finance (1.00)
Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study
Kim, Kon Woo, Islamaj, Rezarta, Kim, Jin-Dong, Boudin, Florian, Aizawa, Akiko
This case study explores the potential of repurposing existing annotation guidelines to instruct a large language model (LLM) annotator in text annotation tasks. Traditional annotation projects invest significant resources, in both time and cost, in developing comprehensive annotation guidelines. These are primarily designed for human annotators, who undergo training sessions to check and correct their understanding of the guidelines. While human annotators internalize the results of this training, LLMs require the training content to be made explicit. Thus, we introduce a method called moderation-oriented guideline repurposing, which adapts annotation guidelines to provide clear and explicit instructions through a process called LLM moderation. Using the NCBI Disease Corpus and its detailed guidelines, our experimental results demonstrate that, despite several remaining challenges, repurposing the guidelines can effectively guide LLM annotators. Our findings highlight both the promising potential and the limitations of leveraging the proposed workflow in automated settings, offering a new direction for scalable, cost-effective refinement of annotation guidelines and the subsequent annotation process.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > France > Pays de la Loire > Loire-Atlantique > Nantes (0.04)
- (5 more...)
Quantifying Ambiguity in Categorical Annotations: A Measure and Statistical Inference Framework
Klugmann, Christopher, Kondermann, Daniel
Human-generated categorical annotations frequently produce empirical response distributions (soft labels) that reflect ambiguity rather than simple annotator error. We introduce an ambiguity measure that maps a discrete response distribution to a scalar in the unit interval, designed to quantify aleatoric uncertainty in categorical tasks. The measure bears a close relationship to quadratic entropy (Gini-style impurity) but departs from those indices by treating an explicit "can't solve" category asymmetrically, thereby separating uncertainty arising from class-level indistinguishability from uncertainty due to explicit unresolvability. We analyze the measure's formal properties and contrast its behavior with a representative ambiguity measure from the literature. Moving beyond description, we develop statistical tools for inference: we propose frequentist point estimators for population ambiguity and derive the Bayesian posterior over ambiguity induced by Dirichlet priors on the underlying probability vector, providing a principled account of epistemic uncertainty. Numerical examples illustrate estimation, calibration, and practical use for dataset-quality assessment and downstream machine-learning workflows.
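The quadratic entropy (Gini-style impurity) that the proposed measure relates to can be sketched directly. This is illustrative only: the paper's actual measure additionally treats an explicit "can't solve" category asymmetrically, which this sketch omits, and the unit-interval normalization shown here is an assumption.

```python
from typing import Sequence

def gini_impurity(p: Sequence[float]) -> float:
    """Quadratic entropy of a response distribution: 1 - sum of squared
    probabilities. Zero iff all mass is on one category."""
    return 1.0 - sum(pi * pi for pi in p)

def normalized_ambiguity(p: Sequence[float]) -> float:
    """Rescale impurity so the uniform distribution over k categories
    maps to 1.0, giving a scalar in the unit interval."""
    k = len(p)
    if k < 2:
        return 0.0
    return gini_impurity(p) * k / (k - 1)
```

For example, ten annotators splitting 6/3/1 over three classes give the soft label p = [0.6, 0.3, 0.1], with impurity 0.54 and normalized ambiguity 0.81, whereas a unanimous response yields 0.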
- Europe > Italy (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation
Baumann, Joachim, Röttger, Paul, Urman, Aleksandra, Wendsjö, Albert, Plaza-del-Arco, Flor Miriam, Gruber, Johannes B., Hovy, Dirk
Large language models are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection or prompting strategy). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors. We call this phenomenon, in which configuration choices lead to incorrect conclusions, LLM hacking. We find that intentional LLM hacking is strikingly simple. By replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant. Beyond intentional manipulation, our analysis of 13 million labels from 18 different LLMs across 2,361 realistic hypotheses shows that there is also a high risk of accidental LLM hacking, even when following standard research practices. We find incorrect conclusions in approximately 31% of hypotheses for state-of-the-art LLMs, and in half of the hypotheses for smaller language models. While higher task performance and stronger general model capabilities reduce LLM hacking risk, even highly accurate models remain susceptible. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of LLM-based findings near significance thresholds. We analyze 21 mitigation techniques and find that human annotations provide crucial protection against false positives. Common regression estimator correction techniques can restore valid inference but trade off Type I vs. Type II errors. We publish a list of practical recommendations to prevent LLM hacking.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Singapore (0.04)
- Asia > Indonesia > Bali (0.04)
- (22 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Media > News (1.00)
- Health & Medicine (1.00)
- Education (1.00)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
IntrEx: A Dataset for Modeling Engagement in Educational Conversations
Tan, Xingwei, Parvatham, Mahathi, Gambi, Chiara, Pergola, Gabriele
Engagement and motivation are crucial for second-language acquisition, yet maintaining learner interest in educational conversations remains a challenge. While prior research has explored what makes educational texts interesting, little is known about the linguistic features that drive engagement in conversations. To address this gap, we introduce IntrEx, the first large dataset annotated for interestingness and expected interestingness in teacher-student interactions. Built upon the Teacher-Student Chatroom Corpus (TSCC), IntrEx extends prior work by incorporating sequence-level annotations, allowing the study of engagement beyond isolated turns and capturing how interest evolves over extended dialogues. We employ a rigorous annotation process with over 100 second-language learners, using a comparison-based rating approach inspired by reinforcement learning from human feedback (RLHF) to improve agreement. We investigate whether large language models (LLMs) can predict human interestingness judgments. We find that LLMs (7B/8B parameters) fine-tuned on interestingness ratings outperform larger proprietary models like GPT-4o, demonstrating the potential of specialised datasets for modeling engagement in educational settings. Finally, we analyze how linguistic and cognitive factors, such as concreteness, comprehensibility (readability), and uptake, influence engagement in educational dialogues.
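Comparison-based rating of the kind described, where pairwise preferences are aggregated into per-item scores as in RLHF reward modeling, is often fit with a Bradley-Terry model. The sketch below is an illustrative gradient-ascent fit under that assumption, not the paper's actual procedure.

```python
import math

def fit_bradley_terry(comparisons, n_items, iters=200, lr=0.5):
    """Fit latent Bradley-Terry scores s_i from pairwise outcomes.

    comparisons: list of (winner, loser) index pairs, where the winner
    was judged more interesting. P(i beats j) = sigmoid(s_i - s_j).
    """
    s = [0.0] * n_items
    for _ in range(iters):
        grad = [0.0] * n_items
        for w, l in comparisons:
            p_win = 1.0 / (1.0 + math.exp(s[l] - s[w]))  # model's P(w beats l)
            grad[w] += 1.0 - p_win   # push winner's score up
            grad[l] -= 1.0 - p_win   # push loser's score down
        for i in range(n_items):
            s[i] += lr * grad[i] / max(len(comparisons), 1)
        mean = sum(s) / n_items      # center for identifiability
        s = [si - mean for si in s]
    return s
```

Because only score differences matter, the fitted scores are centered each iteration; the recovered ordering, rather than the absolute values, is what a comparison-based annotation scheme uses.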
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- (8 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
CrowdAgent: Multi-Agent Managed Multi-Source Annotation System
Qin, Maosheng, Zhu, Renyu, Xia, Mingxuan, Chen, Chenkai, Zhu, Zhen, Lin, Minmin, Zhao, Junbo, Xu, Lu, Fan, Changjie, Wu, Runze, Wang, Haobo
High-quality annotated data is a cornerstone of modern Natural Language Processing (NLP). While recent methods begin to leverage diverse annotation sources, including Large Language Models (LLMs), Small Language Models (SLMs), and human experts, they often focus narrowly on the labeling step itself. A critical gap remains in the holistic process control required to manage these sources dynamically, addressing complex scheduling and quality-cost trade-offs in a unified manner. Inspired by real-world crowdsourcing companies, we introduce CrowdAgent, a multi-agent system that provides end-to-end process control by integrating task assignment, data annotation, and quality/cost management. It implements a novel methodology that rationally assigns tasks, enabling LLMs, SLMs, and human experts to advance synergistically in a collaborative annotation workflow. We demonstrate the effectiveness of CrowdAgent through extensive experiments on six diverse multimodal classification tasks. The source code and video demo are available at https://github.com/QMMMS/CrowdAgent.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Singapore (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (11 more...)
- Research Report (1.00)
- Overview (0.68)
- Workflow (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- (2 more...)