annotation model
VGR: Visual Grounded Reasoning
Wang, Jiacong, Kang, Zijian, Wang, Haochen, Jiang, Haiyong, Li, Jiawen, Wu, Bohong, Wang, Ya, Ran, Jiao, Liang, Xiao, Feng, Chao, Xiao, Jun
In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning on pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer the question or reasoning solely on the language space, our VGR first detects relevant regions that may help to solve problems, and then provides precise answers based on replayed image regions. To achieve this, we conduct a large-scale SFT dataset called VGR -SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference and a replay stage is introduced to integrates the corresponding regions into the reasoning process, enhancing multimodel comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multi-modal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30\% of the image token count while delivering scores of +4.1 on MMStar, +7.1 on AI2D, and a +12.9 improvement on ChartQA.
Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning
Ohnaka, Hien, Shirahata, Yuma, Park, Byeongseon, Yamamoto, Ryuichi
We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Unlike previous methods that simply fine-tune a pre-trained ASR model with the labels, the proposed model conditions the label generation on corresponding graphemes by two methods: 1) Add implicit grapheme conditioning through prompt encoder using pre-trained BERT features. 2) Explicitly prune the label hypotheses inconsistent with the grapheme during inference. These methods enable obtaining parallel data of speech, the labels, and graphemes, which is applicable to various downstream tasks such as text-to-speech and accent estimation from text. Experiments showed that the proposed method significantly improved the consistency between graphemes and the predicted labels. Further, experiments on accent estimation task confirmed that the created parallel data by the proposed method effectively improve the estimation accuracy.
DarkBench: Benchmarking Dark Patterns in Large Language Models
Kran, Esben, Nguyen, Hieu Minh "Jord", Kundu, Akash, Jawhar, Sami, Park, Jinsuk, Jurewicz, Mateusz Maria
Measuring these dark patterns is essential for understanding and mitigating the potential manipulative behaviors of LLMs. While some patterns, like Brand Bias and User Retention, were adapted directly from known dark patterns in UI/UX, others, like Harmful Generation and Anthropomorphization, represent critical risks not explicitly addressed in Brignull and Darlo (2010)'s taxonomy. Table 4 demonstrates how these categories map to or expand on established dark patterns, providing a foundation for their inclusion. However, some risks, particularly Anthropomorphization and Harmful Generation, require additional justification. Anthropomorphization, the attribution of human-like characteristics to AI systems, has been identified as a key factor in enhancing user engagement and trust.
Estimating Agreement by Chance for Sequence Annotation
Li, Diya, Rosรฉ, Carolyn, Yuan, Ao, Zhou, Chunxiao
In the field of natural language processing, correction of performance assessment for chance agreement plays a crucial role in evaluating the reliability of annotations. However, there is a notable dearth of research focusing on chance correction for assessing the reliability of sequence annotation tasks, despite their widespread prevalence in the field. To address this gap, this paper introduces a novel model for generating random annotations, which serves as the foundation for estimating chance agreement in sequence annotation tasks. Utilizing the proposed randomization model and a related comparison approach, we successfully derive the analytical form of the distribution, enabling the computation of the probable location of each annotated text segment and subsequent chance agreement estimation. Through a combination simulation and corpus-based evaluation, we successfully assess its applicability and validate its accuracy and efficacy.
ARAIDA: Analogical Reasoning-Augmented Interactive Data Annotation
Huang, Chen, Jin, Yiping, Ilievski, Ilija, Lei, Wenqiang, Lv, Jiancheng
Human annotation is a time-consuming task that requires a significant amount of effort. To address this issue, interactive data annotation utilizes an annotation model to provide suggestions for humans to approve or correct. However, annotation models trained with limited labeled data are prone to generating incorrect suggestions, leading to extra human correction effort. To tackle this challenge, we propose Araida, an analogical reasoning-based approach that enhances automatic annotation accuracy in the interactive data annotation setting and reduces the need for human corrections. Araida involves an error-aware integration strategy that dynamically coordinates an annotation model and a k-nearest neighbors (KNN) model, giving more importance to KNN's predictions when predictions from the annotation model are deemed inaccurate. Empirical studies demonstrate that Araida is adaptable to different annotation tasks and models. On average, it reduces human correction labor by 11.02% compared to vanilla interactive data annotation methods.
Wakeword Detection under Distribution Shifts
Parthasarathi, Sree Hari Krishnan, Zeng, Lu, Jose, Christin, Wang, Joseph
We propose a novel approach for semi-supervised learning (SSL) designed to overcome distribution shifts between training and real-world data arising in the keyword spotting (KWS) task. Shifts from training data distribution are a key challenge for real-world KWS tasks: when a new model is deployed on device, the gating of the accepted data undergoes a shift in distribution, making the problem of timely updates via subsequent deployments hard. Despite the shift, we assume that the marginal distributions on labels do not change. We utilize a modified teacher/student training framework, where labeled training data is augmented with unlabeled data. Note that the teacher does not have access to the new distribution as well. To train effectively with a mix of human and teacher labeled data, we develop a teacher labeling strategy based on confidence heuristics to reduce entropy on the label distribution from the teacher model; the data is then sampled to match the marginal distribution on the labels. Large scale experimental results show that a convolutional neural network (CNN) trained on far-field audio, and evaluated on far-field audio drawn from a different distribution, obtains a 14.3% relative improvement in false discovery rate (FDR) at equal false reject rate (FRR), while yielding a 5% improvement in FDR under no distribution shift. Under a more severe distribution shift from far-field to near-field audio with a smaller fully connected network (FCN) our approach achieves a 52% relative improvement in FDR at equal FRR, while yielding a 20% relative improvement in FDR on the original distribution.
Joint Multi-Dimensional Model for Global and Time-Series Annotations
Ramakrishna, Anil, Gupta, Rahul, Narayanan, Shrikanth
Crowdsourcing is a popular approach to collect annotations for unlabeled data instances. It involves collecting a large number of annotations from several, often naive untrained annotators for each data instance which are then combined to estimate the ground truth. Further, annotations for constructs such as affect are often multi-dimensional with annotators rating multiple dimensions, such as valence and arousal, for each instance. Most annotation fusion schemes however ignore this aspect and model each dimension separately. In this work we address this by proposing a generative model for multi-dimensional annotation fusion, which models the dimensions jointly leading to more accurate ground truth estimates. The model we propose is applicable to both global and time series annotation fusion problems and treats the ground truth as a latent variable distorted by the annotators. The model parameters are estimated using the Expectation-Maximization algorithm and we evaluate its performance using synthetic data and real emotion corpora as well as on an artificial task with human annotations
GPM: A Generic Probabilistic Model to Recover Annotator's Behavior and Ground Truth Labeling
Li, Jing, Ling, Suiyi, Wang, Junle, Li, Zhi, Callet, Patrick Le
In the big data era, data labeling can be obtained through crowdsourcing. Nevertheless, the obtained labels are generally noisy, unreliable or even adversarial. In this paper, we propose a probabilistic graphical annotation model to infer the underlying ground truth and annotator's behavior. To accommodate both discrete and continuous application scenarios (e.g., classifying scenes vs. rating videos on a Likert scale), the underlying ground truth is considered following a distribution rather than a single value. In this way, the reliable but potentially divergent opinions from "good" annotators can be recovered. The proposed model is able to identify whether an annotator has worked diligently towards the task during the labeling procedure, which could be used for further selection of qualified annotators. Our model has been tested on both simulated data and real-world data, where it always shows superior performance than the other state-of-the-art models in terms of accuracy and robustness.
Kill Two Birds With One Stone: Weakly-Supervised Neural Network for Image Annotation and Tag Refinement
Zhang, Junjie (University of Technology Sydney) | Wu, Qi (University of Adelaide) | Zhang, Jian (University of Technology Sydney) | Shen, Chunhua (University of Adelaide) | Lu, Jianfeng (Nanjing University of Science and Technology)
The number of social images has exploded by the wide adoption of social networks, and people like to share their comments about them. These comments can be a description of the image, or some objects, attributes, scenes in it, which are normally used as the user-provided tags. However, it is well-known that user-provided tags are incomplete and imprecise to some extent. Directly using them can damage the performance of related applications, such as the image annotation and retrieval. In this paper, we propose to learn an image annotation model and refine the user-provided tags simultaneously in a weakly-supervised manner. The deep neural network is utilized as the image feature learning and backbone annotation model, while visual consistency, semantic dependency, and user-error sparsity are introduced as the constraints at the batch level to alleviate the tag noise. Therefore, our model is highly flexible and stable to handle large-scale image sets. Experimental results on two benchmark datasets indicate that our proposed model achieves the best performance compared to the state-of-the-art methods.