



in which the model keeps predicting the class of each unlabeled sample and learns from the feedback of whether each prediction is correct

Neural Information Processing Systems

We thank all reviewers for their valuable comments. The reviewers are generally satisfied with our writing and experiments. The remaining concerns are minor; we respond to them carefully below and will revise the paper accordingly. Q: Can one-bit supervision reduce annotation costs? A1: Please refer to the common question. According to the authors of ILSVRC2012 [Russakovsky et al., IJCV'15], the average time for a full-bit annotation is


Interactive Multi-fidelity Learning for Cost-effective Adaptation of Language Model with Sparse Human Supervision

Neural Information Processing Systems

Large language models (LLMs) have demonstrated remarkable capabilities in various tasks. However, their suitability for domain-specific tasks is limited by their immense scale at deployment, susceptibility to misinformation, and, more importantly, high data annotation costs. We propose a novel Interactive Multi-Fidelity Learning (IMFL) framework for the cost-effective development of small domain-specific LMs under limited annotation budgets. Our approach formulates the domain-specific fine-tuning process as a multi-fidelity learning problem, focusing on identifying the optimal acquisition strategy that balances low-fidelity automatic LLM annotations against high-fidelity human annotations to maximize model performance. We further propose an exploration-exploitation query strategy that enhances annotation diversity and informativeness, incorporating two innovative designs: 1) prompt retrieval, which selects in-context examples from human-annotated samples to improve LLM annotation, and 2) a variable batch size, which controls the order in which each fidelity is chosen to facilitate knowledge distillation, ultimately enhancing annotation quality. Extensive experiments on financial and medical tasks demonstrate that IMFL achieves superior performance compared with single-fidelity annotation. Given a limited budget of human annotation, IMFL significantly outperforms the $\bf 3\times$ human annotation baselines on all four tasks and comes very close to $\bf 5\times$ human annotation on two of them. These promising results suggest that the high human annotation costs of domain-specific tasks can be significantly reduced by employing IMFL, which uses fewer human annotations supplemented with cheaper and faster LLM (e.g., GPT-3.5) annotations.
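The budget-constrained acquisition loop described above can be sketched as follows. All names and numbers here (acquire_annotations, the per-fidelity costs, the alternating batch sizes) are illustrative assumptions for the sketch, not the paper's actual interface:

```python
def acquire_annotations(samples, budget, human_cost=1.0, llm_cost=0.25,
                        human_batch=4, llm_batch=16):
    """Toy multi-fidelity acquisition loop: alternate between a small batch
    of expensive high-fidelity human labels and a larger batch of cheap
    low-fidelity LLM labels until the annotation budget is exhausted."""
    pool = list(samples)
    labeled = []          # (sample, fidelity) pairs
    spent = 0.0
    use_human = True      # variable batch ordering: start with human labels
    while pool and spent < budget:
        if use_human:
            batch, cost, fidelity = pool[:human_batch], human_cost, "human"
        else:
            batch, cost, fidelity = pool[:llm_batch], llm_cost, "llm"
        affordable = int((budget - spent) // cost)
        batch = batch[:affordable]
        if not batch:
            break         # cannot afford even one more label at this fidelity
        labeled.extend((s, fidelity) for s in batch)
        pool = pool[len(batch):]
        spent += cost * len(batch)
        use_human = not use_human
    return labeled, spent
```

With a budget of 10 units the loop buys 4 human labels, then 16 LLM labels, then 2 more human labels before the budget runs out.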


Actively Testing Your Model While It Learns: Realizing Label-Efficient Learning in Practice

Neural Information Processing Systems

In active learning (AL), the focus is on reducing the data annotation cost from the model training perspective. However, "testing", which often refers to the model evaluation process of using empirical risk to estimate the intractable true generalization risk, also requires data annotations. The annotation cost of testing (model evaluation) is under-explored. Even in works that study active model evaluation or active testing (AT), the learning and testing ends are disconnected. In this paper, we propose a novel active testing while learning (ATL) framework that integrates active learning with active testing.
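The role of "testing" here is worth making concrete: the empirical risk on a small annotated test set is a Monte Carlo estimate of the intractable generalization risk. A minimal generic sketch (not the paper's ATL estimator):

```python
def empirical_risk(model, test_samples, loss):
    """Estimate the true generalization risk by the empirical risk
    (average loss) over an annotated test set."""
    return sum(loss(model(x), y) for x, y in test_samples) / len(test_samples)

# Usage with a toy threshold classifier and 0-1 loss.
zero_one = lambda pred, y: 0.0 if pred == y else 1.0
model = lambda x: x >= 0
samples = [(1, True), (-1, True), (2, True), (-3, False)]
```

Every (x, y) pair in `test_samples` costs one annotation, which is exactly the cost the abstract argues is under-explored.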


Are all Frames Equal? Active Sparse Labeling for Video Action Detection

Neural Information Processing Systems

Video action detection requires annotations at every frame, which drastically increases the labeling cost. In this work, we focus on efficient labeling of videos for action detection to minimize this cost. We propose active sparse labeling (ASL), a novel active learning strategy for video action detection. Sparse labeling reduces the annotation cost but poses two main challenges: 1) how to estimate the utility of annotating a single frame when detection is performed at the video level, and 2) how these sparse labels can be used to train an action detector that would otherwise require annotations on all frames. This work addresses these challenges within a simple active learning framework. For the first challenge, we propose a novel frame-level scoring mechanism aimed at selecting the most informative frames in a video. Next, we introduce a novel loss formulation that enables training an action detection model with these sparsely selected frames. We evaluate the proposed approach on two action detection benchmark datasets, UCF-101-24 and J-HMDB-21, and observe that active sparse labeling can be very effective in saving annotation costs. We demonstrate that the proposed approach performs better than random selection, outperforming all other baselines, with performance comparable to the fully supervised approach using merely 10% of the annotations.
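A generic uncertainty-based version of such a frame-level scoring mechanism might look like the sketch below; entropy scoring is a stand-in assumption here, as the paper proposes its own detection-aware score:

```python
import math

def select_frames(frame_probs, k):
    """Pick the k most uncertain frames by prediction entropy.
    frame_probs: list of per-frame class-probability vectors."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    ranked = sorted(range(len(frame_probs)),
                    key=lambda i: entropy(frame_probs[i]), reverse=True)
    return sorted(ranked[:k])  # frame indices to send for annotation
```

Frames whose class distribution is close to uniform (highest entropy) are selected; confidently predicted frames stay unlabeled.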


From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations

Guo, Zhiqing, Xi, Dongdong, Li, Songlin, Yang, Gaobo

arXiv.org Artificial Intelligence

Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets and real-world deployment. In contrast, most existing weakly supervised IML approaches rely on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we propose a coarse region annotation strategy that generates relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, in which long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. Extensive experiments across both in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully supervised models while maintaining strong generalization, low annotation cost, and efficient deployment characteristics.
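The teacher-student distillation step can be illustrated with a pixel-wise soft-label cross-entropy between the student's predicted mask and the SAM teacher's soft mask. This is a minimal sketch over flat probability lists and is an assumption for illustration; the paper's full objective is richer:

```python
import math

def distillation_loss(student_probs, teacher_probs, eps=1e-7):
    """Average pixel-wise binary cross-entropy of the student's mask
    probabilities against the teacher's soft mask targets."""
    assert len(student_probs) == len(teacher_probs)
    total = 0.0
    for s, t in zip(student_probs, teacher_probs):
        s = min(max(s, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(s) + (1 - t) * math.log(1 - s))
    return total / len(student_probs)
```

The loss is minimized when the student's per-pixel probabilities match the teacher's, which is the sense in which the lightweight student inherits SAM's localization behavior.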



8fc687aa152e8199fe9e73304d407bca-AuthorFeedback.pdf

Neural Information Processing Systems

Below, we conduct a point-by-point response to each comment. We will discuss them in the revised manuscript. [ECCV18] apply the SPL process to refine the saliency maps obtained in co-saliency detection. As we mentioned in the paper, these works share a similar spirit with ours. As for the difference, Yan et al. [23] define their problem on sparsely annotated video frames. It requires generating image data for different problem domains.



When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

Al-Hamadani, Samer

arXiv.org Artificial Intelligence

Object detection constitutes a foundational computer vision capability enabling diverse applications from autonomous vehicles to retail analytics, with modern deep learning approaches achieving remarkable technical performance exceeding 90% mean Average Precision on standardized benchmarks [1, 2]. However, technical accuracy represents only one dimension of deployment viability, as real-world system selection requires evaluating cost-effectiveness--the relationship between detection performance and total economic investment required to achieve that performance [3, 4]. Traditional supervised detectors, exemplified by the YOLO architecture family [2, 5], rely fundamentally on manually annotated training data, with industry reports estimating annotation costs between $0.10 and $0.50 per bounding box [6, 7], translating to $9,000-$45,000 for establishing 100-category detection systems with sufficient training data. Vision-Language Models represent an alternative paradigm achieving object detection through zero-shot inference without task-specific supervision [8-10]. Pre-trained on billions of image-text pairs, VLMs accept natural language object descriptions and generate bounding box predictions through learned visual-linguistic alignment, fundamentally eliminating annotation requirements.
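The quoted budget range follows from simple arithmetic on the per-box rates. The sketch below reproduces it; `boxes_per_category = 900` is an assumption chosen so that the $0.10-$0.50 per-box rates recover the quoted $9,000-$45,000 span for a 100-category system:

```python
def annotation_budget(num_categories, boxes_per_category,
                      cost_low=0.10, cost_high=0.50):
    """Back-of-envelope annotation cost range: total boxes times the
    low and high per-box rates cited in the article."""
    n_boxes = num_categories * boxes_per_category
    return n_boxes * cost_low, n_boxes * cost_high

# 100 categories x 900 boxes = 90,000 boxes -> roughly $9,000 to $45,000
low, high = annotation_budget(100, 900)
```

Any dataset with a different number of boxes per category scales the range linearly, which is why the annotation-free VLM alternative becomes attractive as category counts grow.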