Goto

Collaborating Authors

 concept prompt


CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection

Chen, Qibo, Jin, Weizhong, Ge, Jianyue, Liu, Mengdi, Yan, Yuchao, Jiang, Jian, Yu, Li, Guo, Xuanjiang, Li, Shuchang, Chen, Jianzhong

arXiv.org Artificial Intelligence

Recent research on universal object detection aims to introduce language in a SoTA closed-set detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to genericise objects and (ii) how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios, with only one pre-training weight. Specifically, we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules. Then, the hybrid encoder is facilitated to fully utilize the prompted information by prompt multi-label loss and auxiliary detection head. In addition to text prompts, we have designed two practical concept prompt generation methods, visual prompt and optimized prompt, to extract abstract concepts through concrete visual examples and stably reduce alignment bias in downstream tasks. With these effective designs, CP-DETR demonstrates superior universal detection performance in a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val by interactive detection, and the optimized prompt achieves 73.1 fully-shot AP on ODinW13.


LLM-Measure: Generating Valid, Consistent, and Reproducible Text-Based Measures for Social Science Research

Yang, Yi, Duan, Hanyu, Liu, Jiaxin, Tam, Kar Yan

arXiv.org Artificial Intelligence

The increasing use of text as data in social science research necessitates the development of valid, consistent, reproducible, and efficient methods for generating text-based concept measures. This paper presents a novel method that leverages the internal hidden states of large language models (LLMs) to generate these concept measures. Specifically, the proposed method learns a concept vector that captures how the LLM internally represents the target concept, then estimates the concept value for text data by projecting the text's LLM hidden states onto the concept vector. Three replication studies demonstrate the method's effectiveness in producing highly valid, consistent, and reproducible text-based measures across various social science research contexts, highlighting its potential as a valuable tool for the research community.


Efficient Detection of Toxic Prompts in Large Language Models

Liu, Yi, Yu, Junzhe, Sun, Huijia, Shi, Ling, Deng, Gelei, Chen, Yuqi, Liu, Yang

arXiv.org Artificial Intelligence

Large language models (LLMs) like ChatGPT and Gemini have significantly advanced natural language processing, enabling various applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often employ jailbreaking techniques to bypass safety mechanisms, highlighting the need for robust toxic prompt detection methods. Existing detection techniques, both blackbox and whitebox, face challenges related to the diversity of toxic prompts, scalability, and computational efficiency. In response, we propose ToxicDetector, a lightweight greybox method designed to efficiently detect toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification. Our evaluation on various versions of the LLama models, Gemma-2, and multiple datasets demonstrates that ToxicDetector achieves a high accuracy of 96.39\% and a low false positive rate of 2.00\%, outperforming state-of-the-art methods. Additionally, ToxicDetector's processing time of 0.0780 seconds per prompt makes it highly suitable for real-time applications. ToxicDetector achieves high accuracy, efficiency, and scalability, making it a practical method for toxic prompt detection in LLMs.


MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

Yang, Bang, Liu, Fenglin, Wu, Xian, Wang, Yaowei, Sun, Xu, Zou, Yuexian

arXiv.org Artificial Intelligence

Supervised visual captioning models typically require a large scale of images or videos paired with descriptions in a specific language (i.e., the vision-caption pairs) for training. However, collecting and labeling large-scale datasets is time-consuming and expensive for many scenarios and languages. Therefore, sufficient labeled pairs are usually not available. To deal with the label shortage problem, we present a simple yet effective zero-shot approach MultiCapCLIP that can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets. In the training stage, MultiCapCLIP only requires text data for input. Then it conducts two main steps: 1) retrieving concept prompts that preserve the corresponding domain knowledge of new scenarios; 2) auto-encoding the prompts to learn writing styles to output captions in a desired language. In the testing stage, MultiCapCLIP instead takes visual data as input directly to retrieve the concept prompts to generate the final visual descriptions. The extensive experiments on image and video captioning across four benchmarks and four languages (i.e., English, Chinese, German, and French) confirm the effectiveness of our approach. Compared with state-of-the-art zero-shot and weakly-supervised methods, our method achieves 4.8% and 21.5% absolute improvements in terms of BLEU@4 and CIDEr metrics. Our code is available at https://github.com/yangbang18/MultiCapCLIP.