chatgpt detector
Investigating the Influence of Prompt-Specific Shortcuts in AI Generated Text Detection
Park, Choonghyun, Kim, Hyuhng Joon, Kim, Junyeob, Kim, Youna, Kim, Taeuk, Cho, Hyunsoo, Jo, Hwiyeol, Lee, Sang-goo, Yoo, Kang Min
AI Generated Text (AIGT) detectors are developed with texts from humans and LLMs of common tasks. Despite the diversity of plausible prompt choices, these datasets are generally constructed with a limited number of prompts. The lack of prompt variation can introduce prompt-specific shortcut features that exist in data collected with the chosen prompt, but do not generalize to others. In this paper, we analyze the impact of such shortcuts in AIGT detection. We propose Feedback-based Adversarial Instruction List Optimization (FAILOpt), an attack that searches for instructions deceptive to AIGT detectors exploiting prompt-specific shortcuts. FAILOpt effectively drops the detection performance of the target detector, comparable to other attacks based on adversarial in-context examples. We also utilize our method to enhance the robustness of the detector by mitigating the shortcuts. Based on the findings, we further train the classifier with the dataset augmented by FAILOpt prompt. The augmented classifier exhibits improvements across generation models, tasks, and attacks. Our code will be available at https://github.com/zxcvvxcz/FAILOpt.
Evade ChatGPT Detectors via A Single Space
ChatGPT brings revolutionary social value but also raises concerns about the misuse of AI-generated text. Consequently, an important question is how to detect whether texts are generated by ChatGPT or by a human. Existing detectors are built upon the assumption that there are distributional gaps between human-generated and AI-generated text. These gaps are typically identified using statistical information or classifiers. Our research challenges the distributional gap assumption in detectors. We find that detectors do not effectively discriminate the semantic and stylistic gaps between human-generated and AI-generated text. Instead, the "subtle differences", such as an extra space, become crucial for detection. Based on this discovery, we propose the SpaceInfi strategy to evade detection. Experiments demonstrate the effectiveness of this strategy across multiple benchmarks and detectors. We also provide a theoretical explanation for why SpaceInfi is successful in evading perplexity-based detection, and we empirically show that a phenomenon called token mutation causes the evasion for language model-based detectors. Our findings offer new insights and challenges for understanding and constructing more applicable ChatGPT detectors.
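As a rough illustration of the kind of "single extra space" perturbation the abstract above describes, here is a minimal Python sketch. The abstract does not say where the space is placed, so placing it before a randomly chosen comma is an assumption made here for illustration, and the function name insert_single_space is hypothetical rather than taken from the paper.

```python
import random
from typing import Optional


def insert_single_space(text: str, rng: Optional[random.Random] = None) -> str:
    """Return `text` with exactly one extra space character inserted.

    Assumption (not stated in the abstract): the space goes immediately
    before a randomly chosen comma. If the text has no comma, the space
    is appended at the end instead.
    """
    rng = rng or random.Random()
    # Collect the index of every comma in the text.
    comma_positions = [i for i, ch in enumerate(text) if ch == ","]
    if not comma_positions:
        return text + " "
    pos = rng.choice(comma_positions)
    # Splice one space in front of the chosen comma.
    return text[:pos] + " " + text[pos:]


if __name__ == "__main__":
    sample = "Detectors rely on distributional gaps, but small edits matter, too."
    print(insert_single_space(sample, random.Random(0)))
```

The perturbed text differs from the input by one character, which is what makes this class of evasion hard to notice for a human reader.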
Towards a Robust Detection of Language Model Generated Text: Is ChatGPT that Easy to Detect?
Antoun, Wissam, Mouilleron, Virginie, Sagot, Benoît, Seddah, Djamé
Advances in natural language processing (NLP) have been driven mainly by scaling up the size of pre-trained language models, along with the amount of data and compute required for training (Raffel et al., 2020; Radford et al., 2019; Rae et al., 2021; Fedus et al., 2021; Hoffmann et al., 2022). OpenAI recently released ChatGPT, a text generation model with conversational capabilities. The model is based on GPT3.5, which is a version of GPT3 (Brown et al., 2020) first fine-tuned on code then fine-tuned using Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Stiennon et al., 2020), a method previously demonstrated by OpenAI with Instruct-GPT (Ouyang et al., 2022). This fine-tuning process not only contributes to the model's knowledge but also simplifies the model's interface compared to GPT3, which necessitated substantial prompt engineering to achieve satisfactory outcomes, hence facilitating the extraction and application of that built-in knowledge. As a result of these significant performance improvements, ChatGPT and other large language models have gained much popularity in the media and in the social context, often without a full understanding of the underlying limitations of the models - e.g., the possibility of generating hateful, toxic, or disrespectful content (Bender et al., 2021; McGuffie & Newhouse, 2020; Weidinger et al., 2021). Another potential misuse of LLMs or ChatGPT is industrializing radicalization and harmful propaganda, which poses a significant and unconventional threat to civil society. In response to the mounting concerns surrounding potential misuse, numerous researchers are now exploring various strategies to mitigate associated risks.
MGTBench: Benchmarking Machine-Generated Text Detection
He, Xinlei, Shen, Xinyue, Chen, Zeyuan, Backes, Michael, Zhang, Yang
Nowadays, large language models (LLMs) have shown revolutionary power in a variety of natural language processing (NLP) tasks such as text classification, sentiment analysis, language translation, and question-answering. In this way, detecting machine-generated texts (MGTs) is becoming increasingly important as LLMs become more advanced and prevalent. These models can generate human-like language that can be difficult to distinguish from text written by a human, which raises concerns about authenticity, accountability, and potential bias. However, existing detection methods against MGTs are evaluated under different model architectures, datasets, and experimental settings, resulting in a lack of a comprehensive evaluation framework across different methodologies. In this paper, we fill this gap by proposing the first benchmark framework for MGT detection, named MGTBench. Extensive evaluations on public datasets with curated answers generated by ChatGPT (the most representative and powerful LLM thus far) show that most of the current detection methods perform less satisfactorily against MGTs. An exceptional case is ChatGPT Detector, which is trained with ChatGPT-generated texts and shows great performance in detecting MGTs. Nonetheless, we note that only a small fraction of adversarially crafted perturbations on MGTs can evade the ChatGPT Detector, thus highlighting the need for more robust MGT detection methods. We envision that MGTBench will serve as a benchmark tool to accelerate future investigations involving the evaluation of state-of-the-art MGT detection methods on their respective datasets and the development of more advanced MGT detection methods. Our source code and datasets are available at https://github.com/xinleihe/MGTBench.
ChatGPT detector could help spot cheaters using AI to write essays
People can use OpenAI's ChatGPT to generate almost any text they want. A web tool called GPTZero can identify whether an essay was generated by the artificial intelligence chatbot ChatGPT with high accuracy. This could help identify cheating in schools and misinformation, but only if OpenAI, the company behind the popular chatbot, continues to give access to the underlying AI models. OpenAI is reportedly working on inserting a watermark into text that its models generate.