LLM-generated text
- Europe > Austria > Vienna (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > Macao (0.04)
- (20 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)
Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text
Zhou, Hongyi, Zhu, Jin, Xu, Erhan, Ye, Kai, Yang, Ying, Shi, Chengchun
Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, making reliable algorithms for detecting LLM-generated content an urgent need. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 57.8% to 80.6% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini).

The past few years have witnessed the emergence and rapid development of large language models (LLMs) such as GPT (Hurst et al., 2024), DeepSeek (Liu et al., 2024), Claude (Anthropic, 2024), Gemini (Comanici et al., 2025), Grok (xAI, 2025) and Qwen (Yang et al., 2025). Their impact is everywhere, from education, academia and software development to healthcare and everyday life (Arora & Arora, 2023; Chan & Hu, 2023; Hou et al., 2024). On one side of the coin, LLMs can support users with conversational question answering, help students learn more effectively, draft emails, write computer code, prepare presentation slides and more.
On the other side, their ability to closely mimic human-written text also raises serious concerns, including the generation of biased or harmful content, the spread of misinformation in the news ecosystem, and the challenges related to authorship attribution and intellectual property (Dave et al., 2023; Fang et al., 2024; Messeri & Crockett, 2024; Mahajan et al., 2025; Laurito et al., 2025). Addressing these concerns requires effective algorithms to distinguish between human-written and LLM-generated text, which has become an active and popular research direction in recent literature (see Crothers et al., 2023; Wu et al., 2025, for reviews).
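The intuition behind such rewrite-based detection, that an LLM rewrite tends to move LLM-generated text less than it moves human-written text, and that a learned distance can exploit which embedding dimensions carry that signal, can be caricatured in a few lines. The sketch below fits per-dimension weights over synthetic "embedding gap" features (the absolute difference between a text's embedding and its rewrite's) with plain logistic regression; the dimensions, gap distributions, and weighting scheme are all invented for illustration and are not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 16, 400

# Synthetic |embedding(original) - embedding(rewrite)| gaps.
# Assumption (illustration only): rewrites move human-written text
# further than LLM-generated text, but only along a few dimensions.
informative = np.arange(4)
gap_llm = np.abs(rng.normal(0.1, 0.05, (n, d)))    # label 1: LLM-generated
gap_human = np.abs(rng.normal(0.1, 0.05, (n, d)))  # label 0: human-written
gap_human[:, informative] += 0.4
X = np.vstack([gap_llm, gap_human])
y = np.concatenate([np.ones(n), np.zeros(n)])

# "Learned distance": per-dimension weights fit by logistic regression,
# rather than a fixed distance that weighs all dimensions equally.
w, b, lr = np.zeros(d), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= lr * (X.T @ grad) / len(y)
    b -= lr * grad.mean()

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
acc = (pred == y).mean()
```

A fixed Euclidean distance would average the 4 informative dimensions together with 12 uninformative ones; the learned weights concentrate on the informative ones, which is the intuition for preferring an adaptive distance.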
- Europe > Austria > Vienna (0.14)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- (3 more...)
- Overview (1.00)
- Research Report > New Finding (0.45)
- Media (0.54)
- Education (0.48)
- Health & Medicine (0.48)
Detecting LLM-Generated Text with Performance Guarantees
Zhou, Hongyi, Zhu, Jin, Yang, Ying, Shi, Chengchun
Large language models (LLMs) such as GPT, Claude, Gemini, and Grok have been deeply integrated into our daily life. They now support a wide range of tasks -- from dialogue and email drafting to assisting with teaching and coding, serving as search engines, and much more. However, their ability to produce highly human-like text raises serious concerns, including the spread of fake news, the generation of misleading governmental reports, and academic misconduct. To address this practical problem, we train a classifier to determine whether a piece of text is authored by an LLM or a human. Our detector is deployed on an online CPU-based platform https://huggingface.co/spaces/stats-powered-ai/StatDetectLLM, and contains three novelties over existing detectors: (i) it does not rely on auxiliary information, such as watermarks or knowledge of the specific LLM used to generate the text; (ii) it more effectively distinguishes between human- and LLM-authored text; and (iii) it enables statistical inference, which is largely absent in the current literature. Empirically, our classifier achieves higher classification accuracy compared to existing detectors, while maintaining type-I error control, high statistical power, and computational efficiency.
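On type-I error control: a standard recipe, shown here as a generic sketch on synthetic detector scores rather than the actual StatDetectLLM procedure, is to calibrate the decision threshold as a high quantile of scores on held-out human-written text, so that at most a chosen fraction alpha of human texts is falsely flagged. The Gaussian score distributions below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical detector scores (higher = more LLM-like).
human_scores = rng.normal(0.0, 1.0, 2000)  # held-out human-written texts
llm_scores = rng.normal(2.5, 1.0, 2000)    # LLM-generated texts

alpha = 0.05
# Calibrate: flag a text as LLM-generated only if its score exceeds
# the (1 - alpha) quantile of the human calibration scores.
threshold = np.quantile(human_scores, 1 - alpha)

type_i_error = (human_scores > threshold).mean()  # approx. alpha
power = (llm_scores > threshold).mean()           # detection rate on LLM text
```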
- Media > News (1.00)
- Information Technology (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text
We present the first systematic investigation of Martin's Law - the empirical relationship between word frequency and polysemy - in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 10^2, reaches peak correlation (r > 0.6) at checkpoint 10^4, then degrades by checkpoint 10^5. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable (r ≈ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not monotonically increasing with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.
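The sense-operationalization step, clustering a word's contextualized embeddings and counting clusters as senses, can be sketched with a minimal hand-rolled DBSCAN. The embeddings below are simulated blobs standing in for two senses of one word; the eps/min_samples values and dimensionality are arbitrary illustration choices, not the paper's settings.

```python
import numpy as np

def dbscan(X, eps, min_samples):
    """Minimal DBSCAN: grow clusters from core points; -1 marks noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:  # expand the cluster through core points only
            j = stack.pop()
            if not core[j]:
                continue
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels

rng = np.random.default_rng(0)
# Simulated contextualized embeddings for one word, with two "senses"
# as well-separated blobs in an 8-dimensional space.
sense_a = rng.normal(0.0, 0.1, (50, 8))
sense_b = rng.normal(5.0, 0.1, (50, 8))
embeddings = np.vstack([sense_a, sense_b])

labels = dbscan(embeddings, eps=1.0, min_samples=5)
n_senses = labels.max() + 1  # cluster count = operationalized sense count
```

Martin's Law would then be tested by correlating each word's n_senses with its log corpus frequency across the vocabulary.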
LiteraryTaste: A Preference Dataset for Creative Writing Personalization
Chung, John Joon Young, Padmakumar, Vishakh, Roemmele, Melissa, Wang, Yi, Sun, Yuqian, Wang, Tiffany, Almeda, Shm Garanganao, Halperin, Brett A., Lu, Yuwen, Kreminski, Max
People have different creative writing preferences, and large language models (LLMs) for these tasks can benefit from adapting to each user's preferences. However, these models are often trained over a dataset that considers varying personal tastes as a monolith. To facilitate developing personalized creative writing LLMs, we introduce LiteraryTaste, a dataset of reading preferences from 60 people, where each person: 1) self-reported their reading habits and tastes (stated preference), and 2) annotated their preferences over 100 pairs of short creative writing texts (revealed preference). With our dataset, we found that: 1) people diverge on creative writing preferences, 2) finetuning a transformer encoder could achieve 75.8% and 67.7% accuracy when modeling personal and collective revealed preferences, and 3) stated preferences had limited utility in modeling revealed preferences. With an LLM-driven interpretability pipeline, we analyzed how people's preferences vary. We hope our work serves as a cornerstone for personalizing creative writing technologies.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- (17 more...)
PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks
Zha, Yiwei, Min, Rui, Sushmita, Shanu
While AI-generated text (AIGT) detectors achieve over 90% accuracy on direct LLM outputs, they fail catastrophically against iteratively-paraphrased content. We investigate why iteratively-paraphrased text -- itself AI-generated -- evades detection systems designed for AIGT identification. Through intrinsic mechanism analysis, we reveal that iterative paraphrasing creates an intermediate laundering region characterized by semantic displacement with preserved generation patterns, which brings up two attack categories: paraphrasing human-authored text (authorship obfuscation) and paraphrasing LLM-generated text (plagiarism evasion). To address these vulnerabilities, we introduce PADBen, the first benchmark systematically evaluating detector robustness against both paraphrase attack scenarios. PADBen comprises a five-type text taxonomy capturing the full trajectory from original content to deeply laundered text, and five progressive detection tasks across sentence-pair and single-sentence challenges. We evaluate 11 state-of-the-art detectors, revealing critical asymmetry: detectors successfully identify the plagiarism evasion problem but fail for the case of authorship obfuscation. Our findings demonstrate that current detection approaches cannot effectively handle the intermediate laundering region, necessitating fundamental advances in detection architectures beyond existing semantic and stylistic discrimination methods. For detailed code implementation, please see https://github.com/JonathanZha47/PadBen-Paraphrase-Attack-Benchmark.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
Identifying the Periodicity of Information in Natural Language
Ou, Yulin, Wang, Yu, Xu, Yang, Buschmeier, Hendrik
Recent theoretical advances in information density in natural language have raised the following question: To what degree does natural language exhibit periodic patterns in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and is able to identify any significant periods that exist in the surprisal sequence of a single document. Applying the algorithm to a set of corpora yields the following results: first, a considerable proportion of human language demonstrates a strong pattern of periodicity in information; second, new periods that lie outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units, etc.) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome of both structured factors and other driving factors that take effect at longer distances. The advantages of our periodicity detection method and its potential for LLM-generation detection are discussed further.
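The periodogram step at the heart of such periodicity detection can be illustrated on a synthetic surprisal sequence. This is a sketch of the generic spectral-peak idea, not the APS implementation (which additionally selects candidate periods and tests their significance); the sequence and its embedded 16-token period are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n, true_period = 512, 16
t = np.arange(n)
# Synthetic per-token surprisal: a periodic component plus noise.
surprisal = (3.0 + 0.8 * np.sin(2 * np.pi * t / true_period)
             + rng.normal(0, 0.3, n))

# Periodogram: power spectrum of the mean-centered sequence.
power = np.abs(np.fft.rfft(surprisal - surprisal.mean())) ** 2
freqs = np.fft.rfftfreq(n)

k = 1 + np.argmax(power[1:])      # strongest non-DC frequency bin
detected_period = 1.0 / freqs[k]  # recovers roughly 16 tokens
```

In the real setting, the surprisal sequence would come from a language model's per-token negative log-probabilities over a document, and detected periods would be compared against the lengths of structural units such as sentences.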
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (8 more...)
Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations
Hassan, Syed Zohaib, Halvorsen, Pål, Johnson, Miriam S., Lison, Pierre
Large Language Models (LLMs), predominantly trained on adult conversational data, face significant challenges when generating authentic, child-like dialogue for specialized applications. We present a comparative study evaluating five different LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) to generate age-appropriate Norwegian conversations for children aged 5 and 9 years. Through a blind evaluation by eleven education professionals using both real child interview data and LLM-generated text samples, we assessed authenticity and developmental appropriateness. Our results show that evaluators achieved strong inter-rater reliability (ICC=0.75) and demonstrated higher accuracy in age prediction for younger children (5-year-olds) compared to older children (9-year-olds). While GPT-4 and NorBloom-7b performed relatively well, most models generated language perceived as more linguistically advanced than the target age groups. These findings highlight critical data-related challenges in developing LLM systems for specialized applications involving children, particularly in low-resource languages where comprehensive age-appropriate lexical resources are scarce.
- Europe > Norway > Eastern Norway > Oslo (0.05)
- Europe > Sweden (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?
Geng, Mingmeng, Poibeau, Thierry
With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Detectors nonetheless remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.
- North America > United States > Virginia (0.04)
- Asia > Middle East > Israel (0.04)
- Information Technology > Security & Privacy (0.94)
- Education (0.93)
ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection
Khairallah, Ali, Zubiaga, Arkaitz
Misuse of LLM-generated texts

Modern LLMs are increasingly capable of generating highly fluent, human-like texts and adapting to multiple dialects across genres [21]. While this progress unlocks many opportunities for resolving daily-life challenges, it also introduces risks in distinguishing between human- and machine-generated texts [22]. Undetected content can cause serious cyber threats including misinformation, academic dishonesty, and even more aggressive consequences such as phishing, smishing, and social engineering, where convincing texts are often used to manipulate individuals and organizations [23, 24]. For instance, the Anti-Phishing Working Group (APWG) indicated in its phishing activity trends report that 932,923 phishing attacks were recorded worldwide in the third quarter of 2024, and highlighted a significant 22% increase in smishing during the same period. APWG also reported that social media platforms were the most targeted sector, representing 30.5% of all phishing attacks [24]. Furthermore, according to the FBI Internet Crime Report, losses due to cybercrimes in the United States exceeded $12.5 billion in 2023 [25]. These examples illustrate how malicious manoeuvres can exploit machine-generated texts to scale deception. Hence, robust methods for detecting LLM-generated texts are urgently needed, particularly for linguistically sensitive scripts such as Arabic, where formal, legal, and religious texts require extra reliability.
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
- North America > United States > Florida > Duval County > Jacksonville (0.04)
- (7 more...)