DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code
Shriyansh Agrawal, Aidan Lau, Sanyam Shah, Ahan M R, Kevin Zhu, Sunishchal Dev, Vasu Sharma
–arXiv.org Artificial Intelligence
The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has increased the imperative for machine-generated content detectors that are accurate and efficient across domains. Current detectors, which predominantly use zero-shot methods such as Fast-DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often trading one for the other, leaving room for improvement. To address these gaps, we propose fine-tuning encoder-only Small Language Models (SLMs), in particular the pre-trained RoBERTa and CodeBERTa models, on specialized source-code and natural-language datasets, showing that for the task of binary classification, SLMs outperform LLMs by a substantial margin while using a fraction of the compute. Our encoders achieve AUROC $= 0.97$ to $0.99$ and macro-F1 of $0.89$ to $0.94$ while reducing latency by $8$-$12\times$ and peak VRAM by $3$-$5\times$ at $512$-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains $\geq 92\%$ of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included.
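The abstract reports AUROC and macro-F1 as its headline metrics for the binary human-vs-machine classification task. As a hedged illustration (not the authors' evaluation code), the sketch below computes both metrics from scratch on invented toy detector scores, using the rank-sum formulation of AUROC; all data values are hypothetical.

```python
# Illustrative-only computation of AUROC and macro-F1, the two metrics
# reported in the abstract. Labels/scores below are invented toy data.

def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Fraction of (positive, negative) pairs where the positive
    # outranks the negative; ties count as 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_f1(labels, preds):
    """Unweighted mean of per-class F1 over the binary classes {0, 1}."""
    f1s = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Toy data: 1 = machine-generated, 0 = human-written.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]   # hypothetical detector confidences
preds = [1 if s >= 0.5 else 0 for s in scores]

print(round(auroc(labels, scores), 3))   # → 0.889
print(round(macro_f1(labels, preds), 3)) # → 0.667
```

The rank-sum form of AUROC avoids explicitly sweeping thresholds, which keeps the example short; in practice a library routine such as scikit-learn's `roc_auc_score` would be used instead.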
Oct-23-2025