GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages
Dang, Trung Duc Anh, D'Elia, Ferdinando Pio
arXiv.org Artificial Intelligence
As social-media platforms emerge and evolve faster than the regulations meant to oversee them, automated detoxification might serve as a timely tool for moderators to enforce safe discourse at scale. Here we describe our submission to the PAN 2025 Multilingual Text Detoxification Challenge, which rewrites toxic single-sentence inputs into neutral paraphrases across 15 typologically diverse languages. Building on a 12B-parameter Gemma-3 multilingual transformer, we apply parameter-efficient supervised fine-tuning (SFT) with LoRA and prompting techniques such as few-shot and Chain-of-Thought (CoT) prompting. Our multilingual training corpus combines 3,600 human-authored parallel pairs, 21,600 machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds. At inference, inputs are enriched with three LaBSE-retrieved neighbors and explicit toxic-span annotations. Evaluated via Style Transfer Accuracy, LaBSE-based semantic preservation, and xCOMET fluency, our system ranks first on both high-resource and low-resource languages. Ablations show a +0.081 increase in joint score from few-shot examples and +0.088 from basic CoT prompting. An ANOVA identifies language resource status as the strongest predictor of performance ($\eta^2$ = 0.667, p < 0.01).
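The abstract mentions filtering model-generated pairs by Jaccard thresholds but gives no details. The following is a minimal sketch of what such a filter could look like, not the authors' implementation. It assumes token-level Jaccard similarity between the toxic source and the generated rewrite, and keeps a pair only if the similarity falls inside a band: too low suggests the meaning drifted, too high suggests the text was barely changed. The function names, the regex tokenizer, and the `low`/`high` values are all hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    # Assumption: lowercase word tokens; the paper's tokenization is not stated.
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    # Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def filter_pairs(pairs, low=0.3, high=0.9):
    # Keep (source, rewrite) pairs whose similarity lies within [low, high].
    # The threshold values are illustrative, not taken from the paper.
    return [(src, tgt) for src, tgt in pairs if low <= jaccard(src, tgt) <= high]


candidates = [
    ("you are a stupid person", "you are a person"),      # moderate overlap
    ("you are a stupid person", "you are a stupid person"),  # unchanged
    ("shut up idiot", "the weather is nice"),              # no overlap
]
kept = filter_pairs(candidates)
```

With these illustrative thresholds, only the first candidate survives: the unchanged copy exceeds `high`, and the unrelated rewrite falls below `low`.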
Oct-3-2025
- Genre:
- Research Report > New Finding (0.66)
- Technology: