ACADATA: Parallel Dataset of Academic Data for Machine Translation
Lacunza, Iñaki; Gilabert, Javier Garcia; Fornaciari, Francesca De Luca; Aula-Blasco, Javier; Gonzalez-Agirre, Aitor; Melero, Maite; Villegas, Marta
arXiv.org Artificial Intelligence
We present ACADATA, a high-quality parallel dataset for academic translation that consists of two subsets: ACAD-TRAIN, which contains approximately 1.5 million author-generated paragraph pairs across 96 language directions, and ACAD-BENCH, a curated evaluation set of almost 6,000 translations covering 12 directions. To validate its utility, we fine-tune two Large Language Models (LLMs) on ACAD-TRAIN and benchmark them on ACAD-BENCH against specialized machine-translation systems, general-purpose open-weight LLMs, and several large-scale proprietary models. Experimental results demonstrate that fine-tuning on ACAD-TRAIN improves academic translation quality by +6.1 and +12.4 d-BLEU points on average for the 7B and 2B models, respectively, while also improving long-context translation in the general domain by up to 24.9% when translating out of English. The top-performing fine-tuned model surpasses the best proprietary and open-weight models in the academic translation domain. By releasing ACAD-TRAIN, ACAD-BENCH, and the fine-tuned models, we provide the community with a valuable resource to advance research in academic-domain and long-context translation.
Oct-16-2025
- Country:
- Asia
- Japan > Kyūshū & Okinawa
- Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
- Middle East
- Israel (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Singapore (0.04)
- Europe
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Faroe Islands > Streymoy
- Tórshavn (0.04)
- Middle East > Republic of Türkiye
- Istanbul Province > Istanbul (0.04)
- Belgium
- Brussels-Capital Region > Brussels (0.04)
- Flanders > East Flanders
- Ghent (0.04)
- Norway > Eastern Norway
- Slovenia (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.14)
- South Yorkshire > Sheffield (0.04)
- Finland (0.04)
- Spain
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Austria > Vienna (0.14)
- North America
- Mexico > Mexico City
- Mexico City (0.04)
- Montserrat (0.04)
- United States
- California > Los Angeles County
- El Segundo (0.04)
- District of Columbia > Washington (0.04)
- New York > New York County
- New York City (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Washington > King County
- Seattle (0.04)
- Oceania > Australia
- South America > Peru (0.04)
- Genre:
- Research Report > New Finding (0.48)
- Industry:
- Government (0.92)
- Health & Medicine > Therapeutic Area (1.00)
- Law (0.67)
- Leisure & Entertainment > Games (0.67)
- Technology: