How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation

Chopra, Muskaan; Sparrenberg, Lorenz; Khanna, Sarthak; Sifa, Rafet

Nov-14-2025, arXiv.org Artificial Intelligence

Abstract: Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC = 0.77 with F1-ERR = 0.98 on SynCED-EnDe 2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). In contrast, ultra-small models (< 0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs, augmented with lightweight calibration and small-sample supervision, can deliver trustworthy, on-device critical error detection (CED) for MT, enabling private, low-cost error screening in real-world translation pipelines.
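
For readers unfamiliar with the metrics named in the abstract, the following is a minimal Python sketch of majority voting over several model outputs, followed by MCC and per-class F1 for a binary error/no-error task. It is not the authors' implementation. The label strings "ERR" and "NOT" are inferred from the metric names F1-ERR and F1-NOT, and the function names and the tie-breaking rule (ties resolve to "ERR") are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def majority_vote(labels, tie_break="ERR"):
    """Return the most frequent label; on a tie, fall back to tie_break.

    The tie-break choice is an assumption, not taken from the paper.
    """
    ranked = Counter(labels).most_common()
    best_count = ranked[0][1]
    winners = [label for label, count in ranked if count == best_count]
    return winners[0] if len(winners) == 1 else tie_break


def f1(tp, fp, fn):
    """F1 from counts: 2*TP / (2*TP + FP + FN); 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def ced_metrics(gold, pred, positive="ERR"):
    """Compute MCC, F1-ERR and F1-NOT for binary gold/pred label lists."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)

    # Matthews correlation coefficient; defined as 0.0 when any margin is empty.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "MCC": mcc,
        "F1-ERR": f1(tp, fp, fn),  # error class treated as positive
        "F1-NOT": f1(tn, fn, fp),  # no-error class treated as positive
    }


if __name__ == "__main__":
    # Three hypothetical sampled outputs per segment, aggregated by vote.
    votes = [["ERR", "NOT", "ERR"], ["NOT", "NOT", "ERR"]]
    predictions = [majority_vote(v) for v in votes]
    print(predictions)
    print(ced_metrics(["ERR", "NOT"], predictions))
```

MCC uses all four cells of the confusion matrix, so it remains informative when the two classes are imbalanced, which is presumably why it is reported alongside the per-class F1 scores.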

large language model, machine learning, natural language

Country:
- North America > Mexico (0.28)
- Europe > Austria (0.28)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Machine Translation (1.00)
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
