Scaling Sign Language Translation
–Neural Information Processing Systems
Sign language translation (SLT) addresses the problem of translating information from a sign language in video to a spoken language in text. Existing studies, while showing progress, are often limited to narrow domains and/or few sign languages and struggle with open-domain tasks. In this paper, we push forward the frontier of SLT by scaling pretraining data, model size, and number of translation directions. We perform large-scale SLT pretraining on different data including 1) noisy multilingual YouTube SLT data, 2) parallel text corpora, and 3) SLT data augmented by translating video captions to other languages with off-the-shelf machine translation models. We unify different pretraining tasks with task-specific prompts under the encoder-decoder architecture, and initialize the SLT model with pretrained (m/By)T5 models across model sizes. SLT pretraining results on How2Sign and FLEURS-ASL#0 (ASL to 42 spoken languages) demonstrate the significance of data/model scaling and cross-lingual cross-modal transfer, as well as the feasibility of zero-shot SLT. We finetune the pretrained SLT models on 5 downstream open-domain SLT benchmarks covering 5 sign languages. Experiments show substantial quality improvements over the vanilla baselines, surpassing the previous state-of-the-art (SOTA) by wide margins.
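The abstract describes unifying heterogeneous pretraining tasks (video-to-text SLT and text-to-text MT) with task-specific prompts under one encoder-decoder interface. A minimal sketch of how such prompt unification might look is below; the template strings, task names, and the video-placeholder convention are illustrative assumptions, not the paper's actual format.

```python
# Hedged sketch: unifying SLT and MT pretraining examples with
# task-specific prompts so that one encoder-decoder model can consume
# both. Templates and task names here are hypothetical.

def make_prompt(task: str, src_lang: str, tgt_lang: str) -> str:
    """Return a task-specific prefix for the encoder input."""
    templates = {
        # video-to-text sign language translation
        "slt": f"translate {src_lang} sign video to {tgt_lang} text: ",
        # text-to-text machine translation on parallel corpora
        "mt": f"translate {src_lang} text to {tgt_lang} text: ",
    }
    return templates[task]

# A mixed pretraining batch: an SLT clip (represented here by a
# placeholder token) and an MT sentence pair share one input format.
batch = [
    ("slt", "ASL", "English", "<video:clip_001>"),
    ("mt", "German", "English", "Hallo Welt."),
]
examples = [make_prompt(t, s, g) + x for t, s, g, x in batch]
for e in examples:
    print(e)
```

In practice the video side would be a sequence of visual features rather than a text token, but the prompt-prefix idea carries over directly from T5-style text-to-text training.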