Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation

May-29-2025–arXiv.org Artificial Intelligence

Inference-time scaling via repeated sampling has shown promise in reasoning tasks, but its effectiveness in multilingual generation remains underexplored. We evaluate this approach using perplexity- and reward-based verifiers on two multilingual benchmarks: the Aya Evaluation Suite and m-ArenaHard. Our results show consistent quality improvements, with gains exceeding 35% in some cases. While perplexity-based scoring is effective for open-ended prompts, only reward-based verifiers improve performance on tasks requiring reasoning (e.g., math, code). Our results demonstrate the broader utility of repeated sampling for multilingual text generation and underscore the importance of selecting right verifiers for the task.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

May-29-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.15)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Chatbot (0.99)
    - Large Language Model (0.77)
  - Machine Learning > Neural Networks
    - Deep Learning (0.51)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found