When Many-Shot Prompting Fails: An Empirical Study of LLM Code Translation
Amirkia Rafiei Oskooei, Kaan Baturalp Cosdan, Husamettin Isiktas, Mehmet S. Aktas
Large Language Models (LLMs) with vast context windows offer new avenues for in-context learning (ICL), where providing many examples ("many-shot" prompting) is often assumed to enhance performance. We investigate this assumption for the complex task of code translation. Through a large-scale empirical study of over 90,000 translations, we systematically evaluate the impact of scaling in-context examples from zero-shot to many-shot configurations of up to 625 examples, with prompts spanning from approximately 100,000 to 800,000 tokens. Our findings reveal a "many-shot paradox": while static similarity metrics may modestly improve with more examples, functional correctness consistently peaks with few-shot prompting (5-25 examples). Providing substantially more examples often degrades this crucial functional performance. This study highlights that for code translation, the quality of a few well-chosen examples outweighs sheer quantity, challenging the universal efficacy of "more is better" for ICL and underscoring the task-dependent nature of optimal prompting strategies. Our results have significant implications for effectively leveraging LLMs in software engineering.
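To make the shot-scaling setup concrete, below is a minimal illustrative sketch of how a k-shot code-translation prompt might be assembled and swept across shot counts. This is not the authors' actual harness: the Java-to-Python direction, the example pairs, the shot counts other than 5, 25, and 625, and the `call_llm` / `run_unit_tests` stubs are all hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class ExamplePair:
    """One in-context example: a source snippet and its reference translation."""
    source: str  # e.g., Java
    target: str  # e.g., Python

def build_prompt(examples: list[ExamplePair], query_code: str, k: int) -> str:
    """Assemble a k-shot prompt: k worked translations followed by the query."""
    parts = ["Translate the following Java code to Python.\n"]
    for i, ex in enumerate(examples[:k], 1):
        parts.append(f"### Example {i}\nJava:\n{ex.source}\nPython:\n{ex.target}\n")
    parts.append(f"### Task\nJava:\n{query_code}\nPython:\n")
    return "\n".join(parts)

def run_shot_sweep(examples, query_code, shot_counts=(0, 5, 25, 125, 625)):
    """Build one prompt per shot count. A real study would send each prompt to
    an LLM and score the output for functional correctness (e.g., unit tests)
    alongside static similarity metrics."""
    for k in shot_counts:
        prompt = build_prompt(examples, query_code, k)
        # Crude whitespace-token proxy for prompt size; the paper reports
        # roughly 100,000 to 800,000 tokens for its largest configurations.
        approx_tokens = len(prompt.split())
        print(f"k={k:>3} shots -> ~{approx_tokens} whitespace tokens")
        # response = call_llm(prompt)          # hypothetical model call
        # passed = run_unit_tests(response)    # hypothetical functional check

if __name__ == "__main__":
    demo = [ExamplePair("int add(int a, int b) { return a + b; }",
                        "def add(a, b):\n    return a + b")] * 625
    run_shot_sweep(demo, "int square(int x) { return x * x; }")
```

Under this kind of sweep, the paper's finding is that functional correctness peaks around the small shot counts (5 to 25 examples) even though the many-shot prompts fit comfortably within the model's context window.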
arXiv.org Artificial Intelligence
Dec-10-2025
- Country:
- Asia > Middle East
- Republic of Türkiye > Istanbul Province > Istanbul (0.05)
- North America > United States
- New York > New York County > New York City (0.04)
- South America > Brazil
- Rio de Janeiro > Rio de Janeiro (0.06)
- Genre:
- Research Report > New Finding (0.87)
- Technology: