TravelBench: Exploring LLM Performance in Low-Resource Domains
Billa, Srinivas, Jing, Xiaonan
arXiv.org Artificial Intelligence
Results on existing LLM benchmarks reveal little about model capabilities in low-resource tasks, making it difficult to develop effective solutions in these domains. To address this gap, we curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymised data from real-world scenarios, and analysed performance across LLMs. We report on the accuracy, scaling behaviour, and reasoning capabilities of LLMs across a variety of tasks. Our results confirm that general benchmarking results are insufficient for understanding model performance in low-resource tasks. Regardless of the amount of training FLOPs, out-of-the-box LLMs hit performance bottlenecks in complex, domain-specific scenarios. Furthermore, reasoning provides a more significant boost for smaller LLMs by making the model a better judge on certain tasks.
Oct-6-2025
- Country:
- Asia > Middle East
- Jordan (0.04)
- Europe
- Monaco (0.04)
- United Kingdom (0.04)
- North America > United States
- Massachusetts > Middlesex County
- Newton (0.04)
- Michigan > Washtenaw County
- Ann Arbor (0.04)
- New Mexico > Bernalillo County
- Albuquerque (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- South America > Chile
- Genre:
- Research Report > New Finding (0.68)
- Industry:
- Consumer Products & Services > Travel (1.00)
- Technology: