Do Generalisation Results Generalise?
Matteo Boglioni, Andrea Sgobbi, Gabriel Tavernini, Francesco Rita, Marius Mosbach, Tiago Pimentel
–arXiv.org Artificial Intelligence
A large language model's (LLM's) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs' generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate a model's capabilities, as the data shifts encountered once a model is deployed are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model's performance across multiple OOD test sets throughout a finetuning run; we then evaluate the partial correlation of performances across these test sets, regressing out in-domain performance. This allows us to assess how correlated generalisation performances are once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD test sets depends strongly on the specific choice of model analysed.
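The abstract's central quantity is a partial correlation: the correlation between two OOD test-set performance series across finetuning checkpoints, after regressing out in-domain performance from each. A minimal sketch of that computation is below, using hypothetical checkpoint accuracies and function names (the paper's actual pipeline and data are not reproduced here):

```python
# Sketch of a partial-correlation analysis between two OOD test sets,
# controlling for in-domain performance. All data and names are
# illustrative, not the paper's.
import numpy as np

def residualise(y, x):
    """Residuals of y after OLS regression on x (with an intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def partial_correlation(ood_a, ood_b, in_domain):
    """Correlation of ood_a and ood_b with in_domain regressed out of both."""
    ra = residualise(np.asarray(ood_a, float), np.asarray(in_domain, float))
    rb = residualise(np.asarray(ood_b, float), np.asarray(in_domain, float))
    return np.corrcoef(ra, rb)[0, 1]

# Toy example: accuracies at successive finetuning checkpoints.
in_dom = [0.50, 0.60, 0.70, 0.80, 0.90]
ood_a  = [0.40, 0.50, 0.58, 0.66, 0.75]   # tracks in-domain closely
ood_b  = [0.45, 0.44, 0.46, 0.43, 0.47]   # roughly flat

print(partial_correlation(ood_a, ood_b, in_dom))
```

Regressing out the shared in-domain trend matters because both OOD series typically improve as the model trains, which inflates their raw correlation; the residual correlation isolates co-movement beyond that shared trend.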
Dec-9-2025