Resurrecting saturated LLM benchmarks with adversarial encoding
Multiple-choice benchmarks show that Large Language Models (LLMs) excel in many knowledge domains. While LLMs often surpass human performance, recent studies reveal important limitations. For example, the GSM-Symbolic benchmark (Mirzadeh et al., 2024) shows that minor changes in mathematical questions significantly worsen model performance. This suggests LLMs rely on pattern-matching rather than formal reasoning, making them struggle with unfamiliar problem formats. LLMs may also show inconsistent factual recall, performing better under some conditions than others. For example, they often perform worse when presented with multiple tasks simultaneously (Wang, Kodner, & Rambow, 2024). We examine LLM knowledge robustness by testing how well models answer paired questions from multiple-choice benchmarks, and use the identified weaknesses to create a more challenging version of the MMLU benchmark.
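The abstract only outlines the paired-question protocol, so the sketch below is one illustrative reading of it, not the paper's implementation: it combines two multiple-choice questions into a single prompt and counts a pair as correct only when the model answers both right. The prompt wording, the `format_pair` helper, and the `model_answer` callback (a stand-in for whatever LLM API is under test) are all assumptions.

```python
# Hypothetical sketch of a paired-question consistency probe.
# model_answer(prompt) -> (letter1, letter2) is assumed to wrap
# the LLM being evaluated; it is not defined by the paper.

from itertools import combinations

LETTERS = "ABCDEFGH"


def format_pair(q1, q2):
    """Merge two multiple-choice questions into one prompt (assumed format)."""
    def fmt(q, tag):
        opts = "\n".join(f"{LETTERS[i]}. {o}" for i, o in enumerate(q["options"]))
        return f"Question {tag}: {q['question']}\n{opts}"
    return (fmt(q1, "1") + "\n\n" + fmt(q2, "2")
            + "\n\nAnswer both questions with two letters, e.g. 'A, C'.")


def paired_accuracy(questions, model_answer):
    """Fraction of question pairs where both answers are correct.

    A drop relative to single-question accuracy would indicate the
    kind of multi-task fragility the abstract describes.
    """
    pairs = list(combinations(questions, 2))
    correct = 0
    for q1, q2 in pairs:
        a1, a2 = model_answer(format_pair(q1, q2))
        if a1 == q1["answer"] and a2 == q2["answer"]:
            correct += 1
    return correct / len(pairs) if pairs else 0.0
```

Comparing `paired_accuracy` against per-question accuracy on the same items is one simple way to quantify the robustness gap the authors exploit when hardening MMLU.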
arXiv.org Artificial Intelligence
Feb-10-2025