Resurrecting saturated LLM benchmarks with adversarial encoding
Multiple-choice benchmarks show that Large Language Models (LLMs) excel in many knowledge domains. While LLMs often surpass human performance, recent studies reveal important limitations. For example, the GSM-Symbolic benchmark (Mirzadeh et al., 2024) shows that minor changes in mathematical questions significantly worsen model performance. This suggests LLMs rely on pattern-matching rather than formal reasoning, making them struggle with unfamiliar problem formats. LLMs may also show inconsistent factual recall, performing better under some conditions than others. For example, they often perform worse when presented with multiple tasks simultaneously (Wang, Kodner, & Rambow, 2024). We examine LLM knowledge robustness by testing how well models answer paired questions from multiple-choice benchmarks, and use the identified weaknesses to create a more challenging version of the MMLU benchmark.
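The abstract only outlines the paired-question protocol, so the sketch below is one illustrative reading of it, not the paper's implementation: it combines two multiple-choice questions into a single prompt and counts a pair as correct only when the model answers both right. The prompt wording, the `format_pair` helper, and the `model_answer` callback (a stand-in for whatever LLM API is under test) are all assumptions.

```python
# Hypothetical sketch of a paired-question consistency probe.
# model_answer(prompt) -> (letter1, letter2) is assumed to wrap
# the LLM being evaluated; it is not defined by the paper.

from itertools import combinations

LETTERS = "ABCDEFGH"


def format_pair(q1, q2):
    """Merge two multiple-choice questions into one prompt (assumed format)."""
    def fmt(q, tag):
        opts = "\n".join(f"{LETTERS[i]}. {o}" for i, o in enumerate(q["options"]))
        return f"Question {tag}: {q['question']}\n{opts}"
    return (fmt(q1, "1") + "\n\n" + fmt(q2, "2")
            + "\n\nAnswer both questions with two letters, e.g. 'A, C'.")


def paired_accuracy(questions, model_answer):
    """Fraction of question pairs where both answers are correct.

    A drop relative to single-question accuracy would indicate the
    kind of multi-task fragility the abstract describes.
    """
    pairs = list(combinations(questions, 2))
    correct = 0
    for q1, q2 in pairs:
        a1, a2 = model_answer(format_pair(q1, q2))
        if a1 == q1["answer"] and a2 == q2["answer"]:
            correct += 1
    return correct / len(pairs) if pairs else 0.0
```

Comparing `paired_accuracy` against per-question accuracy on the same items is one simple way to quantify the robustness gap the authors exploit when hardening MMLU.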
arXiv.org Artificial Intelligence
Feb-10-2025