BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages

Myung, Junho, Lee, Nayeon, Zhou, Yi, Jin, Jiho, Putri, Rifki Afina, Antypas, Dimosthenis, Borkakoty, Hsuvas, Kim, Eunsu, Perez-Almendros, Carla, Ayele, Abinew Ali, Gutiérrez-Basulto, Víctor, Ibáñez-García, Yazmín, Lee, Hwaran, Muhammad, Shamsuddeen Hassan, Park, Kiwoong, Rzayev, Anar Sabuhi, White, Nina, Yimam, Seid Muhie, Pilehvar, Mohammad Taher, Ousidhoum, Nedjma, Camacho-Collados, Jose, Oh, Alice

Jun-14-2024–arXiv.org Artificial Intelligence

Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food people eat for their birthday celebrations, the spices they typically use, the musical instruments youngsters play, or the sports they practice in school is common cultural knowledge but uncommon in easily collected online sources, especially for underrepresented cultures. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We construct the benchmark to include two formats of questions: short-answer and multiple-choice. We show that LLMs perform better for cultures that are highly represented online, with a maximum 57.34% difference in GPT-4, the best-performing model, in the short-answer format. For cultures represented by mid-to-high-resource languages, LLMs perform better in their local languages, but for cultures represented by low-resource languages, LLMs perform better in English than in the local languages. We make our dataset publicly available at: https://github.com/nlee0212/BLEnD.
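The "maximum difference" between cultures reported in the abstract can be read as the gap between the best and worst per-country accuracy of a single model. The following is only a minimal sketch of that arithmetic, not the authors' evaluation code: the record layout (a country label paired with a correctness flag), the function names, and the toy data with placeholder labels "A" and "B" are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_country(records):
    """Return {country: accuracy in percent}.

    `records` is a hypothetical iterable of (country, is_correct) pairs,
    one per answered question; this layout is an assumption, not the
    format of the released dataset.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for country, is_correct in records:
        totals[country] += 1
        correct[country] += int(is_correct)
    return {c: 100.0 * correct[c] / totals[c] for c in totals}


def max_gap(acc):
    """Return (best country, worst country, accuracy gap in points)."""
    best = max(acc, key=acc.get)
    worst = min(acc, key=acc.get)
    return best, worst, acc[best] - acc[worst]


# Made-up toy records with placeholder country labels, for illustration only.
toy = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", False), ("B", True), ("B", False), ("B", False),
]

acc = accuracy_by_country(toy)
best, worst, gap = max_gap(acc)
print(f"best={best} worst={worst} gap={gap:.2f} points")
```

With real model outputs, each record would first need a correctness judgment against the annotated answers; how that matching is done for the short-answer format is not specified in the abstract and is left out here.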
