Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking
Fung, Yi, Zhao, Ruining, Doo, Jae, Sun, Chenkai, Ji, Heng
arXiv.org Artificial Intelligence
Pretrained large language models have revolutionized many applications but still face challenges related to cultural bias and a lack of the cultural commonsense knowledge that is crucial for guiding cross-cultural communication and interaction. Recognizing the shortcomings of existing methods in capturing the diverse and rich cultures across the world, this paper introduces a novel approach for massively multicultural knowledge acquisition. Specifically, our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages. Using this data source, we construct the CultureAtlas dataset, which covers a wide range of sub-country-level geographical regions and ethnolinguistic groups, with data cleaning and preprocessing to ensure that each textual assertion sentence is self-contained, as well as fine-grained extraction of cultural profile information. Our dataset not only facilitates the evaluation of language model performance in culturally diverse contexts but also serves as a foundational tool for the development of culturally sensitive and aware language models. Our work marks an important step towards a deeper understanding of cultural disparities in AI and towards bridging them, to promote a more inclusive and balanced representation of global cultures in the digital domain.
Feb-14-2024
- Country:
- Africa
- Togo (0.04)
- Lesotho (0.04)
- Burkina Faso (0.04)
- Sudan (0.14)
- Kenya (0.04)
- Cabo Verde (0.04)
- South Africa (0.04)
- Botswana (0.04)
- Central African Republic (0.04)
- Ghana (0.04)
- Zimbabwe (0.04)
- Uganda (0.04)
- Tanzania (0.04)
- South Sudan (0.04)
- Liberia (0.04)
- Angola (0.04)
- Cameroon (0.04)
- Burundi (0.04)
- Guinea-Bissau (0.04)
- Benin (0.04)
- Zambia (0.04)
- Asia
- Brunei (0.14)
- Cambodia (0.04)
- Malaysia (0.04)
- Macao (0.14)
- North Korea (0.04)
- Kazakhstan (0.04)
- Uzbekistan (0.04)
- Bangladesh (0.04)
- India > Rajasthan
- Kota (0.04)
- Bhutan (0.05)
- Azerbaijan (0.04)
- Vietnam (0.04)
- Japan (0.04)
- Indonesia (0.04)
- Laos (0.04)
- Middle East
- Bahrain (0.04)
- Republic of Türkiye (0.04)
- Iraq (0.04)
- Syria (0.04)
- Yemen (0.04)
- Iran (0.04)
- Kuwait > Capital Governorate
- Kuwait City (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Jordan (0.04)
- Lebanon (0.04)
- Israel (0.04)
- China
- Timor-Leste (0.04)
- Armenia (0.04)
- South Korea (0.04)
- Kyrgyzstan (0.04)
- Sri Lanka (0.04)
- Turkmenistan (0.04)
- Thailand (0.04)
- Singapore (0.05)
- Tajikistan (0.04)
- Afghanistan (0.04)
- Europe
- Belarus (0.04)
- Hungary (0.04)
- United Kingdom (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Sweden (0.04)
- Ukraine (0.04)
- Belgium (0.04)
- Lithuania (0.04)
- Latvia (0.04)
- Spain > Catalonia (0.04)
- Greece (0.04)
- Italy (0.04)
- France (0.04)
- Serbia (0.04)
- Albania (0.04)
- Switzerland (0.04)
- Iceland (0.04)
- Germany (0.04)
- Andorra (0.04)
- Liechtenstein (0.04)
- Bulgaria (0.04)
- Austria (0.04)
- Bosnia and Herzegovina (0.04)
- North America
- Haiti (0.04)
- The Bahamas (0.04)
- Canada
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Ontario > Toronto (0.04)
- United States
- Illinois (0.04)
- New York > New York County
- New York City (0.04)
- Jamaica (0.04)
- Barbados (0.04)
- Guatemala (0.04)
- Belize (0.04)
- Honduras (0.04)
- Antigua and Barbuda (0.04)
- Dominican Republic (0.04)
- Trinidad and Tobago (0.04)
- Oceania
- South America
- Genre:
- Research Report (1.00)