The #Somos600M Project: Generating NLP resources that represent the diversity of the languages from LATAM, the Caribbean, and Spain

Jul-1-2024, arXiv.org Artificial Intelligence

We are 600 million Spanish speakers. We launched the #Somos600M Project because the diversity of the languages from LATAM, the Caribbean, and Spain needs to be represented in Artificial Intelligence (AI) systems. Although we make up 7.5% of the world population, there is no open dataset to instruction-tune large language models (LLMs) in our languages, nor a leaderboard to evaluate and compare them. In this paper, we present how we, as an international open-source community, created the first versions of the instruction and evaluation datasets, which are indispensable resources for the advancement of Natural Language Processing (NLP) in our languages.

dataset, leaderboard, lenguajenaturalai

Country:
- South America
  - Peru (0.14)
  - Chile > Santiago Metropolitan Region
    - Santiago Province > Santiago (0.04)
- North America
  - Montserrat (0.04)
  - Central America (0.04)
  - United States > New Mexico
    - Santa Fe County > Santa Fe (0.04)
  - Canada > Ontario
    - Toronto (0.04)
- Europe
  - Slovenia (0.04)
  - Spain > Community of Madrid
    - Madrid (0.05)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
  - Croatia > Dubrovnik-Neretva County
    - Dubrovnik (0.04)
- Asia > Middle East
  - UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)

Genre:
- Research Report (0.40)

Industry:
- Education (0.94)
- Health & Medicine (0.93)
- Government (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)
