Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information

Crespo, Maria Clara Ramos Morales; Rocha, Maria Lina de Souza Jeannine; Sturzeneker, Mariana Lourenço; Serras, Felipe Ribas; de Mello, Guilherme Lamartine; Costa, Aline Silva; Palma, Mayara Feliciano; Mesquita, Renata Morais; Guets, Raquel de Paula; da Silva, Mariana Marques; Finger, Marcelo; de Sousa, Maria Clara Paixão; Namiuti, Cristiane; Monte, Vanessa Martins do

Mar-28-2023, arXiv.org Artificial Intelligence

This paper presents the first publicly available version of the Carolina Corpus and discusses its future directions. Carolina is a large open corpus of Brazilian Portuguese texts, under construction using a web-as-corpus methodology enhanced with provenance, typology, versioning, and text integrality. The corpus is intended to serve both as a reliable source for research in Linguistics and as an important resource for Computer Science research on language models, helping to remove Portuguese from the set of low-resource languages. Here we present the corpus construction methodology, comparing it with other existing methodologies, as well as the corpus's current state: Carolina's first public version has 653,322,577 tokens, distributed over 7 broad types. Each text is annotated with several different metadata categories in its header, which we developed using TEI annotation standards. We also present ongoing derivative works and invite NLP researchers to contribute their own.

artificial intelligence, corpus, natural language


Country:
- South America
  - Colombia > Meta Department
    - Villavicencio (0.04)
  - Brazil
    - São Paulo (0.05)
    - Rio Grande do Sul > Porto Alegre (0.04)
- North America > United States
  - Minnesota > Hennepin County > Minneapolis (0.14)
- Europe
  - Portugal (0.04)
  - Netherlands > South Holland
    - Leiden (0.04)

Genre:
- Research Report (0.40)

Technology:
- Information Technology > Artificial Intelligence > Natural Language (1.00)
