LatinCy: Synthetic Trained Pipelines for Latin NLP

May 7, 2023, arXiv.org Artificial Intelligence

This paper introduces LatinCy, a set of trained general-purpose Latin-language "core" pipelines for use with the spaCy natural language processing framework (Honnibal and Montani, 2023). These are end-to-end pipelines that take plaintext Latin as input and perform basic NLP processing, including sentence segmentation, word tokenization, lemmatization, part-of-speech and morphological tagging, dependency parsing, and named entity recognition (NER). Three models have been trained so far, named according to spaCy conventions: la_core_web_sm, la_core_web_md, and la_core_web_lg. Here 'la' is the language code for Latin; 'core' indicates a pipeline that includes all of the components named above, specifically including NER; 'web' refers to the nature of the training data, specifically that the model is trained primarily on Universal Dependencies treebanks; and 'sm', 'md', and 'lg' refer to the "size" (small, medium, or large) of the models, with the 'md' and 'lg' models being larger because they include subword vectors that describe the vocabulary, while the 'sm' models do not. The current default pipeline consists of the following spaCy components: 'tagger', 'morphologizer', 'trainable_lemmatizer' (i.e., the EditTreeLemmatizer based on Müller et al., 2015),
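The naming convention and basic usage described above can be illustrated with a minimal Python sketch. The names `PipelineName`, `parse_pipeline_name`, and `annotate` are hypothetical helpers introduced here for illustration and are not part of LatinCy or spaCy. The `annotate` function assumes that spaCy and the named model package are already installed in the environment; it uses only the standard `spacy.load` call and standard token attributes.

```python
from dataclasses import dataclass

# Size suffixes as described in the abstract.
SIZES = {"sm": "small", "md": "medium", "lg": "large"}


@dataclass(frozen=True)
class PipelineName:
    """Hypothetical container for the four parts of a spaCy-style model name."""
    lang: str   # language code, e.g. 'la' for Latin
    kind: str   # e.g. 'core' (all components, including NER)
    genre: str  # e.g. 'web' (nature of the training data)
    size: str   # 'sm', 'md', or 'lg'

    @property
    def has_vectors(self) -> bool:
        # Per the abstract, 'md' and 'lg' include subword vectors; 'sm' does not.
        return self.size in ("md", "lg")


def parse_pipeline_name(name: str) -> PipelineName:
    """Split a name such as 'la_core_web_sm' into its four components."""
    parts = name.split("_")
    if len(parts) != 4 or parts[3] not in SIZES:
        raise ValueError(f"not a recognized pipeline name: {name!r}")
    return PipelineName(*parts)


def annotate(text: str, model_name: str = "la_core_web_sm"):
    """Run a pipeline over Latin text and return per-token annotations.

    Assumption: spaCy and the model package `model_name` are installed.
    """
    import spacy  # imported lazily so the helpers above work without spaCy

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return [
        (tok.text, tok.lemma_, tok.pos_, str(tok.morph), tok.dep_)
        for tok in doc
    ]


if __name__ == "__main__":
    info = parse_pipeline_name("la_core_web_lg")
    print(info.lang, info.kind, info.genre, SIZES[info.size], info.has_vectors)
```

With a model installed, a call such as `annotate("Gallia est omnis divisa in partes tres.")` would return one tuple per token holding its text, lemma, part-of-speech tag, morphological features, and dependency label; named entities and sentence boundaries are available separately through `doc.ents` and `doc.sents`.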



