LatinCy: Synthetic Trained Pipelines for Latin NLP

Burns, Patrick J.

arXiv.org Artificial Intelligence 

This paper introduces LatinCy, a set of trained general-purpose Latin-language "core" pipelines for use with the spaCy natural language processing framework (Honnibal and Montani, 2023). These are end-to-end pipelines that take plaintext Latin as input and perform basic NLP processing, including sentence segmentation, word tokenization, lemmatization, part-of-speech and morphological tagging, dependency parsing, and named entity recognition (NER). Three models have been trained so far, named according to spaCy conventions: la_core_web_sm, la_core_web_md, and la_core_web_lg. In these names, 'la' is the language code for Latin; 'core' indicates a pipeline that includes all of the components named above, specifically including NER; 'web' refers to the nature of the training data, specifically that the model is trained primarily on Universal Dependency treebanks; and 'sm', 'md', and 'lg' refer to the "size" of the models (small, medium, or large), with the 'md' and 'lg' models being larger because they include subword vectors that describe the vocabulary, while the 'sm' models do not. The current default pipeline consists of the following spaCy components: 'tagger', 'morphologizer', 'trainable_lemmatizer' (i.e., the EditTreeLemmatizer based on Müller et al., 2015),
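The naming convention described above can be illustrated with a minimal Python sketch. The helper names (`PipelineName`, `parse_pipeline_name`, `includes_vectors`) are hypothetical and are not part of spaCy or LatinCy; the sketch only splits a model name into the four fields the abstract describes and encodes the stated rule that the 'md' and 'lg' models include vectors while 'sm' does not.

```python
from typing import NamedTuple

# Size codes as described in the abstract: small, medium, large.
SIZE_LABELS = {"sm": "small", "md": "medium", "lg": "large"}


class PipelineName(NamedTuple):
    """Hypothetical container for the four parts of a spaCy-style model name."""
    lang: str    # language code, e.g. 'la' for Latin
    kind: str    # pipeline type, e.g. 'core' (all components, including NER)
    genre: str   # nature of the training data, e.g. 'web'
    size: str    # 'sm', 'md', or 'lg'


def parse_pipeline_name(name: str) -> PipelineName:
    """Split a name such as 'la_core_web_sm' into its four fields."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected four underscore-separated parts, got {name!r}")
    lang, kind, genre, size = parts
    if size not in SIZE_LABELS:
        raise ValueError(f"unknown size code {size!r} in {name!r}")
    return PipelineName(lang, kind, genre, size)


def includes_vectors(name: str) -> bool:
    """Per the abstract, 'md' and 'lg' models include vectors; 'sm' does not."""
    return parse_pipeline_name(name).size in {"md", "lg"}


if __name__ == "__main__":
    for model in ("la_core_web_sm", "la_core_web_md", "la_core_web_lg"):
        parsed = parse_pipeline_name(model)
        print(model, SIZE_LABELS[parsed.size], includes_vectors(model))
```

Assuming one of the models has been installed in the local environment, it would typically be loaded by name with `spacy.load("la_core_web_lg")`, after which the returned pipeline can be applied to a string of Latin text.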
