TRIDIS: A Comprehensive Medieval and Early Modern Corpus for HTR and NER

Aguilar, Sergio Torres

arXiv.org Artificial Intelligence 

This paper introduces TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts. TRIDIS aggregates multiple legacy collections (all published under open licenses) and incorporates large metadata descriptions. While prior publications referenced some portions of this corpus, here we provide a unified overview with a stronger focus on its constitution. We describe (i) the narrative, chronological, and editorial background of each major sub-corpus, (ii) its semi-diplomatic transcription rules (expansion, normalization, punctuation), (iii) a strategy for challenging out-of-domain test splits driven by outlier detection in a joint embedding space, and (iv) preliminary baseline experiments using TrOCR and MiniCPM-Llama3-V 2.5 comparing random and outlier-based test partitions. Overall, TRIDIS is designed to stimulate joint robust Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research across medieval and early modern textual heritage.
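
To make item (iii) concrete, the following is a minimal sketch of an outlier-driven test split. The abstract does not say which outlier detector or embedding model is used, so this sketch assumes the simplest possible criterion: items whose embedding lies farthest from the corpus centroid go to the test set. The function name `outlier_split`, the `test_fraction` parameter, and the toy vectors are all hypothetical and are not taken from the TRIDIS release.

```python
import math


def outlier_split(embeddings, test_fraction=0.2):
    """Split item indices into (train, test) by distance from the centroid.

    ASSUMPTION: centroid distance stands in for whatever outlier detector
    the corpus authors actually use; it is only an illustration.
    """
    if not embeddings:
        return [], []

    n = len(embeddings)
    dim = len(embeddings[0])

    # Mean vector of all embeddings.
    centroid = [sum(vec[d] for vec in embeddings) / n for d in range(dim)]

    # Euclidean distance of each item from the centroid.
    distances = [math.dist(vec, centroid) for vec in embeddings]

    # The most distant items form the out-of-domain test set (at least one).
    n_test = max(1, round(n * test_fraction))
    ranked = sorted(range(n), key=lambda i: distances[i], reverse=True)

    test_idx = sorted(ranked[:n_test])
    train_idx = sorted(ranked[n_test:])
    return train_idx, test_idx


# Toy data: four tightly clustered vectors and one clear outlier (index 4).
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
train_idx, test_idx = outlier_split(toy_embeddings, test_fraction=0.2)
print("train:", train_idx)
print("test:", test_idx)
```

A random split would instead draw the test indices uniformly, so comparing the two partitions shows how much a model's error rate depends on the test material resembling the training material.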