orthography
IASC: Interactive Agentic System for ConLangs
Taguchi, Chihiro, Sproat, Richard
We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is 'translated' from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the 'translated' sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs 'know' about language-not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. https://github.com/SakanaAI/IASC
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France (0.04)
- (15 more...)
- Research Report > New Finding (1.00)
- Instructional Material (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities
Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR, addressing contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs, finding that Gemini and Qwen models outperform traditional OCR while exhibiting over-historicization: inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.
- Asia > Russia (0.28)
- Europe > Russia (0.14)
- Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
- (4 more...)
Linguistically Informed Tokenization Improves ASR for Underresourced Languages
Daul, Massimo, Tosolini, Alessio, Bowern, Claire
Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR's viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
- Europe > Netherlands > Gelderland > Arnhem (0.04)
- Oceania > Australia > Northern Territory (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (2 more...)
Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak'wala Legacy Texts
Agarwal, Milind, Rosenblum, Daisy, Anastasopoulos, Antonios
Kwak'wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak'wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak'wala text, along with post-correction models, to produce a final high-quality transcription.
- North America > United States (0.46)
- North America > Canada > British Columbia (0.25)
- Europe > Netherlands > South Holland > Leiden (0.05)
- (11 more...)
- Research Report (0.50)
- Overview (0.47)
Clarifying orthography: Orthographic transparency as compressibility
Torres, Charles J., Futrell, Richard
Orthographic transparency -- how directly spelling is related to sound -- lacks a unified, script-agnostic metric. Using ideas from algorithmic information theory, we quantify orthographic transparency in terms of the mutual compressibility between orthographic and phonological strings. Our measure provides a principled way to combine two factors that decrease orthographic transparency, capturing both irregular spellings and rule complexity in one quantity. We estimate our transparency measure using prequential code-lengths derived from neural sequence models. Evaluating 22 languages across a broad range of script types (alphabetic, abjad, abugida, syllabic, logographic) confirms common intuitions about relative transparency of scripts. Mutual compressibility offers a simple, principled, and general yardstick for orthographic transparency.
- Europe > Germany > Saxony > Leipzig (0.04)
- North America > United States > California > Orange County > Irvine (0.04)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- (2 more...)
Jochre 3 and the Yiddish OCR corpus
Urieli, Assaf, Clooney, Amber, Sigiel, Michelle, Leyfer, Grisha
We describe the construction of a publicly available Yiddish OCR Corpus, and describe and evaluate the open source OCR tool suite Jochre 3, including an Alto editor for corpus annotation, OCR software for Alto OCR layer generation, and a customizable OCR search engine. The current version of the Yiddish OCR corpus contains 658 pages, 186K tokens and 840K glyphs. The Jochre 3 OCR tool uses various fine-tuned YOLOv8 models for top-down page layout analysis, and a custom CNN network for glyph recognition. It attains a CER of 1.5% on our test corpus, far out-performing all other existing public models for Yiddish. We analyzed the full 660M word Yiddish Book Center with Jochre 3 OCR, and the new OCR is searchable through the Yiddish Book Center OCR search engine.
- North America > United States > New York (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (9 more...)
Comparative analysis of optical character recognition methods for S\'ami texts from the National Library of Norway
Enstad, Tita, Trosterud, Trond, Røsok, Marie Iversdatter, Beyer, Yngvil, Roald, Marie
Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the S\'ami documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in S\'ami languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing S\'ami texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for S\'ami languages, even with a moderate amount of manually annotated data.
- Europe > Norway (0.71)
- North America > United States (0.69)
PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model
This paper presents PolyIPA, a novel multilingual phoneme-to-grapheme conversion model designed for multilingual name transliteration, onomastic research, and information retrieval. The model leverages two helper models developed for data augmentation: IPA2vec for finding soundalikes across languages, and similarIPA for handling phonetic notation variations. Evaluated on a test set that spans multiple languages and writing systems, the model achieves a mean Character Error Rate of 0.055 and a character-level BLEU score of 0.914, with particularly strong performance on languages with shallow orthographies. The implementation of beam search further improves practical utility, with top-3 candidates reducing the effective error rate by 52.7\% (to CER: 0.026), demonstrating the model's effectiveness for cross-linguistic applications.
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Croatia > Zagreb County > Zagreb (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
NoLoR: An ASR-Based Framework for Expedited Endangered Language Documentation with Neo-Aramaic as a Case Study
The documentation of the Neo-Aramaic dialects before their extinction has been described as the most urgent task in all of Semitology today. The death of this language will be an unfathomable loss to the descendents of the indigenous speakers of Aramaic, now predominantly diasporic after forced displacement due to violence. This paper develops an ASR model to expedite the documentation of this endangered language and generalizes the strategy in a new framework we call NoLoR.
- Asia > Armenia (0.05)
- Asia > Middle East > Republic of Türkiye (0.04)
- Asia > Middle East > Iraq (0.04)
- (7 more...)
Correcting FLORES Evaluation Dataset for Four African Languages
Abdulmumin, Idris, Mkhwanazi, Sthembiso, Mbooi, Mahlatse S., Muhammad, Shamsuddeen Hassan, Ahmad, Ibrahim Said, Putini, Neo, Mathebula, Miehleketo, Shingange, Matimba, Gwadabe, Tajuddeen, Marivate, Vukosi
This paper describes the corrections made to the FLORES evaluation (dev and devtest) dataset for four African languages, namely Hausa, Northern Sotho (Sepedi), Xitsonga and isiZulu. The original dataset, though groundbreaking in its coverage of low-resource languages, exhibited various inconsistencies and inaccuracies in the reviewed languages that could potentially hinder the integrity of the evaluation of downstream tasks in natural language processing (NLP), especially machine translation. Through a meticulous review process by native speakers, several corrections were identified and implemented, improving the dataset's overall quality and reliability. For each language, we provide a concise summary of the errors encountered and corrected, and also present some statistical analysis that measure the difference between the existing and corrected datasets. We believe that our corrections enhance the linguistic accuracy and reliability of the data and, thereby, contributing to more effective evaluation of NLP tasks involving the four African languages.
- Africa > West Africa (0.04)
- Africa > Southern Africa (0.04)
- Africa > South Africa > Gauteng > Pretoria (0.04)
- (13 more...)