AITopics | orthography

Collaborating Authors

orthography

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

IASC: Interactive Agentic System for ConLangs

Taguchi, Chihiro, Sproat, Richard

arXiv.org Artificial IntelligenceOct-22-2025

We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is 'translated' from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the 'translated' sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs 'know' about language-not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. https://github.com/SakanaAI/IASC

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.07591

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Massachusetts (0.27)

Genre:

Research Report > New Finding (1.00)
Instructional Material (1.00)

Industry: Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities

Levchenko, Maria

arXiv.org Artificial IntelligenceOct-9-2025

Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR, addressing contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs, finding that Gemini and Qwen models outperform traditional OCR while exhibiting over-historicization: inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.06743

Country:

Europe > Italy (0.28)
Asia > Russia (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Linguistically Informed Tokenization Improves ASR for Underresourced Languages

Daul, Massimo, Tosolini, Alessio, Bowern, Claire

arXiv.org Artificial IntelligenceOct-9-2025

Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR's viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.06461

Country:

Europe (0.46)
North America > United States (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak'wala Legacy Texts

Agarwal, Milind, Rosenblum, Daisy, Anastasopoulos, Antonios

arXiv.org Artificial IntelligenceJun-3-2025

Kwak'wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak'wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak'wala text, along with post-correction models, to produce a final high-quality transcription.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2506.01775

Country:

Europe (1.00)
North America > United States (0.46)
North America > Canada > British Columbia (0.25)

Genre:

Research Report (0.50)
Overview (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.88)

Add feedback

Clarifying orthography: Orthographic transparency as compressibility

Torres, Charles J., Futrell, Richard

arXiv.org Artificial IntelligenceMay-21-2025

Orthographic transparency -- how directly spelling is related to sound -- lacks a unified, script-agnostic metric. Using ideas from algorithmic information theory, we quantify orthographic transparency in terms of the mutual compressibility between orthographic and phonological strings. Our measure provides a principled way to combine two factors that decrease orthographic transparency, capturing both irregular spellings and rule complexity in one quantity. We estimate our transparency measure using prequential code-lengths derived from neural sequence models. Evaluating 22 languages across a broad range of script types (alphabetic, abjad, abugida, syllabic, logographic) confirms common intuitions about relative transparency of scripts. Mutual compressibility offers a simple, principled, and general yardstick for orthographic transparency.

artificial intelligence, machine learning, transparency, (17 more...)

arXiv.org Artificial Intelligence

2505.13657

Country:

North America > United States > California (0.28)
Europe (0.28)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

A computational system to handle the orthographic layer of tajwid in contemporary Quranic Orthography

Martínez, Alicia González

arXiv.org Artificial IntelligenceMay-19-2025

Contemporary Quranic Orthography (CQO) relies on a precise system of phonetic notation that can be traced back to the early stages of Islam, when the Quran was mainly oral in nature and the first written renderings of it served as memory aids for this oral tradition. The early systems of diacritical marks created on top of the Quranic Consonantal Text (QCT) motivated the creation and further development of a fine-grained system of phonetic notation that represented tajwid-the rules of recitation. We explored the systematicity of the rules of tajwid, as they are encountered in the Cairo Quran, using a fully and accurately encoded digital edition of the Quranic text. For this purpose, we developed a python module that can remove or add the orthographic layer of tajwid from a Quranic text in CQO. The interesting characteristic of these two sets of rules is that they address the complete Quranic text of the Cairo Quran, so they can be used as precise witnesses to study its phonetic and prosodic processes. From a computational point of view, the text of the Cairo Quran can be used as a linchpin to align and compare Quranic manuscripts, due to its richness and completeness. This will let us create a very powerful framework to work with the Arabic script, not just within an isolated text, but automatically exploring a specific textual phenomenon in other connected manuscripts. Having all the texts mapped among each other can serve as a powerful tool to study the nature of the notation systems of diacritics added to the consonantal skeleton.

artificial intelligence, expert system, word boundary, (18 more...)

arXiv.org Artificial Intelligence

2505.11379

Country:

Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.67)
Asia > Middle East > Oman (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.34)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.34)

Add feedback

Jochre 3 and the Yiddish OCR corpus

Urieli, Assaf, Clooney, Amber, Sigiel, Michelle, Leyfer, Grisha

arXiv.org Artificial IntelligenceJan-14-2025

We describe the construction of a publicly available Yiddish OCR Corpus, and describe and evaluate the open source OCR tool suite Jochre 3, including an Alto editor for corpus annotation, OCR software for Alto OCR layer generation, and a customizable OCR search engine. The current version of the Yiddish OCR corpus contains 658 pages, 186K tokens and 840K glyphs. The Jochre 3 OCR tool uses various fine-tuned YOLOv8 models for top-down page layout analysis, and a custom CNN network for glyph recognition. It attains a CER of 1.5% on our test corpus, far out-performing all other existing public models for Yiddish. We analyzed the full 660M word Yiddish Book Center with Jochre 3 OCR, and the new OCR is searchable through the Yiddish Book Center OCR search engine.

artificial intelligence, natural language, text processing, (16 more...)

arXiv.org Artificial Intelligence

2501.08442

Country:

Europe (1.00)
North America > United States > Massachusetts (0.14)

Genre: Research Report (1.00)

Industry: Media (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

Add feedback

Comparative analysis of optical character recognition methods for S\'ami texts from the National Library of Norway

Enstad, Tita, Trosterud, Trond, Røsok, Marie Iversdatter, Beyer, Yngvil, Roald, Marie

arXiv.org Artificial IntelligenceJan-13-2025

Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the S\'ami documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in S\'ami languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing S\'ami texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for S\'ami languages, even with a moderate amount of manually annotated data.

artificial intelligence, machine learning, proceedings, (18 more...)

arXiv.org Artificial Intelligence

2501.073

Country:

Europe > Norway (0.71)
North America > United States (0.69)

Genre: Research Report > New Finding (0.54)

Industry: Energy > Oil & Gas (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model

Lauc, Davor

arXiv.org Artificial IntelligenceDec-12-2024

This paper presents PolyIPA, a novel multilingual phoneme-to-grapheme conversion model designed for multilingual name transliteration, onomastic research, and information retrieval. The model leverages two helper models developed for data augmentation: IPA2vec for finding soundalikes across languages, and similarIPA for handling phonetic notation variations. Evaluated on a test set that spans multiple languages and writing systems, the model achieves a mean Character Error Rate of 0.055 and a character-level BLEU score of 0.914, with particularly strong performance on languages with shallow orthographies. The implementation of beam search further improves practical utility, with top-3 candidates reducing the effective error rate by 52.7\% (to CER: 0.026), demonstrating the model's effectiveness for cross-linguistic applications.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2412.09102

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Croatia > Zagreb County > Zagreb (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

Add feedback

NoLoR: An ASR-Based Framework for Expedited Endangered Language Documentation with Neo-Aramaic as a Case Study

Nazari, Matthew

arXiv.org Artificial IntelligenceDec-5-2024

The documentation of the Neo-Aramaic dialects before their extinction has been described as the most urgent task in all of Semitology today. The death of this language will be an unfathomable loss to the descendents of the indigenous speakers of Aramaic, now predominantly diasporic after forced displacement due to violence. This paper develops an ASR model to expedite the documentation of this endangered language and generalizes the strategy in a new framework we call NoLoR.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2412.04717

Country:

Asia > Armenia (0.05)
Asia > Middle East > Republic of Türkiye (0.04)
Asia > Middle East > Iraq (0.04)
(7 more...)

Genre: Research Report (0.40)

Industry: Government (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)

Add feedback