AITopics | ocr error

ocr error

PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy

Guan, Shuhao, Lin, Moule, Xu, Cheng, Liu, Xinyi, Zhao, Jinman, Fan, Jiexin, Xu, Qi, Greene, Derek

arXiv.org Artificial Intelligence, Nov-19-2025

This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
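The character error rate (CER) figures quoted above can be made concrete with a small sketch. It assumes the common definition of CER as Levenshtein edit distance divided by reference length, and reads the reported reduction as a relative one; the abstract does not spell out either choice, and the function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed (assumed reading of 'reduces CER by X%')."""
    return (baseline - improved) / baseline
```

Under these assumptions, an OCR output with a CER of 0.10 on raw images and 0.03 after the pipeline would correspond to a 70% relative reduction.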

large language model, machine learning, natural language, (18 more...)

doi: 10.18653/v1/2025.acl-long.749

arXiv: 2505.20429

Country:

North America > Canada > Ontario (0.28)
Europe (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding

Agarwal, Amit, Panda, Srikant, Pachauri, Kulbhushan

arXiv.org Artificial Intelligence, Nov-13-2025

In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language- or vision-specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with fewer than 90M parameters, making it well-suited for complex real-world Information Extraction (IE) applications where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments on information extraction tasks, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code: https://github.com/oracle-samples/fs-dag

data mining, large language model, machine learning, (18 more...)

arXiv: 2505.1733

Country: Europe > Austria (0.28)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.89)
Information Technology > Data Science > Data Mining > Text Mining (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

A Data-driven Investigation of Euphemistic Language: Comparing the usage of "slave" and "servant" in 19th century US newspapers

Park, Jaihyun, Cordell, Ryan

arXiv.org Artificial Intelligence, Mar-19-2025

This study investigates the usage of "slave" and "servant" in 19th-century US newspapers using computational methods. While both terms were used to refer to enslaved African Americans, they were used in distinct ways. In the Chronicling America corpus, we included possible OCR errors by using FastText embeddings and excluded text reprints to account for the reprint culture of the 19th century. Word2vec embeddings were used to find words semantically close to "slave" and "servant", and the log-odds ratio was calculated to identify over-represented discourse words in Southern and Northern newspapers. We found that "slave" is associated with socio-economic, legal, and administrative words; "servant", however, is linked to religious words in Northern newspapers, while Southern newspapers associated "servant" with domestic and familial words. We further found that slave discourse words from Southern newspapers are more prevalent in Northern newspapers, while servant discourse words from each side are prevalent in their own region. This study contributes to the understanding of how newspapers created different discourses around enslaved African Americans in the 19th-century US.
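As an illustration of how a log-odds ratio can flag words that are over-represented in one corpus relative to another, here is a minimal sketch. The abstract does not say which variant the study used, so the additive 0.5 smoothing, the function name, and the toy counts below are all assumptions made for the example.

```python
import math
from collections import Counter


def log_odds_ratio(word: str, counts_a: Counter, counts_b: Counter,
                   alpha: float = 0.5) -> float:
    """Smoothed log-odds ratio of `word` in corpus A versus corpus B.

    Positive values mean the word is over-represented in A, negative in B.
    `alpha` is an additive smoothing constant (an assumption of this sketch)
    that keeps the ratio finite when a word is absent from one corpus.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    # Odds = occurrences of the word / occurrences of every other word.
    odds_a = (counts_a[word] + alpha) / (total_a - counts_a[word] + alpha)
    odds_b = (counts_b[word] + alpha) / (total_b - counts_b[word] + alpha)
    return math.log(odds_a) - math.log(odds_b)


# Toy token counts for two hypothetical corpora.
corpus_a = Counter({"law": 40, "estate": 30, "church": 5, "family": 25})
corpus_b = Counter({"law": 10, "estate": 10, "church": 45, "family": 35})

# Rank the shared vocabulary from most A-leaning to most B-leaning.
vocab = set(corpus_a) | set(corpus_b)
ranked = sorted(vocab, key=lambda w: log_odds_ratio(w, corpus_a, corpus_b),
                reverse=True)
```

Sorting by the score puts the words most characteristic of the first corpus at the top of `ranked` and those most characteristic of the second at the bottom.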

artificial intelligence, natural language, text processing, (20 more...)

arXiv: 2503.15057

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
North America > United States > Texas (0.04)
North America > United States > Tennessee (0.04)
(16 more...)

Genre: Research Report > New Finding (0.46)

Industry: Media > News (1.00)

Technology:

Information Technology > Information Management (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts

Piryani, Bhawna, Mozafari, Jamshid, Abdallah, Abdelrahman, Doucet, Antoine, Jatowt, Adam

arXiv.org Artificial Intelligence, Feb-23-2025

Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors (imperfect extraction of the text, including character insertion, deletion, and permutation) can significantly impact downstream tasks like question answering (QA). In this work, we introduce MultiOCR-QA, a multilingual QA dataset designed to analyze the effects of OCR noise on QA systems' performance. The MultiOCR-QA dataset comprises 60K question-answer pairs covering three languages: English, French, and German. The dataset is curated from OCR-ed old documents, allowing for the evaluation of OCR-induced challenges on question answering. We evaluate MultiOCR-QA on various levels and types of OCR errors to assess the robustness of LLMs in handling real-world digitization errors. Our findings show that QA systems are highly prone to OCR-induced errors and exhibit performance degradation on noisy OCR text.

dataset, multiocr-qa, ocr error, (13 more...)

arXiv: 2502.16781

Country:

Europe > Austria > Tyrol > Innsbruck (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
(12 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
(2 more...)

Adapting Multilingual Embedding Models to Historical Luxembourgish

Michail, Andrianos, Raclé, Corina Julia, Opitz, Juri, Clematide, Simon

arXiv.org Artificial Intelligence, Feb-11-2025

The growing volume of digitized historical texts requires effective semantic search using text embeddings. However, pre-trained multilingual models, typically evaluated on contemporary texts, face challenges with historical digitized content due to OCR noise and outdated spellings. We explore the use of multilingual embeddings for cross-lingual semantic search on historical Luxembourgish, a low-resource language. We collect historical Luxembourgish news articles spanning various time periods and use GPT-4o to segment and translate them into closely related languages, creating 20,000 parallel training sentences per language pair. We further create a historical bitext mining evaluation set and find that these models struggle to perform cross-lingual search on historical Luxembourgish. To address this, we propose a simple adaptation method using in-domain training data, achieving up to 98% accuracy in cross-lingual evaluations. We release our adapted models and historical Luxembourgish-German/French bitexts to support further research.

large language model, machine learning, natural language, (21 more...)

arXiv: 2502.07938

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
Europe > Switzerland > Zürich > Zürich (0.05)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)

Scrambled text: training Language Models to correct OCR errors using synthetic data

Bourne, Jonathan

arXiv.org Artificial Intelligence, Sep-29-2024

OCR errors are common in digitised historical archives, significantly affecting their usability and value. Generative Language Models (LMs) have shown potential for correcting these errors using the context provided by the corrupted text and the broader socio-cultural context, a process called Context Leveraging OCR Correction (CLOCR-C). However, getting sufficient training data for fine-tuning such models can prove challenging. This paper shows that fine-tuning a language model on synthetic data created using an LM and a character-level Markov corruption process can significantly improve the ability to correct OCR errors. Models trained on synthetic data reduce the character error rate by 55% and the word error rate by 32% over the base LM and outperform models trained on real data. Key findings include: training on under-corrupted data is better than training on over-corrupted data; non-uniform character-level corruption is better than uniform corruption; and more tokens per observation outperforms more observations for a fixed token budget. The outputs of this paper are a set of 8 heuristics for training effective CLOCR-C models, a dataset of 11,000 synthetic 19th-century newspaper articles, and scrambledtext, a Python library for creating synthetic corrupted data.
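To give a rough sense of what a character-level Markov corruption process can look like, the sketch below uses a two-state chain (clean or corrupting) so that errors tend to arrive in bursts rather than uniformly. This is an assumed toy design, not the paper's process and not the scrambledtext API: the function name, the probabilities, and the three edit operations are all hypothetical.

```python
import random


def markov_corrupt(text: str, p_start: float = 0.05, p_stay: float = 0.5,
                   alphabet: str = "abcdefghijklmnopqrstuvwxyz",
                   seed=None) -> str:
    """Corrupt `text` one character at a time with a two-state Markov chain.

    p_start: chance of corrupting a character when the previous one was clean.
    p_stay:  chance of corrupting a character when the previous one was corrupted.
    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    out = []
    corrupting = False  # the chain starts in the clean state
    for ch in text:
        # The transition probability depends only on the previous state.
        corrupting = rng.random() < (p_stay if corrupting else p_start)
        if not corrupting:
            out.append(ch)
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out.append(rng.choice(alphabet))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(alphabet))
        # "delete" appends nothing, dropping the character.
    return "".join(out)


noisy = markov_corrupt("the quick brown fox jumps over the lazy dog", seed=0)
```

Raising `p_stay` relative to `p_start` makes the damage more clustered, which is one simple way to obtain non-uniform corruption; pairing each clean text with its `noisy` counterpart yields training examples for a correction model.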

corruption, dataset, synthetic data, (16 more...)

arXiv: 2409.19735

Country:

Oceania > Australia (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Advancing Post-OCR Correction: A Comparative Study of Synthetic Data

Guan, Shuhao, Greene, Derek

arXiv.org Artificial Intelligence, Aug-13-2024

This paper explores the application of synthetic data in the post-OCR domain on multiple fronts by conducting experiments to assess the impact of data volume, augmentation, and synthetic data generation methods on model performance. Furthermore, we introduce a novel algorithm that leverages computer vision feature detection algorithms to calculate glyph similarity for constructing post-OCR synthetic data. Through experiments conducted across a variety of languages, including several low-resource ones, we demonstrate that models like ByT5 can significantly reduce Character Error Rates (CER) without the need for manually annotated data, and our proposed synthetic data generation method shows advantages over traditional methods, particularly in low-resource languages.

correction, dataset, experiment, (17 more...)

arXiv: 2408.02253

Country:

South America > Argentina (0.04)
Europe > Norway > Eastern Norway > Oslo (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Manrique-Gómez, Laura, Montes, Tony, Manrique, Rubén

arXiv.org Artificial Intelligence, Jul-3-2024

… as key historical resources, contain a diverse range of information about political, economic, and cultural processes and are abundant due to focused efforts to preserve them within national archives. Indeed, the discipline of Digital Humanities, which emphasizes the incorporation of digital tools in humanities and social sciences research, has spent much of the past three decades on the task of digitization, resulting in a wealth of curated digital collections (Berry and Fagerjord, 2017; Dobson, 2019). However, digitizing these corpora has brought plenty of challenges in transcribing the images into machine-readable texts.

Another substantial project is the "Digging into Data Challenge". A part of the Transatlantic Partnership for Social Sciences and Humanities 2016, this initiative yielded a vast collection of 19th-century press materials known as "Atlas - Oceanic Exchanges. Tracing Global Information Networks in Historical Papers" (Exchanges). Other significant works include "Viral Texts: Mapping Networks of Reprinting in 19th-Century Newspapers and Magazines" (Cordell and Smith), a project that investigates 19th-century journalistic reports to understand the culture of reprinting in the United States before the Civil War, and …

correction, dataset, ocr error, (10 more...)

arXiv: 2407.12838

Country:

North America > Panama (0.05)
South America > Venezuela (0.05)
South America > Colombia > Bogotá D.C. > Bogotá (0.05)
(7 more...)

Genre: Research Report (0.40)

Industry: Media > News (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.47)

Large Language Models for Simultaneous Named Entity Extraction and Spelling Correction

Whittaker, Edward, Kitagishi, Ikuo

arXiv.org Artificial Intelligence, Mar-1-2024

Language Models (LMs) such as BERT have been shown to perform well on the task of identifying Named Entities (NE) in text. A BERT LM is typically used as a classifier to classify individual tokens in the input text, or to classify spans of tokens, as belonging to one of a set of possible NE categories. In this paper, we hypothesise that decoder-only Large Language Models (LLMs) can also be used generatively to extract the NE and potentially recover its correct surface form, so that any spelling errors present in the input text are automatically corrected. We fine-tune two BERT LMs as baselines, as well as eight open-source LLMs, on the task of producing NEs from text that was obtained by applying Optical Character Recognition (OCR) to images of Japanese shop receipts; in this work, we do not attempt to find or evaluate the location of NEs in the text. We show that the best fine-tuned LLM performs as well as, or slightly better than, the best fine-tuned BERT LM, although the differences are not significant. However, the best LLM is also shown to correct OCR errors in some cases, as initially hypothesised.

llm, ne category, receipt, (16 more...)

arXiv: 2403.00528

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Japan > Honshū > Tōhoku (0.04)
North America > United States > Maryland > Montgomery County > Gaithersburg (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

A Part-of-Speech Tagger for Yiddish

Kulick, Seth, Ryant, Neville, Santorini, Beatrice, Wallenberg, Joel, Urieli, Assaf

arXiv.org Artificial Intelligence, Aug-18-2023

We describe the construction and evaluation of a part-of-speech tagger for Yiddish. This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text for purposes of linguistic research. We combine two resources for the current work: an 80K-word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC). Yiddish orthography in the YBC corpus has many spelling inconsistencies, and we present some evidence that even simple non-contextualized embeddings trained on YBC are able to capture the relationships among spelling variants without the need to first "standardize" the corpus. We also use YBC for continued pretraining of contextualized embeddings, which are then integrated into a tagger model trained and evaluated on the PPCHY. We evaluate the tagger performance on a 10-fold cross-validation split, showing that the use of the YBC text for the contextualized embeddings improves tagger performance. We conclude by discussing some next steps, including the need for additional annotated training and test data.

machine learning, natural language, ybc corpus, (18 more...)

arXiv: 2204.01175

Country:

Europe > Finland > Uusimaa > Helsinki (0.05)
North America > United States > Pennsylvania (0.04)
North America > United States > New York (0.04)
(6 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
