AITopics | tesseract

A U-Net and Transformer Pipeline for Multilingual Image Translation

Sahay, Siddharth, Agarwal, Radhika

arXiv.org Artificial Intelligence, Oct-28-2025

This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset, to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted text is fed into a custom Transformer model trained from scratch on a multilingual parallel corpus spanning 5 languages. Unlike systems reliant on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on its text detection accuracy, text recognition quality, and translation performance via BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.
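The three-stage flow described in this abstract (detect, recognize, translate) can be sketched as a small composition of stages. This is only an illustrative outline, not the authors' implementation: the class name `ImageTranslationPipeline`, the `Box` tuple layout, and the three injected callables are all hypothetical, and each stage is left as a plug-in so that no assumption is made about how the U-Net, Tesseract, or Transformer is actually invoked.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical region format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class ImageTranslationPipeline:
    """Chains detection, recognition, and translation stages.

    All three stages are injected as callables (assumed interfaces),
    so this class says nothing about how each model is implemented.
    """
    detect: Callable[[object], Sequence[Box]]   # e.g. a U-Net wrapper
    recognize: Callable[[object, Box], str]     # e.g. a Tesseract wrapper
    translate: Callable[[str], str]             # e.g. a Transformer wrapper

    def run(self, image: object) -> List[Tuple[Box, str, str]]:
        """Return (box, source_text, translated_text) per detected region."""
        results = []
        for box in self.detect(image):
            source = self.recognize(image, box).strip()
            if not source:
                continue  # skip regions where recognition found no text
            results.append((box, source, self.translate(source)))
        return results
```

Keeping the stages separate mirrors the abstract's evaluation, which scores detection, recognition, and translation individually.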

machine learning, natural language, translation, (17 more...)

arXiv.org Artificial Intelligence

2510.23554

Country:

North America > United States (0.69)
Europe (0.68)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Regression-aware Continual Learning for Android Malware Detection

Ghiani, Daniele, Angioni, Daniele, Piras, Giorgio, Sotgiu, Angelo, Minnei, Luca, Gupta, Srishti, Pintor, Maura, Roli, Fabio, Biggio, Battista

arXiv.org Artificial Intelligence, Jul-25-2025

Malware evolves rapidly, forcing machine learning (ML)-based detectors to adapt continuously. With antivirus vendors processing hundreds of thousands of new samples daily, datasets can grow to billions of examples, making full retraining impractical. Continual learning (CL) has emerged as a scalable alternative, enabling incremental updates without full data access while mitigating catastrophic forgetting. In this work, we analyze a critical yet overlooked issue in this context: security regression. Unlike forgetting, which manifests as a general performance drop on previously seen data, security regression captures harmful prediction changes at the sample level, such as a malware sample that was once correctly detected but evades detection after a model update. Although often overlooked, regressions pose serious risks in security-critical applications, as the silent reintroduction of previously detected threats in the system may undermine users' trust in the whole updating process. To address this issue, we formalize and quantify security regression in CL-based malware detectors and propose a regression-aware penalty to mitigate it. Specifically, we adapt Positive Congruent Training (PCT) to the CL setting, preserving prior predictive behavior in a model-agnostic manner. Experiments on the ELSA, Tesseract, and AZ-Class datasets show that our method effectively reduces regression across different CL scenarios while maintaining strong detection performance over time.
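The distinction between forgetting and sample-level regression can be made concrete with a small sketch. The functions below count "negative flips", meaning samples the old model classified correctly and the updated model gets wrong. The function names and the label convention (1 = malware, 0 = benign) are assumptions made for illustration, and the paper's exact formalization may differ.

```python
from typing import List, Sequence


def negative_flip_rate(y_true: Sequence[int],
                       old_pred: Sequence[int],
                       new_pred: Sequence[int]) -> float:
    """Fraction of samples the old model got right and the new model gets wrong."""
    if not (len(y_true) == len(old_pred) == len(new_pred)):
        raise ValueError("all inputs must have the same length")
    if not y_true:
        return 0.0
    flips = sum(1 for y, o, n in zip(y_true, old_pred, new_pred)
                if o == y and n != y)
    return flips / len(y_true)


def malware_regressions(y_true: Sequence[int],
                        old_pred: Sequence[int],
                        new_pred: Sequence[int]) -> List[int]:
    """Indices of malware (label 1) detected before the update but missed after."""
    return [i for i, (y, o, n) in enumerate(zip(y_true, old_pred, new_pred))
            if y == 1 and o == 1 and n == 0]


# Both models are right on 4 of 5 samples, yet one malware sample regresses.
labels = [1, 1, 0, 1, 0]
before = [1, 1, 0, 0, 0]
after = [1, 0, 0, 1, 0]
print(negative_flip_rate(labels, before, after))   # 0.2
print(malware_regressions(labels, before, after))  # [1]
```

In this toy case aggregate accuracy is unchanged across the update, which is why an accuracy-only view would miss the regression.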

machine learning, natural language, regression, (15 more...)

arXiv.org Artificial Intelligence

2507.18313

Country: Europe > Italy (0.48)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite

Torterolo-Orta, Yanco Amor, Macicior-Mitxelena, Jaione, Miguez-Lamanuzzi, Marina, García-Serrano, Ana

arXiv.org Artificial Intelligence, Jul-8-2025

This article presents the experiments and results obtained by the GRESEL team in the IberLEF 2025 shared task PastReader: Transcribing Texts from the Past. Three types of experiments were conducted with the dual aim of participating in the task and enabling comparisons across different approaches. These included the use of a web-based OCR service, a traditional OCR engine, and a compact multimodal model. All experiments were run on consumer-grade hardware, which, despite lacking high-performance computing capacity, provided sufficient storage and stability. The results, while satisfactory, leave room for further improvement. Future work will focus on exploring new techniques and ideas using the Spanish-language dataset provided by the shared task, in collaboration with Biblioteca Nacional de España (BNE).

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2507.04878

Country:

Europe > Spain (0.88)
Asia > Middle East > UAE (0.28)

Genre: Research Report (0.82)

Industry:

Information Technology (0.68)
Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

TesserAct: Learning 4D Embodied World Models

Zhen, Haoyu, Sun, Qiao, Zhang, Hongxin, Li, Junyan, Zhou, Siyuan, Du, Yilun, Gan, Chuang

arXiv.org Artificial Intelligence, Apr-30-2025

This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video-based world models.

artificial intelligence, arxiv preprint arxiv, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2504.20995

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.87)

Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway

Enstad, Tita, Trosterud, Trond, Røsok, Marie Iversdatter, Beyer, Yngvil, Roald, Marie

arXiv.org Artificial Intelligence, Jan-13-2025

Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.

artificial intelligence, machine learning, proceedings, (18 more...)

arXiv.org Artificial Intelligence

2501.073

Country:

Europe > Norway (0.71)
North America > United States (0.69)

Genre: Research Report > New Finding (0.54)

Industry: Energy > Oil & Gas (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

LMV-RPA: Large Model Voting-based Robotic Process Automation

Abdellatif, Osama, Ayman, Ahmed, Hamdi, Ali

arXiv.org Artificial Intelligence, Dec-23-2024

Automating high-volume unstructured data processing is essential for operational efficiency. Optical Character Recognition (OCR) is critical but often struggles with accuracy and efficiency in complex layouts and ambiguous text. These challenges are especially pronounced in large-scale tasks requiring both speed and precision. This paper introduces LMV-RPA, a Large Model Voting-based Robotic Process Automation system to enhance OCR workflows. LMV-RPA integrates outputs from OCR engines such as Paddle OCR, Tesseract OCR, Easy OCR, and DocTR with Large Language Models (LLMs) like LLaMA 3 and Gemini-1.5-pro. Using a majority voting mechanism, it processes OCR outputs into structured JSON formats, improving accuracy, particularly in complex layouts. The multi-phase pipeline processes text extracted by OCR engines through LLMs, combining results to ensure the most accurate outputs. LMV-RPA achieves 99 percent accuracy in OCR tasks, surpassing baseline models with 94 percent, while reducing processing time by 80 percent. Benchmark evaluations confirm its scalability and demonstrate that LMV-RPA offers a faster, more reliable, and efficient solution for automating large-scale document processing tasks.
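A majority vote over several engines' structured outputs can be sketched as follows. The abstract does not specify how votes are counted, so this example assumes per-field voting over flat dictionaries; the function names and the sample field names (`invoice_no`, `total`) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def vote_field(candidates: List[Optional[str]]) -> Optional[str]:
    """Return the most common non-empty value; ties go to the value seen first."""
    normalized = [c.strip() for c in candidates if c and c.strip()]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]


def vote_records(records: List[Dict[str, str]]) -> Dict[str, Optional[str]]:
    """Combine per-engine records field by field using majority vote."""
    keys: List[str] = []
    for record in records:
        for key in record:
            if key not in keys:
                keys.append(key)  # preserve first-seen field order
    return {key: vote_field([r.get(key) for r in records]) for key in keys}


# Three hypothetical engine outputs for the same document.
outputs = [
    {"invoice_no": "A-1042", "total": "118.00"},
    {"invoice_no": "A-1042", "total": "113.00"},
    {"invoice_no": "A-1O42", "total": "118.00"},
]
print(vote_records(outputs))  # {'invoice_no': 'A-1042', 'total': '118.00'}
```

Voting per field rather than per document lets one engine's error in a single field be outvoted without discarding its correct fields.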

large language model, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2412.17965

Genre: Research Report (0.64)

Industry: Information Technology (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages

Kashid, Harshvivek, Bhattacharyya, Pushpak

arXiv.org Artificial Intelligence, Dec-14-2024

Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Just like Machine Translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR, that tackles the scarcity of the post-OCR Error Correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction by leveraging techniques from machine translation. Our method involves translating erroneous OCR output into a corrected form by treating the OCR errors as mistranslations in a parallel text corpus, employing pre-trained transformer models to learn the mapping from erroneous to correct text pairs, effectively correcting OCR errors.

computational linguistic, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2412.15248

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(12 more...)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset

Majeed, Ameer, Hassani, Hossein

arXiv.org Artificial Intelligence, Aug-24-2024

Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing an optical character recognition (OCR) model based on handwritten Syriac texts as a starting point to build more digital services for this endangered language. A dataset was created, KHAMIS (inspired by the East Syriac poet, Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script. We used it to fine-tune the Tesseract-OCR engine's pretrained Syriac model on handwritten data. The data was collected from volunteers capable of reading and writing in the language to create KHAMIS. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor; it will be partially available online, with the whole dataset to follow in the near future for development and research purposes. As a result, the handwritten OCR model achieved a character error rate of 1.097-1.610% and 8.963-10.490% on the training and evaluation sets, respectively, and a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% on the test set, which is twice as good as the default Syriac model of Tesseract.
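The character and word error rates quoted above are conventionally computed as edit distance divided by reference length. The sketch below uses plain Levenshtein distance; it is a generic illustration, and the paper's own tooling may normalize text or count errors differently. The function names are chosen for this example.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One wrong character in a 19-character, 4-word line.
print(round(cer("the quick brown fox", "the quick brovn fox"), 3))  # 0.053
print(wer("the quick brown fox", "the quick brovn fox"))            # 0.25
```

Because a single wrong character invalidates a whole word, WER is typically much higher than CER on the same output, which is consistent with the gap between the two test-set figures reported here.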

dataset, recognition, syriac, (14 more...)

arXiv.org Artificial Intelligence

2408.13631

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Middle East > Iraq > Kurdistan Region > Duhok Governorate > Duhok (0.05)
Asia > Middle East > Iraq > Erbil Governorate > Erbil (0.05)
(11 more...)

Genre: Research Report (1.00)

Industry:

Government (0.93)
Education > Educational Setting > Higher Education (0.39)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.85)

Image Based Character Recognition, Documentation System To Decode Inscription From Temple

G, Velmathi, M, Shangavelan, D, Harish, S, Krithikshun M

arXiv.org Artificial Intelligence, May-21-2024

This project undertakes the training and analysis of optical character recognition (OCR) methods applied to 10th-century ancient Tamil inscriptions discovered on the walls of the Brihadeeswarar Temple. The chosen OCR methods include Tesseract, a widely used OCR engine, using modern ICR techniques to preprocess the raw data and box-editing software to fine-tune our model. The analysis with Tesseract aims to evaluate the effectiveness of these methods in accurately deciphering the nuances of the ancient Tamil characters. The performance of our model on the dataset is determined by its accuracy rate, with the evaluated dataset divided into a training set and a testing set. By addressing the unique challenges posed by the script's historical context, this study seeks to contribute valuable insights to the broader field of OCR, facilitating improved preservation and interpretation of ancient inscriptions.

inscription, pixel, temple inscription, (12 more...)

arXiv.org Artificial Intelligence

2405.17449

Country: Asia > India > Tamil Nadu > Chennai (0.06)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.85)

Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines

Yaseen, Blnd, Hassani, Hossein

arXiv.org Artificial Intelligence, Apr-9-2024

Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan. Having a good Optical Character Recognition (OCR) system to help process these publications and contribute to Kurdish language resources is crucial, as Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents because the documents have many issues, including being damaged, very fragile, having many marks left on them, and often being written in non-standard fonts. This is a massive obstacle in processing these documents, as currently processing them requires manual typing, which is very time-consuming. In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages. Currently, there is no public dataset, so we developed our own by collecting historical documents from Zheen Center for Documentation and Research, which were printed before 1950, resulting in a dataset of 1233 images of lines with a transcription of each. Then we used the Arabic model as our base model and trained the model using the dataset. We used different methods to evaluate our model: Tesseract's built-in evaluator lstmeval indicated a Character Error Rate (CER) of 0.755%. Additionally, Ocreval demonstrated an average character accuracy of 84.02%. Finally, we developed a web application to provide an easy-to-use interface for end-users, allowing them to interact with the model by inputting an image of a page and extracting the text. Having an extensive dataset is crucial to develop OCR systems with reasonable accuracy; as no public datasets are currently available for historical Kurdish documents, this posed a significant challenge in our work. Additionally, the unaligned spaces between characters and words proved another challenge.

dataset, historical document, recognition, (14 more...)

arXiv.org Artificial Intelligence

2404.06101

Country:

Asia > Middle East > Iraq > Baghdad Governorate > Baghdad (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
(17 more...)

Genre: Research Report > New Finding (1.00)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
