AITopics | Optical Character Recognition

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound where continuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. Pattern Recognition by Machine. In Computers and Thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW

Mehta, Prateek, Patil, Anasuya

arXiv.org Artificial Intelligence, Jun-19-2025

Abstract: The ability to extract knowledge just by listening to sounds is a distinctive property. Visually impaired people depend solely on Braille books and audio recordings provided by NGOs. Owing to the many constraints of these two approaches, blind people cannot access the books of their choice. Speech is a more effective means of communication than text for them, as blind and visually impaired persons can easily respond to sounds. This paper aims to develop an accurate, reliable, cost-effective, and user-friendly optical character recognition (OCR) based speech synthesis system.

artificial intelligence, optical character recognition, speech synthesis, (11 more...)

arXiv.org Artificial Intelligence

2506.15029

Country: Asia > India (0.29)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.60)

Technology: Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)

The best part of the future is finally having a permanent replacement for this annoying technology

We don't have flying cars, jetpacks haven't replaced walking, and I have not seen a single sign that we're all pivoting to wearing matching silver jumpsuits. The future is kind of lame. SwiftScan VIP is a scanner tool that basically replaces half of your old office equipment with an app that works on iOS and Android devices. It's also a lot cheaper than some desktop scanners, and you don't need to replace it every few years. During this limited-time sale, you can get a SwiftScan VIP Lifetime Subscription for only $41.99 (it's usually $199.99).

artificial intelligence, optical character recognition, permanent replacement, (5 more...)

Popular Science

Industry: Information Technology (0.73)

Technology:

Information Technology > Communications > Mobile (0.93)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.33)

Reading in the Dark with Foveated Event Vision

Brander, Carl, Cioffi, Giovanni, Messikommer, Nico, Scaramuzza, Davide

arXiv.org Artificial Intelligence, Jun-10-2025

Current smart glasses equipped with RGB cameras struggle to perceive the environment in low-light and high-speed motion scenarios due to motion blur and the limited dynamic range of frame cameras. Additionally, capturing dense images with a frame camera requires large bandwidth and power consumption, consequently draining the battery faster. These challenges are especially relevant for developing algorithms that can read text from images. In this work, we propose a novel event-based Optical Character Recognition (OCR) approach for smart glasses. By using the eye gaze of the user, we foveate the event stream to significantly reduce bandwidth by around 98% while exploiting the benefits of event cameras in high-dynamic and fast scenes. Our proposed method performs deep binary reconstruction trained on synthetic data and leverages multi-modal LLMs for OCR, outperforming traditional OCR solutions. Our results demonstrate the ability to read text in low-light environments where RGB cameras struggle while using up to 2,400 times less bandwidth than a wearable RGB camera.
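
To make the idea of gaze-based foveation concrete, here is a minimal Python sketch. The abstract does not say how the authors select events, so everything below is an assumption made for illustration: the `Event` tuple, the circular window, and the names `foveate` and `bandwidth_reduction` are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    """One event-camera sample (hypothetical layout, not the paper's format)."""
    x: int         # pixel column
    y: int         # pixel row
    t: float       # timestamp in seconds
    polarity: int  # +1 for a brightness increase, -1 for a decrease


def foveate(events: List[Event], gaze_x: int, gaze_y: int, radius: int) -> List[Event]:
    """Keep only events inside a circle of `radius` pixels around the gaze point.

    A circular window is an assumption; the paper may use a different region.
    """
    r2 = radius * radius
    return [
        e for e in events
        if (e.x - gaze_x) ** 2 + (e.y - gaze_y) ** 2 <= r2
    ]


def bandwidth_reduction(total: int, kept: int) -> float:
    """Fraction of events discarded, e.g. 0.98 means 98% fewer events sent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - kept / total


if __name__ == "__main__":
    stream = [
        Event(10, 10, 0.001, 1),
        Event(12, 11, 0.002, -1),
        Event(300, 200, 0.003, 1),
        Event(310, 205, 0.004, 1),
    ]
    kept = foveate(stream, gaze_x=11, gaze_y=10, radius=5)
    print(len(kept), bandwidth_reduction(len(stream), len(kept)))
```

In this toy stream the saving depends only on how many events fall outside the window; the roughly 98% figure reported in the abstract comes from the authors' own data and setup, not from this sketch.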

large language model, machine learning, pattern recognition, (20 more...)

arXiv.org Artificial Intelligence

2506.06918

Country: Europe > Switzerland (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.69)

Sight Guide: A Wearable Assistive Perception and Navigation System for the Vision Assistance Race in the Cybathlon 2024

Pfreundschuh, Patrick, Cioffi, Giovanni, von Einem, Cornelius, Wyss, Alexander, van de Venn, Hans Wernher, Cadena, Cesar, Scaramuzza, Davide, Siegwart, Roland, Darvishy, Alireza

arXiv.org Artificial Intelligence, Jun-4-2025

Visually impaired individuals face significant challenges navigating and interacting with unknown situations, particularly in tasks requiring spatial awareness and semantic scene understanding. To accelerate the development and evaluate the state of technologies that enable visually impaired people to solve these tasks, the Vision Assistance Race (VIS) at the Cybathlon 2024 competition was organized. In this work, we present Sight Guide, a wearable assistive system designed for the VIS. The system processes data from multiple RGB and depth cameras on an embedded computer that guides the user through complex, real-world-inspired tasks using vibration signals and audio commands. Our software architecture integrates classical robotics algorithms with learning-based approaches to enable capabilities such as obstacle avoidance, object detection, optical character recognition, and touchscreen interaction. In a testing environment, Sight Guide achieved a 95.7% task success rate, and further demonstrated its effectiveness during the Cybathlon competition. This work provides detailed insights into the system design, evaluation results, and lessons learned, and outlines directions towards a broader real-world applicability.

In 2020, approximately 43 million people worldwide were blind, with an additional 295 million suffering from moderate to severe visual impairments [1]. Despite advancements in medical treatments [2], these numbers are projected to rise by 2050 [1]. For individuals with visual impairments, the lack of visual information about their surroundings poses substantial challenges in daily activities. While infrastructure adaptations, such as making public transport more accessible, can mitigate some difficulties, many everyday tasks remain impracticable for blind individuals. To enhance their autonomy, most visually impaired people rely on assistive technologies.

Assistive technologies in this context are hardware- and software-based solutions that help people with disabilities to overcome or to reduce barriers in their lives. Although a variety of vision aids leveraging computer vision and artificial intelligence are available on the market, these solutions are typically limited to specific tasks like text-to-speech conversion [3], description of the surroundings [4], or navigation assistance [5], [6].

artificial intelligence, cybathlon 2024, optical character recognition, (15 more...)

arXiv.org Artificial Intelligence

2506.02676

Country:

Europe > Switzerland > Zürich > Zürich (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (0.68)

Technology: Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.88)

QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

Wasfy, Ahmed, Nacar, Omer, Elkhateb, Abdelakreem, Reda, Mahmoud, Elshehy, Omar, Ammar, Adel, Boulila, Wadii

arXiv.org Artificial Intelligence, Jun-4-2025

The inherent complexities of Arabic script, namely its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
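
For readers unfamiliar with the WER and CER figures quoted above, the following Python sketch shows the commonly used definitions: the Levenshtein edit distance between reference and OCR output, divided by the reference length, counted in words for WER and in characters for CER. This is an illustration under stated assumptions, not the authors' evaluation code; the function names are hypothetical, and the abstract does not say what text normalization or tokenization the paper applies.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning the first i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference item missing from hypothesis
                curr[j - 1] + 1,     # extra item in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits per reference word (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    print(wer("the cat sat", "the cat sit"))
    print(cer("abcd", "abed"))
```

Because Python strings are sequences of Unicode code points, each Arabic diacritic, which is encoded as a separate combining mark, counts as its own character in `cer`. Whether the paper counts tashkeel this way is not stated in the abstract.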

machine learning, pattern recognition, recognition, (19 more...)

arXiv.org Artificial Intelligence

2506.02295

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (0.42)

Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak'wala Legacy Texts

Agarwal, Milind, Rosenblum, Daisy, Anastasopoulos, Antonios

arXiv.org Artificial Intelligence, Jun-3-2025

Kwak'wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak'wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak'wala text, along with post-correction models, to produce a final high-quality transcription.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2506.01775

Country:

Europe (1.00)
North America > United States (0.46)
North America > Canada > British Columbia (0.25)

Genre:

Research Report (0.50)
Overview (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.88)

Zero-Shot Text-to-Speech for Vietnamese

Vu, Thi, Nguyen, Linh The, Nguyen, Dat Quoc

arXiv.org Artificial Intelligence, Jun-3-2025

This paper introduces PhoAudiobook, a newly curated dataset comprising 941 hours of high-quality audio for Vietnamese text-to-speech. Using PhoAudiobook, we conduct experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. Our findings demonstrate that PhoAudiobook consistently enhances model performance across various metrics. Moreover, VALL-E and VoiceCraft exhibit superior performance in synthesizing short sentences, highlighting their robustness in handling diverse linguistic contexts. We publicly release PhoAudiobook to facilitate further research and development in Vietnamese text-to-speech.

artificial intelligence, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.01322

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.83)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.74)

Reviews: FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing Systems, Jun-1-2025, 23:53:13 GMT

Originality: Although phoneme duration prediction is widely adopted in conventional TTS systems, jointly training it in a neural TTS model is new. This paper is one of the first works on non-autoregressive text-to-spectrogram modeling. Quality: This paper seems sound overall, except for a few issues in the comments below. Some of these issues must be addressed before acceptance. Clarity: A well-written paper. Significance: The advantages over its autoregressive counterparts are significant, especially for industrial use.

fastspeech, robust and controllable text, tacotron 2, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)
Information Technology > Artificial Intelligence > Assistive Technologies (0.40)

Reviews: FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing Systems, Jun-1-2025, 23:53:02 GMT

The paper proposes a novel non-autoregressive parallelisation approach for TTS with a mel-spectrogram intermediate representation. The reviewers concur that the paper adds two novel explicit components to TTS systems, the length and duration modules, and that the results on inference speedup and high-quality audio generation are relevant.

fastspeech, review, robust and controllable text

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)
Information Technology > Artificial Intelligence > Assistive Technologies (0.40)

SHDocs: A dataset, benchmark, and method to efficiently generate high-quality, real-world specular highlight data with near-perfect alignment

Neural Information Processing Systems, May-27-2025, 12:52:20 GMT

A frequent problem in vision-based reasoning tasks such as object detection and optical character recognition (OCR) is the persistence of specular highlights. Specular highlights appear as bright spots of glare that occur due to the concentrated reflection of light; these spots manifest as image artifacts which occlude computer vision models and are challenging to reconstruct. Despite this, specular highlight removal receives relatively little attention due to the difficulty of acquiring high-quality, real-world data. We introduce a method to generate specular highlight data with near-perfect alignment and present SHDocs, a dataset of specular highlights on document images created using our method. Through our benchmark, we demonstrate that our dataset enables us to surpass the performance of state-of-the-art specular highlight removal models and downstream OCR tasks.

efficiently generate high-quality, real-world specular highlight data, specular highlight data, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.65)
