AITopics | Optical Character Recognition

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound where continuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. PATTERN RECOGNITION BY MACHINE. In Computers & Thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech

Kim, Taesoo, Kim, Jinju, Kim, Dongchan, Ko, Jong Hwan, Park, Gyeong-Moon

arXiv.org Artificial Intelligence, Jul-29-2025

The rapid advancement of Zero-Shot Text-to-Speech (ZS-TTS) technology has enabled high-fidelity voice synthesis from minimal audio cues, raising significant privacy and ethical concerns. Despite the threats to voice privacy, research to selectively remove the knowledge to replicate unwanted individual voices from pre-trained model parameters has not been explored. In this paper, we address the new challenge of speaker identity unlearning for ZS-TTS systems. To meet this goal, we propose the first machine unlearning frameworks for ZS-TTS, especially Teacher-Guided Unlearning (TGU), designed to ensure the model forgets designated speaker identities while retaining its ability to generate accurate speech for other speakers. Our proposed methods incorporate randomness to prevent consistent replication of forget speakers' voices, assuring unlearned identities remain untraceable. Additionally, we propose a new evaluation metric, speaker-Zero Retrain Forgetting (spk-ZRF). This assesses the model's ability to disregard prompts associated with forgotten speakers, effectively neutralizing its knowledge of these voices. The experiments conducted on the state-of-the-art model demonstrate that TGU prevents the model from replicating forget speakers' voices while maintaining high quality for other speakers. The demo is available at https://speechunlearn.github.io/

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.2014

Country: North America > Canada (0.28)

Genre:

Research Report > Promising Solution (0.49)
Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.93)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.87)

This built-in Windows 11 app can pull the text in any image with one click

PCWorld, Jul-25-2025, 16:44:08 GMT

Microsoft has added an OCR function (Optical Character Recognition) to the Windows Photos app, which basically means it can now recognize text in an image and instantly extract it for you. To use this feature, open any image that contains words or lines of text using the Photos app. Then, click on the "Scan text" button--which looks like a rounded square with three lines of text inside--located at the bottom of the app window. Once clicked, the Photos app will scan the image and highlight all of the text it finds. You can then interact with it like it's actually text, meaning you can highlight passages with your cursor and right-click to perform actions like copying text, selecting all text, or using Bing Search to look up whatever text you currently have highlighted.

artificial intelligence, built-in window 11, optical character recognition, (1 more...)

PCWorld

Technology: Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)

This robot scans rare library books at 2,500 pages per hour

For decades, preservationists charged with digitizing rare books have faced an ironic challenge. The whole point of scanning these often one-of-a-kind objects is to keep the delicate manuscripts from harm. To do that, however, required a much more hands-on approach. One of the first solutions was to simply place a tome in a book cradle, then photograph each individual page. In later years, archivists increasingly relied on more advanced top-down document camera arrays.

artificial intelligence, optical character recognition, robot scan rare library book, (3 more...)

Popular Science

Technology:

Information Technology > Artificial Intelligence > Robots (0.47)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)

Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis

Szankin, Maciej, Venkatasamy, Vidhyananth, Ying, Lihang

arXiv.org Artificial Intelligence, Jul-17-2025

Outdoor advertisements remain a critical medium for modern marketing, yet accurately verifying billboard text visibility under real-world conditions is still challenging. Traditional Optical Character Recognition (OCR) pipelines excel at cropped text recognition but often struggle with complex outdoor scenes, varying fonts, and weather-induced visual noise. Recently, multimodal Vision-Language Models (VLMs) have emerged as promising alternatives, offering end-to-end scene understanding with no explicit detection step. This work systematically benchmarks representative VLMs--including Qwen 2.5 VL 3B, InternVL3, and SmolVLM2--against a compact CNN-based OCR baseline (PaddleOCRv4) across two public datasets (ICDAR 2015 and SVT), augmented with synthetic weather distortions to simulate realistic degradation. Our results reveal that while selected VLMs excel at holistic scene reasoning, lightweight CNN pipelines still achieve competitive accuracy for cropped text at a fraction of the computational cost--an important consideration for edge deployment. To foster future research, we release our weather-augmented benchmark and evaluation code publicly <link provided upon acceptance>.

large language model, machine learning, pattern recognition, (21 more...)

arXiv.org Artificial Intelligence

2507.1173

Genre: Research Report > New Finding (0.48)

Industry: Marketing (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.87)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.87)

UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

Glazer, Neta, Navon, Aviv, Segal, Yael, Shamsian, Aviv, Segev, Hilit, Buchnick, Asaf, Pirchi, Menachem, Hetz, Gil, Keshet, Joseph

arXiv.org Artificial Intelligence, Jul-14-2025

Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly generates both speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the lack of data with speech and background audio aligned in natural context. To overcome the lack of paired training data, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperformed existing baselines, producing natural, high-quality, environmentally aware audios.

artificial intelligence, machine learning, optical character recognition, (16 more...)

arXiv.org Artificial Intelligence

2506.09874

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)

Super Kawaii Vocalics: Amplifying the "Cute" Factor in Computer Voice

Mandai, Yuto, Seaborn, Katie, Nakano, Tomoyasu, Sun, Xin, Wang, Yijia, Kato, Jun

arXiv.org Artificial Intelligence, Jul-10-2025

"Kawaii" is the Japanese concept of cute, which carries sociocultural connotations related to social identities and emotional responses. Yet, virtually all work to date has focused on the visual side of kawaii, including in studies of computer agents and social robots. In pursuit of formalizing the new science of kawaii vocalics, we explored what elements of voice relate to kawaii and how they might be manipulated, manually and automatically. We conducted a four-phase study (grand N = 512) with two varieties of computer voices: text-to-speech (TTS) and game character voices. We found kawaii "sweet spots" through manipulation of fundamental and formant frequencies, but only for certain voices and to a certain extent. Findings also suggest a ceiling effect for the kawaii vocalics of certain voices. We offer empirical validation of the preliminary kawaii vocalics model and an elementary method for manipulating kawaii perceptions of computer voice.

artificial intelligence, frequency, natural language, (21 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3706598.3713709

2507.06235

Country:

North America > United States (0.95)
Europe > United Kingdom > England (0.46)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > New Finding (0.93)
Research Report > Experimental Study (0.93)

Industry:

Media > Music (0.93)
Health & Medicine (0.88)
Leisure & Entertainment > Games > Computer Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.46)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.34)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.34)
Information Technology > Artificial Intelligence > Robots > Robots in the Home (0.34)

Logios : An open source Greek Polytonic Optical Character Recognition system

Konstantinos, Perifanos, Dionisis, Goutsos

arXiv.org Artificial Intelligence, Jun-27-2025

In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use.

I. Introduction

Historical Greek polytonic scripts have a rather complex target vocabulary and a varied set of rules, resulting in a large character set of more than 200 characters, including the acute accent, the grave accent, the circumflex, the rough breathing (dasi pneuma), the smooth breathing (psilon pneuma), the diaeresis and the iota subscript [1].

machine learning, pattern recognition, recognition, (15 more...)

arXiv.org Artificial Intelligence

2506.21474

Country: Europe > Greece > Attica > Athens (0.05)

Genre: Research Report (0.53)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

Minixhofer, Christoph, Klejch, Ondrej, Bell, Peter

arXiv.org Artificial Intelligence, Jun-25-2025

Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one out of 16 compared metrics to correlate with a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: A dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.
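The abstract above judges metrics by their Spearman correlation with subjective scores. As a rough illustration of what that statistic measures, here is a minimal pure-Python sketch. It is not the TTSDS2 authors' code; the function names and the sample values are hypothetical and chosen only for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x ** 0.5 * var_y ** 0.5)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-system values: an objective metric and listener ratings.
objective_scores = [0.62, 0.71, 0.55, 0.90, 0.80]
listener_ratings = [3.1, 3.6, 3.4, 4.5, 4.0]
rho = spearman(objective_scores, listener_ratings)
```

Because only ranks enter the calculation, a metric can score well here even when its relationship to listener ratings is monotonic but not linear.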

artificial intelligence, correlation, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2506.19441

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Connecting Vision and Emissions: A Behavioural AI Approach to Carbon Estimation in Road Design

Mhdawi, Ammar K Al, Nnamoko, Nonso, Raafat, Safanah Mudheher, Al-Mhdawi, M. K. S., Humaidi, Amjad J

arXiv.org Artificial Intelligence, Jun-25-2025

We present an enhanced YOLOv8 real-time vehicle detection and classification framework for estimating carbon emissions in urban environments. The system enhances the YOLOv8 architecture to detect, segment, and track vehicles from live traffic video streams. Once a vehicle is localized, a dedicated deep learning-based identification module is employed to recognize license plates and classify vehicle types. Since YOLOv8 lacks the built-in capacity for fine-grained recognition tasks such as reading license plates or determining vehicle attributes beyond class labels, our framework incorporates a hybrid pipeline where each detected vehicle is tracked and its bounding box is cropped and passed to a deep Optical Character Recognition (OCR) module. This OCR system, composed of multiple convolutional neural network (CNN) layers, is trained specifically for character-level detection and license plate decoding under varied conditions such as motion blur, occlusion, and diverse font styles. Additionally, the recognized plate information is validated using a real-time API that cross-references an external vehicle registration database to ensure accurate classification and emission estimation. This multi-stage approach enables precise, automated calculation of per-vehicle carbon emissions. Extensive evaluation was conducted using a diverse vehicle dataset enriched with segmentation masks and annotated license plates. The YOLOv8 detector achieved a mean Average Precision (mAP@0.5) of approximately 71% for bounding boxes and 70% for segmentation masks. Character-level OCR accuracy reached up to 99% with the best-performing CNN model. These results affirm the feasibility of combining real-time object detection with deep OCR for practical deployment in smart transportation systems, offering a scalable solution for automated, vehicle-specific carbon emission monitoring.
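The mAP@0.5 figure quoted above rests on an intersection-over-union (IoU) test: a predicted box counts as a match for a ground-truth box when their IoU is at least 0.5. The sketch below shows only that matching criterion, not the full mAP computation and not the paper's evaluation code. The `(x1, y1, x2, y2)` box layout and the function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def matches_at_threshold(predicted, ground_truth, threshold=0.5):
    """True when the prediction overlaps the ground truth enough to count."""
    return iou(predicted, ground_truth) >= threshold


# Hypothetical boxes: a detection shifted slightly from the annotated vehicle.
annotated = (10.0, 10.0, 110.0, 60.0)
detected = (20.0, 15.0, 120.0, 65.0)
is_match = matches_at_threshold(detected, annotated)
```

In a full evaluation, matches like this are accumulated over confidence-ranked detections to build a precision-recall curve per class, and the mean of the per-class average precisions gives mAP.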

artificial intelligence, machine learning, optical character recognition, (15 more...)

arXiv.org Artificial Intelligence

2506.18924

Country:

North America > United States (0.46)
Europe > United Kingdom (0.14)
Asia > China (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Transportation > Ground > Road (1.00)
Energy (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Optimizing Multilingual Text-To-Speech with Accents & Emotions

Pawar, Pranav, Dwivedi, Akshansh, Boricha, Jenish, Gohil, Himanshu, Dubey, Aditya

arXiv.org Artificial Intelligence, Jun-23-2025

State-of-the-art text-to-speech (TTS) systems realize high naturalness in monolingual environments, but synthesizing speech with correct multilingual accents (especially for Indic languages) and context-relevant emotions still poses difficulty owing to cultural nuance discrepancies in current frameworks. This paper introduces a new TTS architecture that integrates accent modelling and transliteration preservation with multi-scale emotion modelling, tuned in particular for Hindi and Indian English accents. Our approach extends the Parler-TTS model by integrating a language-specific phoneme alignment hybrid encoder-decoder architecture and culture-sensitive emotion embedding layers trained on native speaker corpora, and by incorporating dynamic accent code switching with residual vector quantization. Quantitative tests demonstrate 23.7% improvement in accent accuracy (Word Error Rate reduction from 15.4% to 11.8%) and 85.3% emotion recognition accuracy from native listeners, surpassing METTS and VECL-TTS baselines. The novelty of the system is that it can mix code in real time, generating statements such as "Namaste, let's talk about " with uninterrupted accent shifts while preserving emotional consistency. Subjective evaluation with 200 users reported a mean opinion score (MOS) of 4.2/5 for cultural correctness, much better than existing multilingual systems (p<0.01). This research makes cross-lingual synthesis more feasible by showcasing scalable accent-emotion disentanglement, with direct application in South Asian EdTech and accessibility software.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.1631

Genre: Research Report > New Finding (0.34)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)
