Optical Character Recognition
Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech
Kim, Taesoo, Kim, Jinju, Kim, Dongchan, Ko, Jong Hwan, Park, Gyeong-Moon
The rapid advancement of Zero-Shot Text-to-Speech (ZS-TTS) technology has enabled high-fidelity voice synthesis from minimal audio cues, raising significant privacy and ethical concerns. Despite these threats to voice privacy, selectively removing the knowledge needed to replicate specific individuals' voices from pre-trained model parameters has not been explored. In this paper, we address the new challenge of speaker identity unlearning for ZS-TTS systems. To meet this goal, we propose the first machine unlearning frameworks for ZS-TTS, notably Teacher-Guided Unlearning (TGU), designed to ensure the model forgets designated speaker identities while retaining its ability to generate accurate speech for other speakers. Our proposed methods incorporate randomness to prevent consistent replication of forget speakers' voices, ensuring that unlearned identities remain untraceable. Additionally, we propose a new evaluation metric, speaker-Zero Retrain Forgetting (spk-ZRF), which assesses the model's ability to disregard prompts associated with forgotten speakers, effectively neutralizing its knowledge of these voices. Experiments conducted on a state-of-the-art model demonstrate that TGU prevents the model from replicating forget speakers' voices while maintaining high quality for other speakers. The demo is available at https://speechunlearn.github.io/
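The abstract's core mechanism (a frozen teacher that sees a randomized speaker prompt for forget-set speakers) can be stated compactly. Below is a minimal PyTorch sketch, assuming embedding-space inputs and generic teacher/student modules; none of these names come from the authors' released code.

    import torch
    import torch.nn.functional as F

    def tgu_loss(student, teacher, text_emb, prompt_emb, is_forget):
        # For forget speakers, the frozen teacher is driven by a *random*
        # speaker prompt, so the student never learns to reproduce the
        # forgotten voice; retained speakers use the true prompt as-is.
        with torch.no_grad():
            teacher_prompt = torch.randn_like(prompt_emb) if is_forget else prompt_emb
            target = teacher(text_emb, teacher_prompt)
        return F.mse_loss(student(text_emb, prompt_emb), target)

The randomness is what makes unlearned identities untraceable: a forget-speaker prompt maps to an arbitrary voice rather than to any fixed substitute.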
- North America > Canada > Quebec > Montreal (0.04)
- Europe > United Kingdom > North Sea > Southern North Sea (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Africa > Guinea > Kankan Region > Kankan Prefecture > Kankan (0.04)
- Research Report > Promising Solution (0.49)
- Research Report > Experimental Study (0.46)
- Research Report > New Finding (0.46)
- Information Technology > Security & Privacy (1.00)
- Law (0.93)
This built-in Windows 11 app can pull the text in any image with one click
Microsoft has added an OCR function (Optical Character Recognition) to the Windows Photos app, which basically means it can now recognize text in an image and instantly extract it for you. To use this feature, open any image that contains words or lines of text using the Photos app. Then, click on the "Scan text" button--which looks like a rounded square with three lines of text inside--located at the bottom of the app window. Once clicked, the Photos app will scan the image and highlight all of the text it finds. You can then interact with it like it's actually text, meaning you can highlight passages with your cursor and right-click to perform actions like copying text, selecting all text, or using Bing Search to look up whatever text you currently have highlighted.
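For readers who want the same image-to-text step outside the Photos app, the open-source Tesseract engine does the equivalent job in a few lines of Python (an analogue only; this is not the engine Photos uses internally):

    from PIL import Image
    import pytesseract  # requires the Tesseract binary to be installed

    text = pytesseract.image_to_string(Image.open("screenshot.png"))
    print(text)  # extracted text, ready to copy or search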
This robot scans rare library books at 2,500 pages per hour
For decades, preservationists charged with digitizing rare books have faced an ironic challenge. The whole point of scanning these often one-of-a-kind objects is to keep the delicate manuscripts from harm. Doing that, however, has required a much more hands-on approach. One of the first solutions was to simply place a tome in a book cradle, then photograph each individual page. In later years, archivists increasingly relied on more advanced top-down document camera arrays.
- Information Technology > Artificial Intelligence > Robots (0.47)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis
Szankin, Maciej, Venkatasamy, Vidhyananth, Ying, Lihang
Outdoor advertisements remain a critical medium for modern marketing, yet accurately verifying billboard text visibility under real-world conditions is still challenging. Traditional Optical Character Recognition (OCR) pipelines excel at cropped text recognition but often struggle with complex outdoor scenes, varying fonts, and weather-induced visual noise. Recently, multimodal Vision-Language Models (VLMs) have emerged as promising alternatives, offering end-to-end scene understanding with no explicit detection step. This work systematically benchmarks representative VLMs, including Qwen 2.5 VL 3B, InternVL3, and SmolVLM2, against a compact CNN-based OCR baseline (PaddleOCRv4) across two public datasets (ICDAR 2015 and SVT), augmented with synthetic weather distortions to simulate realistic degradation. Our results reveal that while selected VLMs excel at holistic scene reasoning, lightweight CNN pipelines still achieve competitive accuracy for cropped text at a fraction of the computational cost, an important consideration for edge deployment. To foster future research, we release our weather-augmented benchmark and evaluation code publicly <link provided upon acceptance>.
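The paper's evaluation recipe (synthetic weather degradation applied to cropped-text images, then accuracy scored per recognizer) is straightforward to sketch. The snippet below is a hedged reconstruction, not the released code; `recognize` stands in for PaddleOCR or a VLM, and Gaussian blur is just one plausible stand-in for weather noise.

    from PIL import Image, ImageFilter

    def word_accuracy(samples, recognize, blur_radius=2.0):
        # samples: iterable of (image_path, ground_truth_word) pairs
        correct, total = 0, 0
        for path, truth in samples:
            img = Image.open(path).filter(ImageFilter.GaussianBlur(blur_radius))
            pred = recognize(img).strip().lower()
            correct += int(pred == truth.strip().lower())
            total += 1
        return correct / max(total, 1)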
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.87)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.87)
UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching
Glazer, Neta, Navon, Aviv, Segal, Yael, Shamsian, Aviv, Segev, Hilit, Buchnick, Asaf, Pirchi, Menachem, Hetz, Gil, Keshet, Joseph
Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly generates both speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the absence of training data in which speech and background audio are aligned in a natural context. To overcome this lack of paired data, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environmentally aware audio.
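Flow matching itself reduces to a simple regression objective. A minimal sketch of one conditional flow-matching training step is below; `model`, the tensor shapes, and the conditioning scheme are assumptions for illustration, not UmbraTTS's actual architecture.

    import torch
    import torch.nn.functional as F

    def cfm_step(model, x1, cond):
        # x1: target audio features (B, C, T); cond: text + acoustic context
        x0 = torch.randn_like(x1)              # noise endpoint of the path
        t = torch.rand(x1.size(0), 1, 1)       # random time in [0, 1]
        xt = (1 - t) * x0 + t * x1             # linear interpolation path
        v_target = x1 - x0                     # its constant velocity
        v_pred = model(xt, t.squeeze(), cond)  # predicted velocity field
        return F.mse_loss(v_pred, v_target)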
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)
Super Kawaii Vocalics: Amplifying the "Cute" Factor in Computer Voice
Mandai, Yuto, Seaborn, Katie, Nakano, Tomoyasu, Sun, Xin, Wang, Yijia, Kato, Jun
"Kawaii" is the Japanese concept of cute, which carries sociocultural connotations related to social identities and emotional responses. Yet, virtually all work to date has focused on the visual side of kawaii, including in studies of computer agents and social robots. In pursuit of formalizing the new science of kawaii vocalics, we explored what elements of voice relate to kawaii and how they might be manipulated, manually and automatically. We conducted a four-phase study (grand N = 512) with two varieties of computer voices: text-to-speech (TTS) and game character voices. We found kawaii "sweet spots" through manipulation of fundamental and formant frequencies, but only for certain voices and to a certain extent. Findings also suggest a ceiling effect for the kawaii vocalics of certain voices. We offer empirical validation of the preliminary kawaii vocalics model and an elementary method for manipulating kawaii perceptions of computer voice.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > United Kingdom > England > Greater London > London (0.14)
- Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.06)
- (15 more...)
- Questionnaire & Opinion Survey (1.00)
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.93)
- Media > Music (0.93)
- Health & Medicine (0.88)
- Leisure & Entertainment > Games > Computer Games (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.46)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.34)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.34)
- Information Technology > Artificial Intelligence > Robots > Robots in the Home (0.34)
Logios: An open source Greek Polytonic Optical Character Recognition system
Perifanos, Konstantinos, Goutsos, Dionisis
In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use. From the introduction: historical Greek polytonic scripts have a rather complex target vocabulary and a varied set of rules, resulting in a large character set of more than 200 characters, including the acute accent, the grave accent, the circumflex, the rough breathing (dasi pneuma), the smooth breathing (psilon pneuma), the diaeresis, and the iota subscript [1].
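The convolutional-plus-recurrent design the abstract names is the classic CRNN recipe, typically trained with a CTC loss over the 200+ character polytonic alphabet. A generic PyTorch sketch follows; all layer sizes are illustrative assumptions, not the released Logios model.

    import torch.nn as nn

    class CRNN(nn.Module):
        def __init__(self, n_chars=220):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
            self.fc = nn.Linear(512, n_chars + 1)  # +1 for the CTC blank

        def forward(self, x):                      # x: (B, 1, 32, W) line image
            f = self.conv(x)                       # -> (B, 128, 8, W/4)
            f = f.permute(0, 3, 1, 2).flatten(2)   # -> (B, W/4, 1024)
            out, _ = self.rnn(f)
            return self.fc(out)                    # per-timestep logits for CTC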
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
Minixhofer, Christoph, Klejch, Ondrej, Bell, Peter
Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one of the 16 metrics compared to achieve a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: a dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.
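The headline criterion (rank correlation between an objective metric and subjective scores across systems) is concrete enough to show directly. A minimal example with made-up placeholder numbers:

    from scipy.stats import spearmanr

    metric_scores = [0.71, 0.64, 0.82, 0.55, 0.90]  # objective metric per TTS system
    mos_scores    = [3.9, 3.5, 4.2, 3.1, 4.6]       # subjective MOS per system
    rho, p = spearmanr(metric_scores, mos_scores)
    print(f"Spearman rho = {rho:.2f}")              # TTSDS2's bar is rho > 0.50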
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.81)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Connecting Vision and Emissions: A Behavioural AI Approach to Carbon Estimation in Road Design
Mhdawi, Ammar K Al, Nnamoko, Nonso, Raafat, Safanah Mudheher, Al-Mhdawi, M. K. S., Humaidi, Amjad J
We present an enhanced YOLOv8 real-time vehicle detection and classification framework for estimating carbon emissions in urban environments. The system extends the YOLOv8 architecture to detect, segment, and track vehicles from live traffic video streams. Once a vehicle is localized, a dedicated deep learning-based identification module is employed to recognize license plates and classify vehicle types. Since YOLOv8 lacks the built-in capacity for fine-grained recognition tasks such as reading license plates or determining vehicle attributes beyond class labels, our framework incorporates a hybrid pipeline in which each detected vehicle is tracked and its bounding box is cropped and passed to a deep Optical Character Recognition (OCR) module. This OCR system, composed of multiple convolutional neural network (CNN) layers, is trained specifically for character-level detection and license plate decoding under varied conditions such as motion blur, occlusion, and diverse font styles. Additionally, the recognized plate information is validated using a real-time API that cross-references an external vehicle registration database to ensure accurate classification and emission estimation. This multi-stage approach enables precise, automated calculation of per-vehicle carbon emissions. Extensive evaluation was conducted using a diverse vehicle dataset enriched with segmentation masks and annotated license plates. The YOLOv8 detector achieved a mean Average Precision (mAP@0.5) of approximately 71% for bounding boxes and 70% for segmentation masks. Character-level OCR accuracy reached up to 99% with the best-performing CNN model. These results affirm the feasibility of combining real-time object detection with deep OCR for practical deployment in smart transportation systems, offering a scalable solution for automated, vehicle-specific carbon emission monitoring.
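The detect-crop-read structure of the hybrid pipeline maps directly onto the public Ultralytics API. A hedged sketch, with `read_plate` standing in for the paper's custom CNN OCR module:

    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")  # any YOLOv8 detection weights

    def plates_in_frame(frame, read_plate):
        results = model(frame)[0]
        plates = []
        for x1, y1, x2, y2 in results.boxes.xyxy.int().tolist():
            crop = frame[y1:y2, x1:x2]       # numpy crop of the detected vehicle
            plates.append(read_plate(crop))  # assumed character-level OCR step
        return plates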
- North America > United States (0.46)
- Europe > United Kingdom (0.14)
- Asia > China (0.14)
- (5 more...)
- Transportation > Ground > Road (1.00)
- Energy (1.00)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW
Mehta, Prateek, Patil, Anasuya
Knowledge extraction just by listening to sounds is known as a distinctive property. Visually impaired people depend solely on Braille books and audio recordings provided by NGOs, and owing to the many constraints of these two approaches, blind people cannot access books of their choice. Speech is a more effective means of communication than text, since blind and visually impaired persons can respond easily to sounds. This paper aims to develop an accurate, reliable, cost-effective, and user-friendly optical character recognition (OCR) based speech synthesis system.
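The paper builds its OCR-to-speech chain in LabVIEW; an equivalent open-source sketch in Python (an analogue for illustration, not the authors' implementation) chains Tesseract OCR into an offline TTS engine:

    from PIL import Image
    import pytesseract  # requires the Tesseract binary
    import pyttsx3      # offline text-to-speech engine

    text = pytesseract.image_to_string(Image.open("book_page.png"))
    engine = pyttsx3.init()
    engine.say(text)    # read the recognized page aloud
    engine.runAndWait()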
- Oceania > Australia > South Australia > Adelaide (0.04)
- Asia > Middle East > Oman (0.04)
- Asia > India > Tamil Nadu > Chennai (0.04)
- Asia > India > Jharkhand > Ranchi (0.04)