Optical Character Recognition


Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech

Kim, Taesoo, Kim, Jinju, Kim, Dongchan, Ko, Jong Hwan, Park, Gyeong-Moon

arXiv.org Artificial Intelligence

The rapid advancement of Zero-Shot Text-to-Speech (ZS-TTS) technology has enabled high-fidelity voice synthesis from minimal audio cues, raising significant privacy and ethical concerns. Despite these threats to voice privacy, research on selectively removing the knowledge required to replicate specific individuals' voices from pre-trained model parameters has remained unexplored. In this paper, we address the new challenge of speaker identity unlearning for ZS-TTS systems. To meet this goal, we propose the first machine unlearning frameworks for ZS-TTS, most notably Teacher-Guided Unlearning (TGU), designed to ensure the model forgets designated speaker identities while retaining its ability to generate accurate speech for other speakers. Our proposed methods incorporate randomness to prevent consistent replication of forget speakers' voices, ensuring that unlearned identities remain untraceable. Additionally, we propose a new evaluation metric, speaker-Zero Retrain Forgetting (spk-ZRF), which assesses the model's ability to disregard prompts associated with forgotten speakers, effectively neutralizing its knowledge of these voices. Experiments conducted on a state-of-the-art model demonstrate that TGU prevents the model from replicating forget speakers' voices while maintaining high quality for other speakers. The demo is available at https://speechunlearn.github.io/
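The randomness-based forgetting idea above can be sketched as a toy objective. Everything here (the function name, plain-list "features", and the MSE form) is a hypothetical illustration of the general idea, not the paper's actual TGU loss:

```python
def tgu_loss(student_out, teacher_out, random_out, is_forget, w=1.0):
    """Toy teacher-guided unlearning objective (illustrative only).

    For retain speakers, the student mimics the teacher's output; for
    forget speakers, it instead mimics the teacher's output for a
    randomly chosen voice, so the forgotten identity is never
    consistently reproduced. Vectors are plain lists of floats
    standing in for acoustic features.
    """
    target = random_out if is_forget else teacher_out
    # Mean-squared error between student output and the chosen target.
    return w * sum((s - t) ** 2 for s, t in zip(student_out, target)) / len(target)
```

On a retain speaker, the loss is zero when the student matches the teacher exactly; on a forget speaker, matching the original teacher output no longer minimizes the loss.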


This built-in Windows 11 app can pull the text in any image with one click

PCWorld

Microsoft has added an OCR function (Optical Character Recognition) to the Windows Photos app, which basically means it can now recognize text in an image and instantly extract it for you. To use this feature, open any image that contains words or lines of text using the Photos app. Then, click on the "Scan text" button--which looks like a rounded square with three lines of text inside--located at the bottom of the app window. Once clicked, the Photos app will scan the image and highlight all of the text it finds. You can then interact with it like it's actually text, meaning you can highlight passages with your cursor and right-click to perform actions like copying text, selecting all text, or using Bing Search to look up whatever text you currently have highlighted.


This robot scans rare library books at 2,500 pages per hour

Popular Science

For decades, preservationists charged with digitizing rare books have faced an ironic challenge. The whole point of scanning these often one-of-a-kind objects is to keep the delicate manuscripts from harm. Doing that, however, has required a much more hands-on approach. One of the first solutions was simply to place a tome in a book cradle, then photograph each individual page. In later years, archivists increasingly relied on more advanced top-down document camera arrays.


Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis

Szankin, Maciej, Venkatasamy, Vidhyananth, Ying, Lihang

arXiv.org Artificial Intelligence

Outdoor advertisements remain a critical medium for modern marketing, yet accurately verifying billboard text visibility under real-world conditions is still challenging. Traditional Optical Character Recognition (OCR) pipelines excel at cropped text recognition but often struggle with complex outdoor scenes, varying fonts, and weather-induced visual noise. Recently, multimodal Vision-Language Models (VLMs) have emerged as promising alternatives, offering end-to-end scene understanding with no explicit detection step. This work systematically benchmarks representative VLMs--including Qwen 2.5 VL 3B, InternVL3, and SmolVLM2--against a compact CNN-based OCR baseline (PaddleOCRv4) across two public datasets (ICDAR 2015 and SVT), augmented with synthetic weather distortions to simulate realistic degradation. Our results reveal that while selected VLMs excel at holistic scene reasoning, lightweight CNN pipelines still achieve competitive accuracy for cropped text at a fraction of the computational cost--an important consideration for edge deployment. To foster future research, we release our weather-augmented benchmark and evaluation code publicly <link provided upon acceptance>.
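For context, cropped-text recognition benchmarks such as ICDAR 2015 and SVT are conventionally scored by exact-match word accuracy. A minimal sketch of that metric (the function name and case handling are illustrative assumptions, not the paper's evaluation code):

```python
def word_accuracy(preds, refs, case_sensitive=False):
    """Exact-match word accuracy over paired predictions and references,
    the standard score for cropped-text recognition benchmarks."""
    if not case_sensitive:
        preds = [p.lower() for p in preds]
        refs = [r.lower() for r in refs]
    # Count predictions that match their reference exactly.
    hits = sum(p == r for p, r in zip(preds, refs))
    return hits / len(refs)
```

Case-insensitive matching is the common default for these benchmarks, since scene-text fonts mix cases freely.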


UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

Glazer, Neta, Navon, Aviv, Segal, Yael, Shamsian, Aviv, Segev, Hilit, Buchnick, Asaf, Pirchi, Menachem, Hetz, Gil, Keshet, Joseph

arXiv.org Artificial Intelligence

Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly generates both speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the lack of data in which speech and background audio are aligned in natural context; to overcome this, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environmentally aware audio.
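Flow matching, the generative backbone named above, trains a network to regress a velocity field along an interpolation path between noise and data. A minimal sketch of how a training pair yields the regression target, assuming the common linear (rectified-flow) path rather than UmbraTTS's exact formulation:

```python
def cfm_pair(x0, x1, t):
    """Conditional flow-matching training pair (sketch).

    Linear path: x_t = (1 - t) * x0 + t * x1, whose target velocity
    is the constant x1 - x0. A flow-matching TTS model regresses its
    predicted velocity v_theta(x_t, t, conditioning) onto this target;
    here x0 stands in for noise and x1 for a clean audio feature.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target
```

At inference, sampling integrates the learned velocity field from noise (t=0) to data (t=1), with text and acoustic-context conditioning steering the trajectory.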


Super Kawaii Vocalics: Amplifying the "Cute" Factor in Computer Voice

Mandai, Yuto, Seaborn, Katie, Nakano, Tomoyasu, Sun, Xin, Wang, Yijia, Kato, Jun

arXiv.org Artificial Intelligence

"Kawaii" is the Japanese concept of cute, which carries sociocultural connotations related to social identities and emotional responses. Yet, virtually all work to date has focused on the visual side of kawaii, including in studies of computer agents and social robots. In pursuit of formalizing the new science of kawaii vocalics, we explored what elements of voice relate to kawaii and how they might be manipulated, manually and automatically. We conducted a four-phase study (grand N = 512) with two varieties of computer voices: text-to-speech (TTS) and game character voices. We found kawaii "sweet spots" through manipulation of fundamental and formant frequencies, but only for certain voices and to a certain extent. Findings also suggest a ceiling effect for the kawaii vocalics of certain voices. We offer empirical validation of the preliminary kawaii vocalics model and an elementary method for manipulating kawaii perceptions of computer voice.


Logios : An open source Greek Polytonic Optical Character Recognition system

Perifanos, Konstantinos, Goutsos, Dionisis

arXiv.org Artificial Intelligence

In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use. I. Introduction: Historical Greek polytonic scripts have a rather complex target vocabulary and a varied set of rules, resulting in a large character set of more than 200 characters, including the acute accent, the grave accent, the circumflex, the rough breathing (dasy pneuma), the smooth breathing (psilon pneuma), the diaeresis, and the iota subscript [1].
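CRNN-style OCR systems like the one described typically decode their recurrent-layer outputs with CTC. A minimal sketch of greedy CTC decoding (the collapse-repeats-then-drop-blanks rule), assuming per-frame integer class labels with 0 as the blank; this is a generic illustration, not the Logios codebase:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame argmax label sequence into a character
    label sequence, CTC-style: merge consecutive repeats first, then
    drop blanks. This is the usual decoding step after the recurrent
    layers of a CRNN OCR model."""
    out, prev = [], None
    for lab in frame_labels:
        # Emit only on a change of label, and never emit the blank.
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```

Note the ordering matters: blanks between two identical labels (e.g. `1, 0, 1`) separate them into two emissions, which is how CTC represents doubled characters.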


TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

Minixhofer, Christoph, Klejch, Ondrej, Bell, Peter

arXiv.org Artificial Intelligence

Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one of the 16 metrics compared to achieve a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: a dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.
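Since the headline result is stated in terms of Spearman correlation, a pure-Python sketch of the metric (Pearson correlation of ranks, with average ranks for ties) may be useful for readers reproducing such comparisons:

```python
def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Tied values receive the average of their 1-based rank positions."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # Extend j over the group of values tied with v[order[i]].
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A correlation above 0.50 on this scale means the objective metric's ranking of systems agrees substantially with the ranking induced by human opinion scores.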


Connecting Vision and Emissions: A Behavioural AI Approach to Carbon Estimation in Road Design

Mhdawi, Ammar K Al, Nnamoko, Nonso, Raafat, Safanah Mudheher, Al-Mhdawi, M. K. S., Humaidi, Amjad J

arXiv.org Artificial Intelligence

We present an enhanced YOLOv8 real-time vehicle detection and classification framework for estimating carbon emissions in urban environments. The system extends the YOLOv8 architecture to detect, segment, and track vehicles from live traffic video streams. Once a vehicle is localized, a dedicated deep learning-based identification module is employed to recognize license plates and classify vehicle types. Since YOLOv8 lacks built-in capacity for fine-grained recognition tasks such as reading license plates or determining vehicle attributes beyond class labels, our framework incorporates a hybrid pipeline in which each detected vehicle is tracked and its bounding box is cropped and passed to a deep Optical Character Recognition (OCR) module. This OCR system, composed of multiple convolutional neural network (CNN) layers, is trained specifically for character-level detection and license plate decoding under varied conditions such as motion blur, occlusion, and diverse font styles. Additionally, the recognized plate information is validated using a real-time API that cross-references an external vehicle registration database to ensure accurate classification and emission estimation. This multi-stage approach enables precise, automated calculation of per-vehicle carbon emissions. Extensive evaluation was conducted using a diverse vehicle dataset enriched with segmentation masks and annotated license plates. The YOLOv8 detector achieved a mean Average Precision (mAP@0.5) of approximately 71% for bounding boxes and 70% for segmentation masks. Character-level OCR accuracy reached up to 99% with the best-performing CNN model. These results affirm the feasibility of combining real-time object detection with deep OCR for practical deployment in smart transportation systems, offering a scalable solution for automated, vehicle-specific carbon emission monitoring.
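The final per-vehicle emission step can be sketched as a class-to-factor lookup over the pipeline's detections. The emission factors, function name, and detection format below are hypothetical placeholders for illustration, not values from the paper or any registration database:

```python
# Hypothetical per-kilometre emission factors (grams CO2) by vehicle
# class; a real system would look these up per-plate in a registration
# database rather than use class-level averages.
EMISSION_G_PER_KM = {"car": 120.0, "van": 180.0, "truck": 300.0, "bus": 280.0}

def estimate_emissions(detections, distance_km=1.0):
    """Sum estimated CO2 in grams for (vehicle_class, plate) detections
    over a monitored road segment of the given length."""
    total = 0.0
    for vehicle_class, _plate in detections:
        # Unknown classes contribute zero rather than raising.
        total += EMISSION_G_PER_KM.get(vehicle_class, 0.0) * distance_km
    return total
```

In the described system, the OCR-decoded plate would replace the class-level average with a registration-specific factor, which is what makes the plate-reading stage matter for emission accuracy.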


An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW

Mehta, Prateek, Patil, Anasuya

arXiv.org Artificial Intelligence

Abstract: Knowledge extraction just by listening to sounds is a distinctive human ability. Visually impaired people depend solely on Braille books and audio recordings provided by NGOs, and owing to the constraints of these two approaches, blind people often cannot access books of their choice. Speech is a more effective means of communication than text, since blind and visually impaired persons can respond easily to sounds. This paper aims to develop an accurate, reliable, cost-effective, and user-friendly optical character recognition (OCR)-based speech synthesis system.
