 Optical Character Recognition


Improving OCR using internal document redundancy

arXiv.org Artificial Intelligence

Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle to recognize low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervised method that leverages the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and suggest better clustering. To this end, we introduce an extended Gaussian Mixture Model (GMM) that alternates an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and statistical normality testing. We demonstrate improvements on documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
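
For readers unfamiliar with the EM step mentioned in the abstract above, the following is a minimal sketch of plain EM for a one-dimensional Gaussian mixture. It is not the paper's extended model: the intra-cluster realignment and normality testing are omitted, and the function names, the synthetic data, and the initial means are all assumptions made for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, init_means, n_iter=50):
    """Plain EM for a 1-D Gaussian mixture (illustrative only)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p) or 1e-300  # guard against division by zero
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances


# Synthetic stand-in for two clusters of character-shape features (hypothetical data).
random.seed(0)
data = [random.gauss(0.0, 0.5) for _ in range(200)]
data += [random.gauss(5.0, 0.5) for _ in range(200)]
weights, means, variances = fit_gmm_1d(data, init_means=[1.0, 4.0])
```

In a real document the points would be high-dimensional character images rather than scalars, so this only conveys the alternation between soft assignment and parameter re-estimation.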



Calibrated Structured Prediction

Neural Information Processing Systems

In user-facing applications, displaying calibrated confidence measures (probabilities that correspond to true frequency) can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.
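
The recalibration idea in the abstract above (train a binary classifier to predict whether an event of interest is true) can be illustrated with a small sketch. This is only an assumed setup, not the paper's method or feature set: the single feature is the model's raw confidence, the classifier is a hand-written logistic regression, and the data simulate an overconfident model whose true accuracy is 0.2 below its reported confidence.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_recalibrator(features, labels, lr=3.0, epochs=1000):
    """Logistic regression by full-batch gradient descent (illustrative only).

    The large step size is stable here because every feature lies in [0, 1].
    """
    n_feat = len(features[0])
    w = [0.0] * n_feat
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(n_feat):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def recalibrate(w, b, x):
    """Calibrated probability that the event described by features x is true."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical data: each event is, for example, "this predicted character is correct".
random.seed(0)
raw_conf = [random.uniform(0.5, 1.0) for _ in range(400)]
labels = [1 if random.random() < c - 0.2 else 0 for c in raw_conf]
features = [[c] for c in raw_conf]

w, b = train_recalibrator(features, labels)
calibrated = [recalibrate(w, b, x) for x in features]

mean_raw = sum(raw_conf) / len(raw_conf)
mean_cal = sum(calibrated) / len(calibrated)
accuracy = sum(labels) / len(labels)
```

After training, the average recalibrated probability should sit close to the observed accuracy, while the average raw confidence overstates it. Richer features than the raw confidence alone would be needed in a realistic structured setting.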


Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback

arXiv.org Artificial Intelligence

Diffusion models produce high-fidelity speech but are inefficient for real-time use because they require many denoising steps and have difficulty modeling intonation and rhythm. To address this, we propose Diffusion Loss-Guided Policy Optimization (DLPO), an RLHF framework for TTS diffusion models. DLPO integrates the original training loss into the reward function, preserving generative capabilities while reducing inefficiencies. Using naturalness scores as feedback, DLPO aligns reward optimization with the diffusion model's structure, improving speech quality. We evaluate DLPO on WaveGrad 2, a non-autoregressive diffusion-based TTS model. Results show significant improvements in objective metrics (UTMOS 3.65, NISQA 4.02) and subjective evaluations, with DLPO audio preferred 67% of the time. These findings demonstrate DLPO's potential for efficient, high-quality diffusion TTS in real-time, resource-limited settings.


Finally, a scanner that fits in your pocket and never jams

Popular Science

Instant scans and searchable PDFs are in. SwiftScan VIP puts the power of a full office scanner into your smartphone--and adds some cool extras, too. For a limited time, get lifetime access to SwiftScan for just 41.99 with promo code TAKE30. SwiftScan captures ultra-clear documents with automatic edge detection, color correction, and blur reduction--all starting at 200 dpi. It handles everything from receipts and handwritten notes to contracts, barcodes, business cards, and whiteboards.


Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech

arXiv.org Artificial Intelligence

The rapid advancement of Zero-Shot Text-to-Speech (ZS-TTS) technology has enabled high-fidelity voice synthesis from minimal audio cues, raising significant privacy and ethical concerns. Despite the threats to voice privacy, research to selectively remove the knowledge to replicate unwanted individual voices from pre-trained model parameters has not been explored. In this paper, we address the new challenge of speaker identity unlearning for ZS-TTS systems. To meet this goal, we propose the first machine unlearning frameworks for ZS-TTS, especially Teacher-Guided Unlearning (TGU), designed to ensure the model forgets designated speaker identities while retaining its ability to generate accurate speech for other speakers. Our proposed methods incorporate randomness to prevent consistent replication of forget speakers' voices, assuring unlearned identities remain untraceable. Additionally, we propose a new evaluation metric, speaker-Zero Retrain Forgetting (spk-ZRF). This assesses the model's ability to disregard prompts associated with forgotten speakers, effectively neutralizing its knowledge of these voices. The experiments conducted on the state-of-the-art model demonstrate that TGU prevents the model from replicating forget speakers' voices while maintaining high quality for other speakers. The demo is available at https://speechunlearn.github.io/


This built-in Windows 11 app can pull the text in any image with one click

PCWorld

Microsoft has added an OCR function (Optical Character Recognition) to the Windows Photos app, which basically means it can now recognize text in an image and instantly extract it for you. To use this feature, open any image that contains words or lines of text using the Photos app. Then, click on the "Scan text" button--which looks like a rounded square with three lines of text inside--located at the bottom of the app window. Once clicked, the Photos app will scan the image and highlight all of the text it finds. You can then interact with it like it's actually text, meaning you can highlight passages with your cursor and right-click to perform actions like copying text, selecting all text, or using Bing Search to look up whatever text you currently have highlighted.


This robot scans rare library books at 2,500 pages per hour

Popular Science

For decades, preservationists charged with digitizing rare books have faced an ironic challenge. The whole point of scanning these often one-of-a-kind objects is to keep the delicate manuscripts from harm. Doing that, however, required a much more hands-on approach. One of the first solutions was to simply place a tome in a book cradle, then photograph each individual page. In later years, archivists increasingly relied on more advanced top-down document camera arrays.


Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis

arXiv.org Artificial Intelligence

Outdoor advertisements remain a critical medium for modern marketing, yet accurately verifying billboard text visibility under real-world conditions is still challenging. Traditional Optical Character Recognition (OCR) pipelines excel at cropped text recognition but often struggle with complex outdoor scenes, varying fonts, and weather-induced visual noise. Recently, multimodal Vision-Language Models (VLMs) have emerged as promising alternatives, offering end-to-end scene understanding with no explicit detection step. This work systematically benchmarks representative VLMs--including Qwen 2.5 VL 3B, InternVL3, and SmolVLM2--against a compact CNN-based OCR baseline (PaddleOCRv4) across two public datasets (ICDAR 2015 and SVT), augmented with synthetic weather distortions to simulate realistic degradation. Our results reveal that while selected VLMs excel at holistic scene reasoning, lightweight CNN pipelines still achieve competitive accuracy for cropped text at a fraction of the computational cost--an important consideration for edge deployment. To foster future research, we release our weather-augmented benchmark and evaluation code publicly <link provided upon acceptance>.


UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

arXiv.org Artificial Intelligence

Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly generates both speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the lack of data with speech and background audio aligned in natural context. To overcome this, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environmentally aware audio.