Optical Character Recognition
Super Kawaii Vocalics: Amplifying the "Cute" Factor in Computer Voice
Mandai, Yuto, Seaborn, Katie, Nakano, Tomoyasu, Sun, Xin, Wang, Yijia, Kato, Jun
"Kawaii" is the Japanese concept of cute, which carries sociocultural connotations related to social identities and emotional responses. Yet, virtually all work to date has focused on the visual side of kawaii, including in studies of computer agents and social robots. In pursuit of formalizing the new science of kawaii vocalics, we explored what elements of voice relate to kawaii and how they might be manipulated, manually and automatically. We conducted a four-phase study (grand N = 512) with two varieties of computer voices: text-to-speech (TTS) and game character voices. We found kawaii "sweet spots" through manipulation of fundamental and formant frequencies, but only for certain voices and to a certain extent. Findings also suggest a ceiling effect for the kawaii vocalics of certain voices. We offer empirical validation of the preliminary kawaii vocalics model and an elementary method for manipulating kawaii perceptions of computer voice.
Logios : An open source Greek Polytonic Optical Character Recognition system
Konstantinos, Perifanos, Dionisis, Goutsos
In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use.
I. Introduction
Historical Greek polytonic scripts have a rather complex target vocabulary and a varied set of rules, resulting in a large character set of more than 200 characters that combine letters with the acute accent, the grave accent, the circumflex, the rough breathing (dasi pneuma), the smooth breathing (psilon pneuma), the diaeresis and the iota subscript [1].
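To illustrate why the polytonic character set grows so large, the following minimal sketch decomposes one precomposed polytonic letter into its base letter and diacritics using Unicode normalization. It is not part of Logios; the helper name `decompose` and the sample character are assumptions chosen for illustration.

```python
import unicodedata


def decompose(ch):
    """Return the Unicode names of the code points in the NFD form of `ch`.

    Hypothetical helper for illustration only; not taken from Logios.
    """
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]


# U+1F85: alpha carrying a rough breathing, an acute accent and an iota subscript.
sample = "\u1f85"

for name in decompose(sample):
    print(name)
```

Each combination of base letter and marks is a distinct recognition target when the text is kept in precomposed form, which is one way a character set can exceed 200 symbols.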
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
Minixhofer, Christoph, Klejch, Ondrej, Bell, Peter
Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one out of 16 compared metrics to correlate with a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: A dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.
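As a rough illustration of the validation criterion mentioned above, the sketch below computes a Spearman rank correlation between an objective metric and subjective scores. It is not the TTSDS2 code; the function names and the four sample values are made up for the example.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-system values: an objective metric and a mean opinion score.
metric_scores = [0.2, 0.5, 0.4, 0.9]
opinion_scores = [2.1, 3.0, 3.4, 4.5]
print(round(spearman(metric_scores, opinion_scores), 3))  # prints 0.8
```

In this toy case two systems swap places between the two rankings, so the correlation is 0.8, which would clear the 0.50 threshold quoted in the abstract.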
Connecting Vision and Emissions: A Behavioural AI Approach to Carbon Estimation in Road Design
Mhdawi, Ammar K Al, Nnamoko, Nonso, Raafat, Safanah Mudheher, Al-Mhdawi, M. K. S., Humaidi, Amjad J
We present an enhanced YOLOv8 real-time vehicle detection and classification framework for estimating carbon emissions in urban environments. The system enhances the YOLOv8 architecture to detect, segment, and track vehicles from live traffic video streams. Once a vehicle is localized, a dedicated deep learning-based identification module is employed to recognize license plates and classify vehicle types. Since YOLOv8 lacks the built-in capacity for fine-grained recognition tasks such as reading license plates or determining vehicle attributes beyond class labels, our framework incorporates a hybrid pipeline where each detected vehicle is tracked and its bounding box is cropped and passed to a deep Optical Character Recognition (OCR) module. This OCR system, composed of multiple convolutional neural network (CNN) layers, is trained specifically for character-level detection and license plate decoding under varied conditions such as motion blur, occlusion, and diverse font styles. Additionally, the recognized plate information is validated using a real-time API that cross-references an external vehicle registration database to ensure accurate classification and emission estimation. This multi-stage approach enables precise, automated calculation of per-vehicle carbon emissions. Extensive evaluation was conducted using a diverse vehicle dataset enriched with segmentation masks and annotated license plates. The YOLOv8 detector achieved a mean Average Precision (mAP@0.5) of approximately 71% for bounding boxes and 70% for segmentation masks. Character-level OCR accuracy reached up to 99% with the best-performing CNN model. These results affirm the feasibility of combining real-time object detection with deep OCR for practical deployment in smart transportation systems, offering a scalable solution for automated, vehicle-specific carbon emission monitoring.
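The "@0.5" in mAP@0.5 refers to an intersection-over-union (IoU) threshold: a predicted box counts as a match only if its IoU with a ground-truth box is at least 0.5. The sketch below shows that check in isolation; the function names and the box coordinates are assumptions for illustration, not the authors' evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at_threshold(predicted, ground_truth, threshold=0.5):
    """True if the predicted box overlaps the ground truth enough to match."""
    return iou(predicted, ground_truth) >= threshold


# Hypothetical vehicle boxes in pixel coordinates.
truth = (0, 0, 10, 10)
print(matches_at_threshold((2, 0, 12, 10), truth))  # IoU = 80/120, prints True
print(matches_at_threshold((5, 0, 15, 10), truth))  # IoU = 50/150, prints False
```

A full mAP computation additionally ranks detections by confidence and averages precision over recall levels and classes; only the matching rule is sketched here.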
Optimizing Multilingual Text-To-Speech with Accents & Emotions
Pawar, Pranav, Dwivedi, Akshansh, Boricha, Jenish, Gohil, Himanshu, Dubey, Aditya
Although state-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual environments, synthesizing speech with correct multilingual accents (especially for Indic languages) and context-relevant emotions still poses difficulty owing to cultural nuance discrepancies in current frameworks. This paper introduces a new TTS architecture that integrates accent and transliteration preservation with multi-scale emotion modelling, tuned in particular for Hindi and Indian English accents. Our approach extends the Parler-TTS model by integrating a language-specific phoneme alignment hybrid encoder-decoder architecture and culture-sensitive emotion embedding layers trained on native speaker corpora, as well as incorporating dynamic accent code switching with residual vector quantization. Quantitative tests demonstrate 23.7% improvement in accent accuracy (Word Error Rate reduction from 15.4% to 11.8%) and 85.3% emotion recognition accuracy from native listeners, surpassing METTS and VECL-TTS baselines. The novelty of the system is that it can mix code in real time - generating statements such as "Namaste, let's talk about
An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW
Mehta, Prateek, Patil, Anasuya
Knowledge extraction just by listening to sounds is known as a distinctive property. Visually impaired people are dependent solely on Braille books and audio recordings provided by NGOs. Owing to many constraints in the above two approaches, blind people can't access the book of their choice. Speech is a more effective means of communication than text here, as blind and visually impaired persons can easily respond to sounds. This paper aims to develop an accurate, reliable, cost-effective, and user-friendly optical character recognition (OCR) based speech synthesis system.
The best part of the future is finally having a permanent replacement for this annoying technology
We don't have flying cars, jetpacks haven't replaced walking, and I have not seen a single sign that we're all pivoting to wearing matching silver jumpsuits. The future is kind of lame. SwiftScan VIP is a scanner tool that basically replaces half of your old office equipment with an app that works on iOS and Android devices. It's also a lot cheaper than some desktop scanners, and you don't need to replace it every few years. During this limited-time sale, you can get a SwiftScan VIP Lifetime Subscription for only 41.99 (it's usually 199.99).
Reading in the Dark with Foveated Event Vision
Brander, Carl, Cioffi, Giovanni, Messikommer, Nico, Scaramuzza, Davide
Current smart glasses equipped with RGB cameras struggle to perceive the environment in low-light and high-speed motion scenarios due to motion blur and the limited dynamic range of frame cameras. Additionally, capturing dense images with a frame camera requires large bandwidth and power consumption, consequently draining the battery faster. These challenges are especially relevant for developing algorithms that can read text from images. In this work, we propose a novel event-based Optical Character Recognition (OCR) approach for smart glasses. By using the eye gaze of the user, we foveate the event stream to significantly reduce bandwidth by around 98% while exploiting the benefits of event cameras in high-dynamic and fast scenes. Our proposed method performs deep binary reconstruction trained on synthetic data and leverages multi-modal LLMs for OCR, outperforming traditional OCR solutions. Our results demonstrate the ability to read text in low-light environments where RGB cameras struggle while using up to 2,400 times less bandwidth than a wearable RGB camera.
Sight Guide: A Wearable Assistive Perception and Navigation System for the Vision Assistance Race in the Cybathlon 2024
Pfreundschuh, Patrick, Cioffi, Giovanni, von Einem, Cornelius, Wyss, Alexander, van de Venn, Hans Wernher, Cadena, Cesar, Scaramuzza, Davide, Siegwart, Roland, Darvishy, Alireza
Visually impaired individuals face significant challenges navigating and interacting with unknown situations, particularly in tasks requiring spatial awareness and semantic scene understanding. To accelerate the development and evaluate the state of technologies that enable visually impaired people to solve these tasks, the Vision Assistance Race (VIS) at the Cybathlon 2024 competition was organized. In this work, we present Sight Guide, a wearable assistive system designed for the VIS. The system processes data from multiple RGB and depth cameras on an embedded computer that guides the user through complex, real-world-inspired tasks using vibration signals and audio commands. Our software architecture integrates classical robotics algorithms with learning-based approaches to enable capabilities such as obstacle avoidance, object detection, optical character recognition, and touchscreen interaction. In a testing environment, Sight Guide achieved a 95.7% task success rate, and further demonstrated its effectiveness during the Cybathlon competition. This work provides detailed insights into the system design, evaluation results, and lessons learned, and outlines directions towards a broader real-world applicability.
In 2020, approximately 43 million people worldwide were blind, with an additional 295 million suffering from moderate to severe visual impairments [1]. Despite advancements in medical treatments [2], these numbers are projected to rise by 2050 [1]. For individuals with visual impairments, the lack of visual information about their surroundings poses substantial challenges in daily activities. While infrastructure adaptations, such as making public transport more accessible, can mitigate some difficulties, many everyday tasks remain impracticable for blind individuals. To enhance their autonomy, most visually impaired people rely on assistive technologies.
Assistive technologies in this context are hardware- and software-based solutions that help people with disabilities to overcome or to reduce barriers in their lives. Although a variety of vision aids leveraging computer vision and artificial intelligence are available on the market, these solutions are typically limited to specific tasks like text-to-speech conversion [3], description of the surroundings [4], or navigation assistance [5], [6].
QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation
Wasfy, Ahmed, Nacar, Omer, Elkhateb, Abdelakreem, Reda, Mahmoud, Elshehy, Omar, Ammar, Adel, Boulila, Wadii
The inherent complexities of Arabic script, including its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
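For readers unfamiliar with the two error rates quoted above, the following minimal sketch computes WER and CER as edit distance divided by reference length. It is a generic illustration, not the Qari-OCR evaluation script; the function names and sample strings are assumptions, and real evaluations often add text normalization steps that are omitted here.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Made-up OCR output with one wrong character in the last word.
print(wer("the cat sat", "the cat sit"))  # one of three words is wrong
print(cer("abcd", "abed"))                # prints 0.25
```

Because the character-level comparison works on code points, a missing or wrong diacritic counts as an error in CER even when the base letters are all correct.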