Optical Character Recognition
E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition
--Optical Character Recognition (OCR) in multilingual, noisy, and diverse real-world images remains a significant challenge for optical character recognition systems. With the rise of Large Vision-Language Models (L VLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system built specifically optimized for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art L VLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand annotated dataset of multilingual (54 languages) images. Our benchmark covers a broad range of metrics including accuracy, semantic consistency, language coverage, computational efficiency (latency, memory, GPU usage), and deployment cost. T o better reflect real-world applicability, we also conducted edge case deployment analysis, evaluating model performance on CPU only environments. Among the results, Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best overall F1 score (0.46) and outperformed others in efficiency, processing images 35 faster (0.17 seconds per image on average) and at less than 0.01 of the cost (0.006 Our findings demonstrate that the most optimal OCR systems for edge deployment are the traditional ones even in the era of LLMs due to their low compute requirements, low latency, and very high affordability. Optical Character Recognition (OCR) is a cornerstone technology for digitizing documents, automating data entry, and extracting information from images containing typed, handwritten, or printed text.
- Information Technology (0.46)
- Law (0.40)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation
Wang, Kesen, Toibazar, Daulet, Moreno, Pedro J.
We present an end-to-end, self-evolving adversarial workflow for long-context Question-Answer (QA) Generation in Arabic. By orchestrating multiple specialized LVLMs: a question generator, an evaluator, and a swarm of answer generators, our system iteratively refines its own performance without any human intervention. Starting from raw, multi-page Arabic documents across diverse domains, the question generator produces fine-grained, context-aware queries to be tackled by the answer generator swarm, and the evaluator assesses and feeds back quality metrics. This closed-loop cycle enables continuous learning: low-confidence outputs trigger automated re-generation and model updates, progressively enhancing question difficulty and relevance. Moreover, we set the quality metrics as a tunable hyperparameter, enabling question generation at controllable and customizable difficulty levels. We release AraLongBench, a large-scale Arabic benchmark of single- and multi-page challenges spanning hundreds of pages, and demonstrate that our self-evolving workflow substantially outperform static pipelines, markedly boosting the long-context comprehension capabilities of leading Arabic Large Vision Language Models (LVLMs). Lastly, we also meticulously architect a fully automated agentic workflow for long-context Arabic document collection.
- Research Report (0.82)
- Workflow (0.77)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
- (3 more...)
Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR
Vempati, Shashank, Anand, Nishit, Talebailkar, Gaurav, Garai, Arpan, Arora, Chetan
Conventional optical character recognition (OCR) techniques segmented each character and then recognized. This made them prone to error in character segmentation, and devoid of context to exploit language models. Advances in sequence to sequence translation in last decade led to modern techniques first detecting words and then inputting one word at a time to a model to directly output full words as sequence of characters. This allowed better utilization of language models and bypass error-prone character segmentation step. We observe that the above transition in style has moved the bottleneck in accuracy to word segmentation. Hence, in this paper, we propose a natural and logical progression from word level OCR to line-level OCR. The proposal allows to bypass errors in word detection, and provides larger sentence context for better utilization of language models. We show that the proposed technique not only improves the accuracy but also efficiency of OCR. Despite our thorough literature survey, we did not find any public dataset to train and benchmark such shift from word to line-level OCR. Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experimentation revealed a notable end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of transitioning towards line-level OCR, especially for document images. We also report a 4 times improvement in efficiency compared to word-based pipelines. With continuous improvements in large language models, our methodology also holds potential to exploit such advances. Project Website: https://nishitanand.github.io/line-level-ocr-website
- Europe > Switzerland (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
A Sobel-Gradient MLP Baseline for Handwritten Character Recognition
We revisit the classical Sobel operator to ask a simple question: Are first-order edge maps sufficient to drive an all-dense multilayer perceptron (MLP) for handwritten character recognition (HCR), as an alternative to convolutional neural networks (CNNs)? Using only horizontal and vertical Sobel derivatives as input, we train an MLP on MNIST and EMNIST Letters. Despite its extreme simplicity, the resulting network reaches 98% accuracy on MNIST digits and 92% on EMNIST letters -- approaching CNNs while offering a smaller memory footprint and transparent features. Our findings highlight that much of the class-discriminative information in handwritten character images is already captured by first-order gradients, making edge-aware MLPs a compelling option for HCR.
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.57)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)
Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Jayatilleke, Nevidu, de Silva, Nisansa
Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.
- Asia > Sri Lanka (0.05)
- Asia > Singapore (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (11 more...)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
- (2 more...)
Improving OCR using internal document redundancy
Belzarena, Diego, Mowlavi, Seginus, Artola, Aitor, Mariño, Camilo, Gardella, Marina, Ramírez, Ignacio, Tadros, Antoine, He, Roy, Bottaioli, Natalia, Rajaei, Boshra, Randall, Gregory, Morel, Jean-Michel
Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervised method by leveraging the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and suggest better clustering. To this aim, we introduce an extended Gaussian Mixture Model (GMM) by alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
- South America > Uruguay (0.14)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Asia > Middle East > Iran > Razavi Khorasan Province > Mashhad (0.04)
- Asia > China > Hong Kong > Kowloon (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.87)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.77)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.65)
Calibrated Structured Prediction
In user-facing applications, displaying calibrated confidence measures---probabilities that correspond to true frequency---can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.
- Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.93)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)
Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback
Chen, Jingyi, Byun, Ju Seung, Elsner, Micha, Wang, Pichao, Perrault, Andrew
Diffusion models produce high-fidelity speech but are inefficient for real-time use due to long denoising steps and challenges in modeling intonation and rhythm. To improve this, we propose Diffusion Loss-Guided Policy Optimization (DLPO), an RLHF framework for TTS diffusion models. DLPO integrates the original training loss into the reward function, preserving generative capabilities while reducing inefficiencies. Using naturalness scores as feedback, DLPO aligns reward optimization with the diffusion model's structure, improving speech quality. We evaluate DLPO on WaveGrad 2, a non-autoregressive diffusion-based TTS model. Results show significant improvements in objective metrics (UTMOS 3.65, NISQA 4.02) and subjective evaluations, with DLPO audio preferred 67\% of the time. These findings demonstrate DLPO's potential for efficient, high-quality diffusion TTS in real-time, resource-limited settings.
- North America > United States > Ohio (0.05)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.52)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.51)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.41)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
Finally, a scanner that fits in your pocket and never jams
Instant scans and searchable PDFs are in. SwiftScan VIP puts the power of a full office scanner into your smartphone--and adds some cool extras, too. For a limited time, get lifetime access to SwiftScan for just 41.99 with promo code TAKE30. SwiftScan captures ultra-clear documents with automatic edge detection, color correction, and blur reduction--all starting at 200 dpi. It handles everything from receipts and handwritten notes to contracts, barcodes, business cards, and whiteboards.