Optical Character Recognition
Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding
Al-Homoud, Haneen, Ibrahim, Asma, Al-Jubran, Murtadha, Al-Otaibi, Fahad, Al-Harbi, Yazeed, Toibazar, Daulet, Wang, Kesen, Moreno, Pedro J.
Abstract--Cross-Lingual SynthDocs is a large-scale synthetic corpus designed to address the scarcity of Arabic resources for Optical Character Recognition (OCR) and Document Understanding (DU). The dataset comprises over 2.5 million of samples, including 1.5 million textual data, 270K fully annotated tables, and hundred thousands of real data based charts. Our pipeline leverages authentic scanned backgrounds, bilingual layouts, and diacritic aware fonts to capture the typographic and structural complexity of Arabic documents. In addition to text, the corpus includes variety of rendered styles for charts and tables. Finetuning Qwen-2.5-VL on SynthDocs yields consistent improvements in Word Error Rate (WER) and Character Error Rate (CER) in terms of OCR across multiple public Arabic benchmarks, Tree-Edit Distance Similarity (TEDS) and Chart Extraction Score (CharT eX) improved as well in other modalities. SynthDocs provides a scalable, visually realistic resource for advancing research in multilingual document analysis.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Middle East > Saudi Arabia > Riyadh Province > Riyadh (0.04)
- Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.87)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Seeing Straight: Document Orientation Detection for Efficient OCR
Goswami, Suranjan, Ravi, Abhinav, Kolla, Raja, Faraz, Ali, Khan, Shaharukh, Akash, null, Khatri, Chandra, Agarwal, Shubham
Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in the real world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR) where misalignment commonly arises due to user errors, particularly incorrect base orientations of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 Indic mid to low-resource languages. We also present a fast, robust and lightweight rotation classification pipeline built on the vision encoder of Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for 4-class rotation task in a standalone fashion. Our method achieves near-perfect 96% and 92% accuracy on identifying the rotations respectively on both the datasets. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance: closed-source (up to 14%) and open-weights models (up to 4x) in the simulated real-world setting.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.87)
OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment
We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PNG images at 300 DPI, sampled from diverse real-world scenarios, including academic papers, textbooks, e-books, and multilingual documents. Each document has been processed using state-of-the-art Vision-Language Models (VLMs) and manually annotated with quality scores using a 4-level scoring system (1: Excellent, 2: Good, 3: Fair, 4: Poor). The dataset includes detailed source information, annotation guidelines, and representative cases across various difficulty levels. OCR-Quality addresses the critical need for reliable OCR quality assessment in real-world applications and provides a valuable benchmark for training and evaluating OCR verification systems. The dataset is publicly available at https://huggingface.co/datasets/Aslan-mingye/OCR-Quality .
- North America > United States > California (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.49)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)
AI based signage classification for linguistic landscape studies
Jiang, Yuqin, Jiang, Song, Algrim, Jacob, Harms, Trevor, Koenen, Maxwell, Lan, Xinya, Li, Xingyu, Lin, Chun-Han, Liu, Jia, Sun, Jiayang, Zenger, Henry
Linguistic Landscape (LL) research traditionally relies on manual photography and annotation of public signages to examine distribution of languages in urban space. While such methods yield valuable findings, the process is time-consuming and difficult for large study areas. This study explores the use of AI powered language detection method to automate LL analysis. Using Honolulu Chinatown as a case study, we constructed a georeferenced photo dataset of 1,449 images collected by researchers and applied AI for optical character recognition (OCR) and language classification. We also conducted manual validations for accuracy checking. This model achieved an overall accuracy of 79%. Five recurring types of mislabeling were identified, including distortion, reflection, degraded surface, graffiti, and hallucination. The analysis also reveals that the AI model treats all regions of an image equally, detecting peripheral or background texts that human interpreters typically ignore. Despite these limitations, the results demonstrate the potential of integrating AI-assisted workflows into LL research to reduce such time-consuming processes. However, due to all the limitations and mis-labels, we recognize that AI cannot be fully trusted during this process. This paper encourages a hybrid approach combining AI automation with human validation for a more reliable and efficient workflow.
- North America > United States > Hawaii > Honolulu County > Honolulu (0.29)
- North America > United States > District of Columbia > Washington (0.04)
- Asia > China (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Applied AI (1.00)
- Information Technology > Communications > Mobile (0.68)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.54)
A popular new PDF editor is now only 40 for life (reg. 119)
When you purchase through links in our articles, we may earn a small commission. A popular new PDF editor is now only $40 for life (reg. Instead of paying for Acrobat indefinitely, now you can get PDF Agile for life on sale for $39.99 (reg. PDF Agile also includes annotation and collaboration tools, so you can highlight, comment, or mark up files during review. This plan lasts for life with free updates and no recurring costs.
- Information Technology > Security & Privacy (0.62)
- Leisure & Entertainment > Games > Computer Games (0.52)
- Information Technology > Hardware (0.63)
- Information Technology > Security & Privacy (0.62)
- Information Technology > Communications (0.50)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.32)
KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers
Ameen, Mohd Ruhul, Islam, Akif, Aktar, Farjana, Rafat, M. Saifuzzaman
In Bangladesh, many farmers continue to face challenges in accessing timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework, designed specifically for Bengali-speaking farmers. The system aggregates authoritative agricultural handbooks, extension manuals, and NGO publications; applies Optical Character Recognition (OCR) and document-parsing pipelines to digitize and structure the content; and indexes this corpus in a vector database for efficient semantic retrieval. Through a simple phone-based interface, farmers can call the system to receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant content, a large language model (Gemma 3-4B) generates a context-grounded response, and text-to-speech delivers the answer in natural spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries covering crop management, disease control, and cultivation practices. Compared to the KisanQRS benchmark, the system achieved a composite score of 4.53 (vs. 3.13) on a 5-point scale, a 44.7% improvement, with especially large gains in contextual richness (+367%) and completeness (+100.4%), while maintaining comparable relevance and technical specificity. Semantic similarity analysis further revealed a strong correlation between retrieved context and answer quality, emphasizing the importance of grounding generative responses in curated documentation. KrishokBondhu demonstrates the feasibility of integrating call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers, paving the way toward a fully AI-driven agricultural advisory ecosystem.
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.05)
- North America > United States > West Virginia > Cabell County > Huntington (0.04)
- North America > Canada (0.04)
- Asia > India > West Bengal (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.89)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
Automated Wicket-Taking Delivery Segmentation and Weakness Detection in Cricket Videos Using OCR-Guided YOLOv8 and Trajectory Modeling
Ferdous, Mst Jannatun, Billah, Masum, Karmoker, Joy, Ameen, Mohd Ruhul, Islam, Akif, Faruqe, Md. Omar
This paper presents an automated system for cricket video analysis that leverages deep learning techniques to extract wicket-taking deliveries, detect cricket balls, and model ball trajectories. The system employs the YOLOv8 architecture for pitch and ball detection, combined with optical character recognition (OCR) for scorecard extraction to identify wicket-taking moments. Through comprehensive image preprocessing, including grayscale transformation, power transformation, and morphological operations, the system achieves robust text extraction from video frames. The pitch detection model achieved 99.5% mean Average Precision at 50% IoU (mAP50) with a precision of 0.999, while the ball detection model using transfer learning attained 99.18% mAP50 with 0.968 precision and 0.978 recall. The system enables trajectory modeling on detected pitches, providing data-driven insights for identifying batting weaknesses. Experimental results on multiple cricket match videos demonstrate the effectiveness of this approach for automated cricket analytics, offering significant potential for coaching and strategic decision-making.
- North America > United States > West Virginia > Cabell County > Huntington (0.04)
- Asia > Bangladesh (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.68)
VOLTAGE: A Versatile Contrastive Learning based OCR Methodology for ultra low-resource scripts through Auto Glyph Feature Extraction
Sharma, Prawaal, Goyal, Poonam, Sharma, Vidisha, Goyal, Navneet
UNESCO has classified 2500 out of 7000 languages spoken worldwide as endangered. Attrition of a language leads to loss of traditional wisdom, folk literature, and the essence of the community that uses it. It is therefore imperative to bring digital inclusion to these languages and avoid its extinction. Low resource languages are at a greater risk of extinction. Lack of unsupervised Optical Character Recognition(OCR) methodologies for low resource languages is one of the reasons impeding their digital inclusion. We propose VOLTAGE - a contrastive learning based OCR methodology, leveraging auto-glyph feature recommendation for cluster-based labelling. We augment the labelled data for diversity and volume using image transformations and Generative Adversarial Networks. Voltage has been designed using Takri - a family of scripts used in 16th to 20th century in the Himalayan regions of India. We present results for Takri along with other Indic scripts (both low and high resource) to substantiate the universal behavior of the methodology. An accuracy of 95% for machine printed and 87% for handwritten samples on Takri script has been achieved. We conduct baseline and ablation studies along with building downstream use cases for Takri, demonstrating the usefulness of our work.
- Asia > India > Rajasthan (0.04)
- Asia > India > Maharashtra > Pune (0.04)
- Asia > India > Himachal Pradesh (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
A Large-Language-Model Assisted Automated Scale Bar Detection and Extraction Framework for Scanning Electron Microscopic Images
Chen, Yuxuan, Yang, Ruotong, Zhang, Zhengyang, Ahmed, Mehreen, Wang, Yanming
Microscopic characterizations, such as Scanning Electron Microscopy (SEM), are widely used in scientific research for visualizing and analyzing microstructures. Determining the scale bars is an important first step of accurate SEM analysis; however, currently, it mainly relies on manual operations, which is both time-consuming and prone to errors. To address this issue, we propose a multi-modal and automated scale bar detection and extraction framework that provides concurrent object detection, text detection and text recognition with a Large Language Model (LLM) agent. The proposed framework operates in four phases; i) Automatic Dataset Generation (Auto-DG) model to synthesize a diverse dataset of SEM images ensuring robust training and high generalizability of the model, ii) scale bar object detection, iii) information extraction using a hybrid Optical Character Recognition (OCR) system with DenseNet and Convolutional Recurrent Neural Network (CRNN) based algorithms, iv) an LLM agent to analyze and verify accuracy of the results. The proposed model demonstrates a strong performance in object detection and accurate localization with a precision of 100%, recall of 95.8%, and a mean Average Precision (mAP) of 99.2% at IoU=0.5 and 69.1% at IoU=0.5:0.95. The hybrid OCR system achieved 89% precision, 65% recall, and a 75% F1 score on the Auto-DG dataset, significantly outperforming several mainstream standalone engines, highlighting its reliability for scientific image analysis. The LLM is introduced as a reasoning engine as well as an intelligent assistant that suggests follow-up steps and verifies the results. This automated method powered by an LLM agent significantly enhances the efficiency and accuracy of scale bar detection and extraction in SEM images, providing a valuable tool for microscopic analysis and advancing the field of scientific imaging.
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech
Kolani, Yakov, Melichov, Maxim, Calev, Cobi, Alper, Morris
Real-time text-to-speech (TTS) for Modern Hebrew is challenging due to the language's orthographic complexity. Existing solutions ignore crucial phonetic features such as stress that remain underspecified even when vowel marks are added. To address these limitations, we introduce Phonikud, a lightweight, open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully-specified IPA transcriptions. Our approach adapts an existing diacritization model with lightweight adaptors, incurring negligible additional latency. We also contribute the ILSpeech dataset of transcribed Hebrew speech with IPA annotations, serving as a benchmark for Hebrew G2P, as training data for TTS systems, and enabling audio-to-IPA for evaluating TTS performance while capturing important phonetic details. Our results demonstrate that Phonikud G2P conversion more accurately predicts phonemes from Hebrew text compared to prior methods, and that this enables training of effective real-time Hebrew TTS models with superior speed-accuracy trade-offs. We release our code, data, and models at https: //phonikud.github.io.
- North America > United States > Texas > Dallas County > Dallas (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.71)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.71)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.61)
- (3 more...)