Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts

Muhammad Abdullah Sohail, Salaar Masood, Hamza Iqbal

arXiv.org Artificial Intelligence 

This study investigates the potential of Large Language Models (LLMs), particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik, with English serving as a benchmark. Using a meticulously curated dataset of 2,520 images incorporating controlled variations in text length, font size, background color, and blur, the research simulates diverse real-world challenges. Results emphasize the limitations of zero-shot LLM-based OCR, particularly for linguistically complex scripts, highlighting the need for annotated datasets and fine-tuned models. This work underscores the urgency of addressing accessibility gaps in text digitization, paving the way for inclusive and robust OCR solutions for underserved languages.

Synthetic datasets further enhanced these models by mitigating the scarcity of annotated data for specific languages [3, 12]. Progress in low-resource OCR remains limited, with studies focusing on Indic scripts, while structurally complex scripts--such as those with ligatures or modified Cyrillic alphabets--remain underrepresented [12]. Similarly, few studies have explored OCR pipelines for Urdu and Bengali [1, 2], highlighting the need for customized solutions. LLMs have recently emerged as a promising OCR solution, leveraging multimodal capabilities for dynamic adaptation to diverse scripts and challenging visual conditions [10, 13]. Studies have demonstrated LLMs' ability to process textual content from
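OCR benchmarks of this kind are commonly scored with character error rate (CER), i.e. the Levenshtein edit distance between the model's transcription and the ground truth, normalized by reference length. The abstract does not name the paper's exact metric, so the following is a minimal stand-alone sketch of CER scoring, not the authors' evaluation code:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length.

    Works on Unicode strings, so scripts like Urdu or Cyrillic-based
    Tajik are handled character-by-character.
    """
    m, n = len(reference), len(hypothesis)
    # Row-by-row dynamic-programming edit distance.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, 1)

# Identical strings score 0.0; one substitution in three characters scores 1/3.
print(cer("abc", "abc"))  # 0.0
print(cer("abc", "axc"))  # 0.3333...
```

A lower CER is better; zero-shot LLM output on ligature-heavy scripts would be expected to show markedly higher CER than on English under this metric.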
