Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts
Muhammad Abdullah Sohail, Salaar Masood, Hamza Iqbal
arXiv.org Artificial Intelligence
This study investigates the potential of Large Language Models (LLMs), particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik, with English serving as a benchmark. Using a meticulously curated dataset of 2,520 images incorporating controlled variations in text length, font size, background color, and blur, the research simulates diverse real-world challenges. Results emphasize the limitations of zero-shot LLM-based OCR, particularly for linguistically complex scripts, highlighting the need for annotated datasets and fine-tuned models. This work underscores the urgency of addressing accessibility gaps in text digitization, paving the way for inclusive and robust OCR solutions for underserved languages.

Synthetic datasets further enhanced these models by mitigating the scarcity of annotated data for specific languages [3, 12]. Progress in low-resource OCR remains limited, with studies focusing on Indic scripts, while structurally complex scripts, such as those with ligatures or modified Cyrillic alphabets, remain underrepresented [12]. Similarly, few studies have explored OCR pipelines for Urdu and Bengali [1, 2], highlighting the need for customized solutions. LLMs have recently emerged as a promising OCR solution, leveraging multimodal capabilities for dynamic adaptation to diverse scripts and challenging visual conditions [10, 13]. Studies have demonstrated LLMs' ability to process textual content from
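Benchmarking OCR output against ground-truth transcriptions is typically done with character error rate (CER), i.e. edit distance normalized by reference length. The excerpt above does not state the paper's exact metric, so the following is a minimal illustrative sketch of a standard CER computation in plain Python (the function names are our own):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            # cost of deletion, insertion, or substitution/match
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)

# Example: one substitution in a 4-character reference -> CER 0.25
print(cer("abcd", "abcf"))
```

Because CER works on raw Unicode code points, the same routine applies unchanged to Urdu, Tajik Cyrillic, or Albanian text, though scripts with ligatures may warrant normalization before scoring.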
Dec-20-2024