Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts
Muhammad Abdullah Sohail, Salaar Masood, Hamza Iqbal
arXiv.org Artificial Intelligence
This study investigates the potential of Large Language Models (LLMs), particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik, with English serving as a benchmark. Using a meticulously curated dataset of 2,520 images incorporating controlled variations in text length, font size, background color, and blur, the research simulates diverse real-world challenges. Results emphasize the limitations of zero-shot LLM-based OCR, particularly for linguistically complex scripts, highlighting the need for annotated datasets and fine-tuned models. This work underscores the urgency of addressing accessibility gaps in text digitization, paving the way for inclusive and robust OCR solutions for underserved languages.

Synthetic datasets further enhanced these models by mitigating the scarcity of annotated data for specific languages [3, 12]. Progress in low-resource OCR remains limited, with studies focusing on Indic scripts, while structurally complex scripts, such as those with ligatures or modified Cyrillic alphabets, remain underrepresented [12]. Similarly, few studies have explored OCR pipelines for Urdu and Bengali [1, 2], highlighting the need for customized solutions. LLMs have recently emerged as a promising OCR solution, leveraging multimodal capabilities for dynamic adaptation to diverse scripts and challenging visual conditions [10, 13]. Studies have demonstrated LLMs' ability to process textual content from
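Benchmarking OCR output against ground-truth transcriptions is typically done with character error rate (CER), i.e. edit distance normalized by reference length. The excerpt above does not state the paper's exact metric, so the following is a minimal illustrative sketch of a standard CER computation in plain Python (the function names are our own):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            # cost of deletion, insertion, or substitution/match
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)

# Example: one substitution in a 4-character reference -> CER 0.25
print(cer("abcd", "abcf"))
```

Because CER works on raw Unicode code points, the same routine applies unchanged to Urdu, Tajik Cyrillic, or Albanian text, though scripts with ligatures may warrant normalization before scoring.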
Dec-20-2024