MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts
Piryani, Bhawna, Mozafari, Jamshid, Abdallah, Abdelrahman, Doucet, Antoine, Jatowt, Adam
arXiv.org Artificial Intelligence
Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors -- imperfect extraction of the text, including character insertion, deletion, and permutation -- can significantly impact downstream tasks like question answering (QA). In this work, we introduce MultiOCR-QA, a multilingual QA dataset designed to analyze the effects of OCR noise on the performance of QA systems. The MultiOCR-QA dataset comprises 60K question-answer pairs covering three languages: English, French, and German. The dataset is curated from OCR-ed historical documents, allowing for the evaluation of OCR-induced challenges in question answering. We evaluate MultiOCR-QA across various levels and types of OCR errors to assess the robustness of LLMs in handling real-world digitization errors. Our findings show that QA systems are highly prone to OCR-induced errors and exhibit performance degradation on noisy OCR text.
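The OCR error types named above (character insertion, deletion, and permutation/substitution) can be simulated to probe QA robustness. The following sketch is illustrative only and is not the dataset's actual construction pipeline; the function name, error rate, and uniform error model are assumptions for demonstration.

```python
import random

def inject_ocr_noise(text: str, error_rate: float = 0.1, seed: int = 0) -> str:
    """Corrupt text with OCR-style character errors: random insertions,
    deletions, and substitutions at the given per-character rate.
    (Hypothetical helper; not the MultiOCR-QA generation procedure.)"""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        if rng.random() < error_rate:
            op = rng.choice(["insert", "delete", "substitute"])
            if op == "insert":
                # Keep the character, then insert a spurious one after it.
                out.append(ch)
                out.append(rng.choice(alphabet))
            elif op == "substitute":
                out.append(rng.choice(alphabet))
            # "delete": drop the character entirely.
        else:
            out.append(ch)
    return "".join(out)

clean = "optical character recognition"
noisy = inject_ocr_noise(clean, error_rate=0.3, seed=42)
```

Feeding such progressively noisier contexts to an LLM-based QA system, then comparing answers against those obtained on clean text, is one way to measure the degradation the abstract describes.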
Feb-23-2025