Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline
Madhavi, Hrishit, Cherian, Jacob, Khamkar, Yuvraj, Bhagat, Dhananjay
arXiv.org Artificial Intelligence
With the abundance of information in today's digital world, processing voluminous text from news articles, reports, and web pages efficiently is a major challenge. Text summarization addresses this problem by producing brief, informative summaries of lengthy documents, saving end users both time and mental effort [1]. Traditional summarization methods are either extractive (selecting key sentences from the source text) or abstractive (generating new sentences that capture the core meaning); the current project instead outlines a holistic, multi-step NLP pipeline that extends beyond summarization alone [1]. The pipeline starts with Optical Character Recognition (OCR), implemented with Tesseract (via Pytesseract). This module yields machine-readable text from images and handles several languages, including English, Hindi, Tamil, Urdu, Bengali, and Telugu [1]. The extracted text then passes through a chain of Natural Language Processing (NLP) and Machine Learning (ML) modules for more in-depth analysis. The main elements of this pipeline are: The system combines state-of-the-art NLP features to boost text comprehension and processing.
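The OCR stage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It only assumes that Pytesseract and Pillow are installed and that the Tesseract language data for the listed languages is available locally. The helper names `build_lang_arg` and `extract_text` and the `TESSERACT_CODES` table are hypothetical, introduced here for illustration.

```python
from typing import Iterable

# Tesseract language-data codes for the languages named in the abstract.
# The mapping table itself is an illustrative assumption, not from the paper.
TESSERACT_CODES = {
    "English": "eng",
    "Hindi": "hin",
    "Tamil": "tam",
    "Urdu": "urd",
    "Bengali": "ben",
    "Telugu": "tel",
}


def build_lang_arg(languages: Iterable[str]) -> str:
    """Join language names into Tesseract's 'eng+hin' style lang argument."""
    try:
        return "+".join(TESSERACT_CODES[name] for name in languages)
    except KeyError as exc:
        raise ValueError(f"Unsupported language: {exc.args[0]}") from None


def extract_text(image_path: str, languages: Iterable[str] = ("English",)) -> str:
    """Run OCR on one image file (hypothetical helper, not from the paper)."""
    # Imported lazily so this module loads even without the OCR dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=build_lang_arg(languages))
```

A call such as `extract_text("scan.png", ["English", "Hindi"])` (with a hypothetical file name) would pass `lang="eng+hin"` to Tesseract, and the returned string would then feed the downstream NLP modules.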
May-19-2025