Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline

Madhavi, Hrishit, Cherian, Jacob, Khamkar, Yuvraj, Bhagat, Dhananjay

May-19-2025–arXiv.org Artificial Intelligence

With the abundance of information in today's digital world, processing voluminous text from news articles, reports, and web pages efficiently is a major challenge. Text summarization addresses this problem by producing brief, informative summaries of lengthy documents, saving end users both time and mental effort [1]. Traditional summarization methods are either extractive (selecting the most important sentences from the source text) or abstractive (generating new sentences that capture the core meaning); this project instead presents a multi-step NLP pipeline that extends beyond summarization alone [1]. The pipeline starts with Optical Character Recognition (OCR), performed with Tesseract through the Pytesseract wrapper. This module produces machine-readable text from images and supports several languages, including English, Hindi, Tamil, Urdu, Bengali, and Telugu [1]. The extracted text then passes through a chain of Natural Language Processing (NLP) and Machine Learning (ML) modules for more in-depth analysis. The system combines state-of-the-art NLP features to improve text comprehension and processing.
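The first two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the helper names (`tesseract_lang_arg`, `ocr_image`, `extractive_summary`) are hypothetical, the OCR step assumes the `pytesseract` and `Pillow` packages plus a Tesseract install with the relevant language data, and the frequency-based scorer is a stand-in baseline for the extractive approach mentioned above. Its simple tokenization is geared toward English and would need adapting for Indic scripts.

```python
import re
from collections import Counter

# Tesseract language-data codes for the languages named in the abstract.
TESSERACT_LANGS = {
    "English": "eng",
    "Hindi": "hin",
    "Tamil": "tam",
    "Urdu": "urd",
    "Bengali": "ben",
    "Telugu": "tel",
}


def tesseract_lang_arg(languages):
    """Join language names into Tesseract's 'eng+hin' style lang argument."""
    return "+".join(TESSERACT_LANGS[name] for name in languages)


def ocr_image(path, languages=("English",)):
    """Extract text from an image file (hypothetical helper).

    The third-party imports are deferred so the rest of the sketch
    can be used without Tesseract installed.
    """
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(
            image, lang=tesseract_lang_arg(languages)
        )


def extractive_summary(text, max_sentences=2):
    """Naive extractive baseline (an assumption, not the paper's method).

    Scores each sentence by the average document frequency of its words
    and returns the top sentences in their original order.
    """
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()
    ]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score(sentences[i]),
        reverse=True,
    )
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

With Tesseract available, a call such as `extractive_summary(ocr_image("scan.png", ["English", "Hindi"]))` would chain the two stages; `scan.png` is a placeholder path.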

large language model, machine learning, natural language


Country:
- Asia > India > Maharashtra > Pune (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Vision > Optical Character Recognition (0.73)
  - Natural Language
    - Text Processing (1.00)
    - Machine Translation (0.73)
    - Large Language Model (0.70)
