Optical Character Recognition
Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak'wala Legacy Texts
Agarwal, Milind, Rosenblum, Daisy, Anastasopoulos, Antonios
Kwak'wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak'wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak'wala text, along with post-correction models, to produce a final high-quality transcription.
Zero-Shot Text-to-Speech for Vietnamese
Vu, Thi, Nguyen, Linh The, Nguyen, Dat Quoc
This paper introduces PhoAudiobook, a newly curated dataset comprising 941 hours of high-quality audio for Vietnamese text-to-speech. Using PhoAudiobook, we conduct experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. Our findings demonstrate that PhoAudiobook consistently enhances model performance across various metrics. Moreover, VALL-E and VoiceCraft exhibit superior performance in synthesizing short sentences, highlighting their robustness in handling diverse linguistic contexts. We publicly release PhoAudiobook to facilitate further research and development in Vietnamese text-to-speech.
Reviews: FastSpeech: Fast, Robust and Controllable Text to Speech
Originally: Although phoneme duration prediction is widely adopted in conventional TTS systems, jointly training it in a neural TTS model is new. This paper is one of the first works on non-autoregressive text-to-spectrogram modeling. Quality: This paper seems sound overall, expected for a few issues in the comments below. Some of these issues must be addressed before acceptance. Clarity: A well written paper. Significance: The advantages over its autoregressive counterparts are significant, especially for industrial use.
Reviews: FastSpeech: Fast, Robust and Controllable Text to Speech
The paper proposes a novel non-autoregressive parallelisation approach for mel-spectrogram intermediate representation TTS. The reviewers concur that the paper incorporates two novel explicit components to tts systems - length and duration modules and that the results on Speedup at inference and high-quality audio generations are relevant.
SHDocs: A dataset, benchmark, and method to efficiently generate high-quality, real-world specular highlight data with near-perfect alignment
A frequent problem in vision-based reasoning tasks such as object detection and optical character recognition (OCR) is the persistence of specular highlights. Specular highlights appear as bright spots of glare that occur due to the concentrated reflection of light; these spots manifest as image artifacts which occlude computer vision models and are challenging to reconstruct. Despite this, specular highlight removal receives relatively little attention due to the difficulty of acquiring high-quality, real-world data. We introduce a method to generate specular highlight data with near-perfect alignment and present SHDocs--a dataset of specular highlights on document images created using our method. Through our benchmark, we demonstrate that our dataset enables us to surpass the performance of state-of-the-art specular highlight removal models and downstream OCR tasks.
Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models
Wu, Zhanglin, Song, Tengfei, Xie, Ning, Zhu, Mengli, Zhang, Weidong, Wu, Shuang, Li, Pengfei, Li, Chong, Zhu, Junhao, Yang, Hao, Sun, Shiliang
The rapid advancement of large vision-language models (LVLMs) has significantly propelled applications in document understanding, particularly in optical character recognition (OCR) and multilingual translation. However, current evaluations of LVLMs, like the widely used OCRBench, mainly focus on verifying the correctness of their short-text responses and long-text responses with simple layout, while the evaluation of their ability to understand long texts with complex layout design is highly significant but largely overlooked. In this paper, we propose Menu OCR and Translation Benchmark (MOTBench), a specialized evaluation framework emphasizing the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish, along with its price and unit items on a menu, providing a comprehensive assessment of their visual understanding and language processing capabilities. Our benchmark is comprised of a collection of Chinese and English menus, characterized by intricate layouts, a variety of fonts, and culturally specific elements across different languages, along with precise human annotations. Experiments show that our automatic evaluation results are highly consistent with professional human evaluation. W e evaluate a range of publicly available state-of-the-art LVLMs, and through analyzing their output to identify the strengths and weaknesses in their performance, offering valuable insights to guide future advancements in LVLM development. MOTBench is available at https://github.com/gitwzl/MOTBench .
Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline
Madhavi, Hrishit, Cherian, Jacob, Khamkar, Yuvraj, Bhagat, Dhananjay
With the abundance of information in today's digital world, it is a major challenge to process voluminous text from news articles, reports, and web pages in an efficient manner. Text summarization solves this problem by providing brief, informative summaries of lengthy documents, both saving end-users time and mental effort [1]. Whereas traditional summarization methods involve only extractive approaches (identifying major sentences out of the source text) and abstractive approaches (producing new sentences capturing the core meaning), the current project outlines a holistic, multi-step NLP pipeline extending beyond mere summarization efforts [1]. The pipeline starts with Optical Character Recognition (OCR), which is achieved with Tesseract (Pytesseract). This module yields machine-readable text from images and handles various languages such as English, Hindi, Tamil, Urdu, Bengali, and Telugu [1]. The extracted information then passes through a chain of Natural Language Processing (NLP) and Machine Learning (ML) modules for more in-depth text analysis. The main elements of this pipeline are: The system combines state-of-the-art NLP features to boost text comprehension and processing.
A document processing pipeline for the construction of a dataset for topic modeling based on the judgments of the Italian Supreme Court
Marulli, Matteo, Panattoni, Glauco, Bertini, Marco
Topic modeling in Italian legal research is hindered by the lack of public datasets, limiting the analysis of legal themes in Supreme Court judgments. To address this, we developed a document processing pipeline that produces an anonymized dataset optimized for topic modeling. The pipeline integrates document layout analysis (YOLOv8x), optical character recognition, and text anonymization. The DLA module achieved a mAP@50 of 0.964 and a mAP@50-95 of 0.800. The OCR detector reached a mAP@50-95 of 0.9022, and the text recognizer (TrOCR) obtained a character error rate of 0.0047 and a word error rate of 0.0248. Compared to OCR-only methods, our dataset improved topic modeling with a diversity score of 0.6198 and a coherence score of 0.6638. We applied BERTopic to extract topics and used large language models to generate labels and summaries. Outputs were evaluated against domain expert interpretations. Claude Sonnet 3.7 achieved a BERTScore F1 of 0.8119 for labeling and 0.9130 for summarization.
A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports
Schรคfer, Henning, Schmidt, Cynthia S., Wutzkowsky, Johannes, Lorek, Kamil, Reinartz, Lea, Rรผckert, Johannes, Temme, Christian, Bรถckmann, Britta, Horn, Peter A., Friedrich, Christoph M.
Despite the growing adoption of electronic health records, many processes still rely on paper documents, reflecting the heterogeneous real-world conditions in which healthcare is delivered. The manual transcription process is time-consuming and prone to errors when transferring paper-based data to digital formats. To streamline this workflow, this study presents an open-source pipeline that extracts and categorizes checkbox data from scanned documents. Demonstrated on transfusion reaction reports, the design supports adaptation to other checkbox-rich document types. The proposed method integrates checkbox detection, multilingual optical character recognition (OCR) and multilingual vision-language models (VLMs). The pipeline achieves high precision and recall compared against annually compiled gold-standards from 2017 to 2024. The result is a reduction in administrative workload and accurate regulatory reporting. The open-source availability of this pipeline encourages self-hosted parsing of checkbox forms.
Use of Metric Learning for the Recognition of Handwritten Digits, and its Application to Increase the Outreach of Voice-based Communication Platforms
Pant, Devesh, Talukder, Dibyendu, Kumar, Deepak, Pandey, Rachit, Seth, Aaditeshwar, Arora, Chetan
Initiation, monitoring, and evaluation of development programmes can involve field-based data collection about project activities. This data collection through digital devices may not always be feasible though, for reasons such as unaffordability of smartphones and tablets by field-based cadre, or shortfalls in their training and capacity building. Paper-based data collection has been argued to be more appropriate in several contexts, with automated digitization of the paper forms through OCR (Optical Character Recognition) and OMR (Optical Mark Recognition) techniques. We contribute with providing a large dataset of handwritten digits, and deep learning based models and methods built using this data, that are effective in real-world environments. We demonstrate the deployment of these tools in the context of a maternal and child health and nutrition awareness project, which uses IVR (Interactive Voice Response) systems to provide awareness information to rural women SHG (Self Help Group) members in north India. Paper forms were used to collect phone numbers of the SHG members at scale, which were digitized using the OCR tools developed by us, and used to push almost 4 million phone calls. The data, model, and code have been released in the open-source domain.