Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Wang, Baode, Wu, Biao, Li, Weizhen, Fang, Meng, Huang, Zuming, Huang, Jun, Wang, Haozhe, Liang, Yanjie, Chen, Ling, Chu, Wei, Qi, Yuan
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned documents with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
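The composite reward the abstract describes can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual code: the equal weighting, the `normalized_edit_distance` helper, and the paragraph-exact-match order term are all assumptions.

```python
# Hypothetical sketch of a layoutRL-style composite reward.
# Weights and helper definitions are assumptions, not the paper's implementation.

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (pred[i - 1] != ref[j - 1]))  # substitution
        prev = cur
    return prev[n] / max(m, n)

def composite_reward(pred_paras: list[str], ref_paras: list[str]) -> float:
    """Combine edit-distance, paragraph-count, and reading-order terms (equal weights assumed)."""
    pred_text, ref_text = "\n".join(pred_paras), "\n".join(ref_paras)
    r_edit = 1.0 - normalized_edit_distance(pred_text, ref_text)
    r_count = 1.0 - abs(len(pred_paras) - len(ref_paras)) / max(len(ref_paras), 1)
    # Reading-order term: fraction of paragraphs that land at the same position.
    order_hits = sum(1 for p, r in zip(pred_paras, ref_paras) if p == r)
    r_order = order_hits / max(len(ref_paras), 1)
    return (r_edit + r_count + r_order) / 3.0
```

A perfect parse scores 1.0; each term degrades independently, so the policy is penalized for garbled text, merged/split paragraphs, and reordered content separately.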
LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation
Sobhan, Shadman, Haque, Mohammad Ariful
Large Language Models (LLMs) are capable of natural language understanding and generation, but they face challenges such as hallucination and outdated knowledge. Fine-tuning is one possible solution, but it is resource-intensive and must be repeated with every data update. Retrieval-Augmented Generation (RAG) offers an efficient alternative by allowing LLMs to access external knowledge sources. However, traditional RAG pipelines struggle to retrieve information from complex technical documents with structured data such as tables and images. In this work, we propose a RAG pipeline for technical documents that handles tables and images and supports both scanned and searchable formats. Its retrieval process combines vector similarity search with a fine-tuned reranker based on Gemma-2-9b-it. The reranker is trained using RAFT (Retrieval-Augmented Fine-Tuning) on a custom dataset designed to improve context identification for question answering. Our evaluation demonstrates that the proposed pipeline achieves a high faithfulness score of 94% (RAGAS) and 96% (DeepEval), and an answer relevancy score of 87% (RAGAS) and 93% (DeepEval). Comparative analysis demonstrates that the proposed architecture outperforms general RAG pipelines on table-based questions and on handling questions outside the provided context.
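The retrieval flow the abstract outlines is a standard two-stage retrieve-then-rerank pattern. A minimal sketch, with toy scoring functions standing in for the real embedding model and the Gemma-2-9b-it reranker (both of which are assumptions here):

```python
# Hedged sketch of two-stage retrieval: vector-similarity shortlist, then rerank.
# `rerank_fn` stands in for the fine-tuned Gemma-2 reranker described above.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_then_rerank(query_vec, corpus, rerank_fn, k=10, top=3):
    """Stage 1: cosine-similarity shortlist of k chunks.
    Stage 2: the reranker picks the final contexts from the shortlist."""
    shortlist = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)[:k]
    return sorted(shortlist, key=lambda d: rerank_fn(d["text"]), reverse=True)[:top]
```

The design point is that the cheap vector search bounds the candidate set, so the expensive reranker only scores k chunks per query.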
Predicting the Past: Estimating Historical Appraisals with OCR and Machine Learning
Bhaskar, Mihir, Luo, Jun Tao, Geng, Zihan, Hajra, Asmita, Howell, Junia, Gormley, Matthew R.
Despite well-documented consequences of the U.S. government's 1930s housing policies on racial wealth disparities, scholars have struggled to quantify its precise financial effects due to the inaccessibility of historical property appraisal records. Many counties still store these records in physical formats, making large-scale quantitative analysis difficult. We present an approach scholars can use to digitize historical housing assessment data, applying it to build and release a dataset for one county. Starting from publicly available scanned documents, we manually annotated property cards for over 12,000 properties to train and validate our methods. We use OCR to label data for an additional 50,000 properties, based on our two-stage approach combining classical computer vision techniques with deep learning-based OCR. For cases where OCR cannot be applied, such as when scanned documents are not available, we show how a regression model based on building feature data can estimate the historical values, and test the generalizability of this model to other counties. With these cost-effective tools, scholars, community activists, and policy makers can better analyze and understand the historical impacts of redlining.
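For the no-scan fallback the abstract mentions, the core idea is a regression from building features to appraisal value. An illustrative ordinary-least-squares sketch; the features, values, and use of plain normal equations are all assumptions, not the authors' model:

```python
# Illustrative OLS fit (normal equations + Gaussian elimination), standing in for
# the regression model that estimates historical values from building features.

def fit_ols(features, targets):
    """Fit y ~ intercept + features via the normal equations (small feature counts)."""
    X = [[1.0] + list(row) for row in features]
    n = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]  # X^T X
    b = [sum(r[i] * t for r, t in zip(X, targets)) for i in range(n)]        # X^T y
    for col in range(n):                       # elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):             # back-substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, n))) / A[r][r]
    return beta

def predict(beta, row):
    """Apply the fitted coefficients to one feature row."""
    return beta[0] + sum(w * x for w, x in zip(beta[1:], row))
```

Testing such a model on a held-out county, as the paper does, is what probes whether the feature-to-value relationship generalizes beyond the training county.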
A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports
Schäfer, Henning, Schmidt, Cynthia S., Wutzkowsky, Johannes, Lorek, Kamil, Reinartz, Lea, Rückert, Johannes, Temme, Christian, Böckmann, Britta, Horn, Peter A., Friedrich, Christoph M.
Despite the growing adoption of electronic health records, many processes still rely on paper documents, reflecting the heterogeneous real-world conditions in which healthcare is delivered. The manual transcription process is time-consuming and prone to errors when transferring paper-based data to digital formats. To streamline this workflow, this study presents an open-source pipeline that extracts and categorizes checkbox data from scanned documents. Demonstrated on transfusion reaction reports, the design supports adaptation to other checkbox-rich document types. The proposed method integrates checkbox detection, multilingual optical character recognition (OCR) and multilingual vision-language models (VLMs). The pipeline achieves high precision and recall when evaluated against annually compiled gold standards from 2017 to 2024. The result is a reduction in administrative workload and more accurate regulatory reporting. The open-source availability of this pipeline encourages self-hosted parsing of checkbox forms.
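The checkbox-state decision at the heart of such a pipeline can be as simple as an ink-fill ratio on a binarized crop. A toy sketch under stated assumptions: the 0.15 threshold and the pre-binarized input are illustrative choices, not the pipeline's tuned values.

```python
# Toy checkbox-state classifier: given a binarized checkbox crop
# (1 = ink, 0 = background), decide checked vs. unchecked by fill ratio.
# The threshold is an assumption, not the pipeline's calibrated value.

def checkbox_state(crop, threshold=0.15):
    """Return "checked" if the fraction of ink pixels meets the threshold."""
    total = sum(len(row) for row in crop)
    ink = sum(sum(row) for row in crop)
    return "checked" if total and ink / total >= threshold else "unchecked"
```

In practice the detection step must first localize each checkbox, and the VLM handles ambiguous marks (stray strokes, partial fills) that a fixed threshold misreads.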
Transformers and Language Models in Form Understanding: A Comprehensive Review of Scanned Document Analysis
Abdallah, Abdelrahman, Eberharter, Daniel, Pfister, Zoe, Jatowt, Adam
This paper presents a comprehensive survey of research works on the topic of form understanding in the context of scanned documents. We delve into recent advancements and breakthroughs in the field, highlighting the significance of language models and transformers in solving this challenging task. Our research methodology involves an in-depth analysis of popular documents and form-understanding trends over the last decade, enabling us to offer valuable insights into the evolution of this domain. Focusing on cutting-edge models, we showcase how transformers have propelled the field forward, revolutionizing form-understanding techniques. Our exploration includes an extensive examination of state-of-the-art language models designed to effectively tackle the complexities of noisy scanned documents. Furthermore, we present an overview of the latest and most relevant datasets, which serve as essential benchmarks for evaluating the performance of selected models. By comparing and contrasting the capabilities of these models, we aim to provide researchers and practitioners with useful guidance in choosing the most suitable solutions for their specific form understanding tasks.
Document Understanding for Healthcare Referrals
Mistry, Jimit, Arzeno, Natalia M.
Reliance on scanned documents and fax communication for healthcare referrals leads to high administrative costs and errors that may affect patient care. In this work we propose a hybrid model leveraging LayoutLMv3 along with domain-specific rules to identify key patient, physician, and exam-related entities in faxed referral documents. We explore some of the challenges in applying a document understanding model to referrals, which have formats varying by medical practice, and evaluate model performance using MUC-5 metrics to obtain appropriate metrics for the practical use case. Our analysis shows the addition of domain-specific rules to the transformer model yields greatly increased precision and F1 scores, suggesting a hybrid model trained on a curated dataset can increase efficiency in referral management.
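The hybrid model-plus-rules idea can be sketched concisely: keep the transformer's entity predictions, and let a domain rule add entities the model missed. A minimal illustration; the phone-number regex and entity dict shape are hypothetical, not the paper's actual rules.

```python
# Hedged sketch of hybrid entity extraction: transformer predictions augmented
# by a domain rule. The regex and entity schema here are illustrative only.
import re

PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def hybrid_entities(text, model_entities):
    """Merge model predictions with rule-based matches, skipping duplicate spans."""
    entities = list(model_entities)
    for m in PHONE_RE.finditer(text):
        span = (m.start(), m.end())
        if not any(e["span"] == span for e in entities):
            entities.append({"label": "PHONE", "span": span, "text": m.group()})
    return entities
```

High-precision rules like this one are a plausible source of the precision and F1 gains the abstract reports, since they fire only on unambiguous surface patterns.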
FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt
Qi, Zhixiao, Yu, Yijiong, Tu, Meiqi, Tan, Junyi, Huang, Yongfeng
Large language models (LLMs) [1] have gained significant research importance in the field of natural language processing. Models such as ChatGPT, LLaMA [2], GPT-4, ChatGLM [3], and PaLM [4] have demonstrated outstanding performance in downstream tasks. The powerful ability of LLMs in understanding human instructions has led to continuous research on LLMs in various vertical domains. ChatLaw [5] is based on Ziya-LLaMA-13B and utilizes legal data for instruction fine-tuning, incorporating vector database retrieval to create a legal LLM. DoctorGLM [6] is built upon ChatGLM-6B and fine-tuned using Chinese medical dialogue datasets to create a Chinese medical consultation model. BenTsao is based on LLaMA-7B and constructs a Chinese medical LLM by leveraging a medical knowledge graph and the GPT-3.5 API to build a Chinese medical instruction dataset. Cornucopia, on the other hand, is based on LLaMA-7B and constructs an instruction dataset using Chinese financial public data and crawled financial data, focusing on question-answering in the financial domain. Previous research assumes that the base models already contain the corresponding domain knowledge, and hence performs no incremental pre-training on them.
Transformer-Based UNet with Multi-Headed Cross-Attention Skip Connections to Eliminate Artifacts in Scanned Documents
The extraction of text in high quality is essential for text-based document analysis tasks like Document Classification or Named Entity Recognition. Unfortunately, this is not always ensured, as poor scan quality and the resulting artifacts lead to errors in the Optical Character Recognition (OCR) process. Current approaches using Convolutional Neural Networks show promising results for background removal tasks but fail to correct artifacts like pixelation or compression errors. For general images, Transformer backbones are being integrated more frequently into well-known neural network structures for denoising tasks. In this work, a modified UNet structure using a Swin Transformer backbone is presented to remove typical artifacts in scanned documents. Multi-headed cross-attention skip connections are used to more selectively learn features in respective levels of abstraction. The performance of this approach is examined with regard to compression errors, pixelation and random noise. An improvement in text extraction quality, with an error rate reduced by up to 53.9% on the synthetic data, is achieved. The pretrained base model can be easily adapted to new artifacts. The cross-attention skip connections make it possible to integrate textual information, extracted from the encoder or supplied in the form of commands, to more selectively control the model's output. The latter is shown by means of an example application.
End-to-End Document Classification and Key Information Extraction using Assignment Optimization
Cooney, Ciaran, Cavadas, Joana, Madigan, Liam, Savage, Bradley, Heyburn, Rachel, O'Cuinn, Mairead
We propose end-to-end document classification and key information extraction (KIE) for automating document processing in forms. Through accurate document classification we harness known information from templates to enhance KIE from forms. We use text and layout encoding with a cosine similarity measure to classify visually similar documents. We then demonstrate a novel application of mixed integer programming by using assignment optimization to extract key information from documents. Our approach is validated on an in-house dataset of noisy scanned forms. The best performing document classification approach achieved a 0.97 F1 score. A mean F1 score of 0.94 for the KIE task suggests there is significant potential in applying optimization techniques. Ablation results show that the method relies on document preprocessing techniques to mitigate Type II errors and achieve optimal performance.
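The assignment-optimization step can be illustrated with a toy formulation: match extracted text spans to known template fields so that total assignment cost is minimized. This sketch brute-forces the search over permutations for clarity; the paper uses mixed integer programming, and the cost matrix here is hypothetical.

```python
# Hedged sketch of KIE as assignment optimization: find the span-to-field
# assignment with minimum total cost. Brute force over permutations is shown
# for illustration; the paper formulates this as a mixed integer program.
from itertools import permutations

def best_assignment(cost):
    """cost[i][j]: cost of assigning span i to field j (square matrix).
    Returns (field index for each span, total cost of that assignment)."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best:
            best, best_perm = c, perm
    return list(best_perm), best
```

For realistic form sizes an exact solver (e.g. the Hungarian algorithm or an integer-programming solver) replaces the factorial-time enumeration; the objective is the same.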
OCR is getting super cool for Businesses
A few months back, a student in class captured an image of the notes made by the student in front of him and used iOS 15's recent text-recognition feature to highlight the text and copy and paste it into his own notes. This instance was tweeted by @juanbuis, who shared a video of the student making the most of iOS 15's Live Text OCR feature. This cool OCR, or Optical Character Recognition, feature that the student used is generally applied to pull information from text or documents and convert it into a machine-readable form. Recently, the popular app developer Alessandro Paluzzi also noticed that Twitter is working on an OCR (optical character recognition) feature for alt-text descriptions. In his tweet, Alessandro Paluzzi shared a short video demonstrating how this Twitter feature will function. At Dwarf AI, we too want to make this super cool technology easily accessible to other businesses.