Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Wang, Baode, Wu, Biao, Li, Weizhen, Fang, Meng, Huang, Zuming, Huang, Jun, Wang, Haozhe, Liang, Yanjie, Chen, Ling, Chu, Wei, Qi, Yuan
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned documents with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
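The composite reward the abstract describes can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual code: the equal weighting, the `normalized_edit_distance` helper, and the paragraph-exact-match order term are all assumptions.

```python
# Hypothetical sketch of a layoutRL-style composite reward.
# Weights and helper definitions are assumptions, not the paper's implementation.

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (pred[i - 1] != ref[j - 1]))  # substitution
        prev = cur
    return prev[n] / max(m, n)

def composite_reward(pred_paras: list[str], ref_paras: list[str]) -> float:
    """Combine edit-distance, paragraph-count, and reading-order terms (equal weights assumed)."""
    pred_text, ref_text = "\n".join(pred_paras), "\n".join(ref_paras)
    r_edit = 1.0 - normalized_edit_distance(pred_text, ref_text)
    r_count = 1.0 - abs(len(pred_paras) - len(ref_paras)) / max(len(ref_paras), 1)
    # Reading-order term: fraction of paragraphs that land at the same position.
    order_hits = sum(1 for p, r in zip(pred_paras, ref_paras) if p == r)
    r_order = order_hits / max(len(ref_paras), 1)
    return (r_edit + r_count + r_order) / 3.0
```

A perfect parse scores 1.0; each term degrades independently, so the policy is penalized for garbled text, merged/split paragraphs, and reordered content separately.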
LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation
Sobhan, Shadman, Haque, Mohammad Ariful
Large Language Models (LLMs) are capable of natural language understanding and generation, but they face challenges such as hallucination and outdated knowledge. Fine-tuning is one possible solution, but it is resource-intensive and must be repeated with every data update. Retrieval-Augmented Generation (RAG) offers an efficient alternative by allowing LLMs to access external knowledge sources. However, traditional RAG pipelines struggle to retrieve information from complex technical documents with structured data such as tables and images. In this work, we propose a RAG pipeline for technical documents that handles tables and images and supports both scanned and searchable formats. Its retrieval process combines vector similarity search with a fine-tuned reranker based on Gemma-2-9b-it. The reranker is trained using RAFT (Retrieval-Augmented Fine-Tuning) on a custom dataset designed to improve context identification for question answering. Our evaluation demonstrates that the proposed pipeline achieves a high faithfulness score of 94% (RAGAS) and 96% (DeepEval), and an answer relevancy score of 87% (RAGAS) and 93% (DeepEval). Comparative analysis demonstrates that the proposed architecture outperforms general RAG pipelines on table-based questions and on handling questions outside the provided context.
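The retrieval flow the abstract outlines is a standard two-stage retrieve-then-rerank pattern. A minimal sketch, with toy scoring functions standing in for the real embedding model and the Gemma-2-9b-it reranker (both of which are assumptions here):

```python
# Hedged sketch of two-stage retrieval: vector-similarity shortlist, then rerank.
# `rerank_fn` stands in for the fine-tuned Gemma-2 reranker described above.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_then_rerank(query_vec, corpus, rerank_fn, k=10, top=3):
    """Stage 1: cosine-similarity shortlist of k chunks.
    Stage 2: the reranker picks the final contexts from the shortlist."""
    shortlist = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)[:k]
    return sorted(shortlist, key=lambda d: rerank_fn(d["text"]), reverse=True)[:top]
```

The design point is that the cheap vector search bounds the candidate set, so the expensive reranker only scores k chunks per query.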
Predicting the Past: Estimating Historical Appraisals with OCR and Machine Learning
Bhaskar, Mihir, Luo, Jun Tao, Geng, Zihan, Hajra, Asmita, Howell, Junia, Gormley, Matthew R.
Despite well-documented consequences of the U.S. government's 1930s housing policies on racial wealth disparities, scholars have struggled to quantify its precise financial effects due to the inaccessibility of historical property appraisal records. Many counties still store these records in physical formats, making large-scale quantitative analysis difficult. We present an approach scholars can use to digitize historical housing assessment data, applying it to build and release a dataset for one county. Starting from publicly available scanned documents, we manually annotated property cards for over 12,000 properties to train and validate our methods. We use OCR to label data for an additional 50,000 properties, based on our two-stage approach combining classical computer vision techniques with deep learning-based OCR. For cases where OCR cannot be applied, such as when scanned documents are not available, we show how a regression model based on building feature data can estimate the historical values, and test the generalizability of this model to other counties. With these cost-effective tools, scholars, community activists, and policy makers can better analyze and understand the historical impacts of redlining.
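For the no-scan fallback the abstract mentions, the core idea is a regression from building features to appraisal value. An illustrative ordinary-least-squares sketch; the features, values, and use of plain normal equations are all assumptions, not the authors' model:

```python
# Illustrative OLS fit (normal equations + Gaussian elimination), standing in for
# the regression model that estimates historical values from building features.

def fit_ols(features, targets):
    """Fit y ~ intercept + features via the normal equations (small feature counts)."""
    X = [[1.0] + list(row) for row in features]
    n = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]  # X^T X
    b = [sum(r[i] * t for r, t in zip(X, targets)) for i in range(n)]        # X^T y
    for col in range(n):                       # elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):             # back-substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, n))) / A[r][r]
    return beta

def predict(beta, row):
    """Apply the fitted coefficients to one feature row."""
    return beta[0] + sum(w * x for w, x in zip(beta[1:], row))
```

Testing such a model on a held-out county, as the paper does, is what probes whether the feature-to-value relationship generalizes beyond the training county.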
A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports
Schäfer, Henning, Schmidt, Cynthia S., Wutzkowsky, Johannes, Lorek, Kamil, Reinartz, Lea, Rückert, Johannes, Temme, Christian, Böckmann, Britta, Horn, Peter A., Friedrich, Christoph M.
Despite the growing adoption of electronic health records, many processes still rely on paper documents, reflecting the heterogeneous real-world conditions in which healthcare is delivered. The manual transcription process is time-consuming and prone to errors when transferring paper-based data to digital formats. To streamline this workflow, this study presents an open-source pipeline that extracts and categorizes checkbox data from scanned documents. Demonstrated on transfusion reaction reports, the design supports adaptation to other checkbox-rich document types. The proposed method integrates checkbox detection, multilingual optical character recognition (OCR) and multilingual vision-language models (VLMs). The pipeline achieves high precision and recall when evaluated against annually compiled gold standards from 2017 to 2024. The result is a reduction in administrative workload and more accurate regulatory reporting. The open-source availability of this pipeline encourages self-hosted parsing of checkbox forms.
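The checkbox-state decision at the heart of such a pipeline can be as simple as an ink-fill ratio on a binarized crop. A toy sketch under stated assumptions: the 0.15 threshold and the pre-binarized input are illustrative choices, not the pipeline's tuned values.

```python
# Toy checkbox-state classifier: given a binarized checkbox crop
# (1 = ink, 0 = background), decide checked vs. unchecked by fill ratio.
# The threshold is an assumption, not the pipeline's calibrated value.

def checkbox_state(crop, threshold=0.15):
    """Return "checked" if the fraction of ink pixels meets the threshold."""
    total = sum(len(row) for row in crop)
    ink = sum(sum(row) for row in crop)
    return "checked" if total and ink / total >= threshold else "unchecked"
```

In practice the detection step must first localize each checkbox, and the VLM handles ambiguous marks (stray strokes, partial fills) that a fixed threshold misreads.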
Transformers and Language Models in Form Understanding: A Comprehensive Review of Scanned Document Analysis
Abdallah, Abdelrahman, Eberharter, Daniel, Pfister, Zoe, Jatowt, Adam
This paper presents a comprehensive survey of research works on the topic of form understanding in the context of scanned documents. We delve into recent advancements and breakthroughs in the field, highlighting the significance of language models and transformers in solving this challenging task. Our research methodology involves an in-depth analysis of popular documents and form-understanding trends over the last decade, enabling us to offer valuable insights into the evolution of this domain. Focusing on cutting-edge models, we showcase how transformers have propelled the field forward, revolutionizing form-understanding techniques. Our exploration includes an extensive examination of state-of-the-art language models designed to effectively tackle the complexities of noisy scanned documents. Furthermore, we present an overview of the latest and most relevant datasets, which serve as essential benchmarks for evaluating the performance of selected models. By comparing and contrasting the capabilities of these models, we aim to provide researchers and practitioners with useful guidance in choosing the most suitable solutions for their specific form understanding tasks.
Document Understanding for Healthcare Referrals
Mistry, Jimit, Arzeno, Natalia M.
Reliance on scanned documents and fax communication for healthcare referrals leads to high administrative costs and errors that may affect patient care. In this work we propose a hybrid model leveraging LayoutLMv3 along with domain-specific rules to identify key patient, physician, and exam-related entities in faxed referral documents. We explore some of the challenges in applying a document understanding model to referrals, which have formats varying by medical practice, and evaluate model performance using MUC-5 metrics to obtain appropriate metrics for the practical use case. Our analysis shows the addition of domain-specific rules to the transformer model yields greatly increased precision and F1 scores, suggesting a hybrid model trained on a curated dataset can increase efficiency in referral management.
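The hybrid model-plus-rules idea can be sketched concisely: keep the transformer's entity predictions, and let a domain rule add entities the model missed. A minimal illustration; the phone-number regex and entity dict shape are hypothetical, not the paper's actual rules.

```python
# Hedged sketch of hybrid entity extraction: transformer predictions augmented
# by a domain rule. The regex and entity schema here are illustrative only.
import re

PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def hybrid_entities(text, model_entities):
    """Merge model predictions with rule-based matches, skipping duplicate spans."""
    entities = list(model_entities)
    for m in PHONE_RE.finditer(text):
        span = (m.start(), m.end())
        if not any(e["span"] == span for e in entities):
            entities.append({"label": "PHONE", "span": span, "text": m.group()})
    return entities
```

High-precision rules like this one are a plausible source of the precision and F1 gains the abstract reports, since they fire only on unambiguous surface patterns.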
FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt
Qi, Zhixiao, Yu, Yijiong, Tu, Meiqi, Tan, Junyi, Huang, Yongfeng
Large language models (LLMs) [1] have gained significant research importance in the field of natural language processing. Models such as ChatGPT, LLaMA [2], GPT-4, ChatGLM [3], and PaLM [4] have demonstrated outstanding performance in downstream tasks. The powerful ability of LLMs in understanding human instructions has led to continuous research on LLMs in various vertical domains. ChatLaw [5] is based on Ziya-LLaMA-13B and utilizes legal data for instruction fine-tuning, incorporating vector database retrieval to create a legal LLM. DoctorGLM [6] is built upon ChatGLM-6B and fine-tuned using Chinese medical dialogue datasets to create a Chinese medical consultation model. BenTsao is based on LLaMA-7B and constructs a Chinese medical LLM by leveraging a medical knowledge graph and the GPT-3.5 API to build a Chinese medical instruction dataset. Cornucopia, on the other hand, is based on LLaMA-7B and constructs an instruction dataset using Chinese financial public data and crawled financial data, focusing on question-answering in the financial domain. Previous research assumes that the base models already contain the corresponding domain knowledge, and hence performs no incremental pre-training on them.
Transformer-Based UNet with Multi-Headed Cross-Attention Skip Connections to Eliminate Artifacts in Scanned Documents
The extraction of text in high quality is essential for text-based document analysis tasks like Document Classification or Named Entity Recognition. Unfortunately, this is not always ensured, as poor scan quality and the resulting artifacts lead to errors in the Optical Character Recognition (OCR) process. Current approaches using Convolutional Neural Networks show promising results for background removal tasks but fail to correct artifacts like pixelation or compression errors. For general images, Transformer backbones are being integrated more frequently into well-known neural network structures for denoising tasks. In this work, a modified UNet structure using a Swin Transformer backbone is presented to remove typical artifacts in scanned documents. Multi-headed cross-attention skip connections are used to more selectively learn features in respective levels of abstraction. The performance of this approach is examined with regard to compression errors, pixelation and random noise. An improvement in text extraction quality, with an error rate reduced by up to 53.9% on the synthetic data, is achieved. The pretrained base model can be easily adapted to new artifacts. The cross-attention skip connections make it possible to integrate textual information, extracted from the encoder or supplied in the form of commands, to more selectively control the model's output. The latter is shown by means of an example application.
End-to-End Document Classification and Key Information Extraction using Assignment Optimization
Cooney, Ciaran, Cavadas, Joana, Madigan, Liam, Savage, Bradley, Heyburn, Rachel, O'Cuinn, Mairead
We propose end-to-end document classification and key information extraction (KIE) for automating document processing in forms. Through accurate document classification we harness known information from templates to enhance KIE from forms. We use text and layout encoding with a cosine similarity measure to classify visually similar documents. We then demonstrate a novel application of mixed integer programming by using assignment optimization to extract key information from documents. Our approach is validated on an in-house dataset of noisy scanned forms. The best performing document classification approach achieved a 0.97 F1 score. A mean F1 score of 0.94 for the KIE task suggests there is significant potential in applying optimization techniques. Ablation results show that the method relies on document preprocessing techniques to mitigate Type II errors and achieve optimal performance.
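The assignment-optimization step can be illustrated with a toy formulation: match extracted text spans to known template fields so that total assignment cost is minimized. This sketch brute-forces the search over permutations for clarity; the paper uses mixed integer programming, and the cost matrix here is hypothetical.

```python
# Hedged sketch of KIE as assignment optimization: find the span-to-field
# assignment with minimum total cost. Brute force over permutations is shown
# for illustration; the paper formulates this as a mixed integer program.
from itertools import permutations

def best_assignment(cost):
    """cost[i][j]: cost of assigning span i to field j (square matrix).
    Returns (field index for each span, total cost of that assignment)."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best:
            best, best_perm = c, perm
    return list(best_perm), best
```

For realistic form sizes an exact solver (e.g. the Hungarian algorithm or an integer-programming solver) replaces the factorial-time enumeration; the objective is the same.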
OCR is getting super cool for Businesses
A few months back, a student in class captured an image of the notes made by the student in front of him and used iOS 15's recent text-recognition feature to highlight the text and copy and paste it into his own notes. This instance was tweeted by @juanbuis, who shared a video of the student making the most of iOS 15's Live Text OCR feature. This cool OCR, or Optical Character Recognition, feature that the student used is generally applied to pull information from text or documents and convert it into a machine-readable form. Recently, the popular app developer Alessandro Paluzzi also noticed that Twitter is working on an OCR (optical character recognition) feature for alt-text descriptions. In his tweet, Alessandro Paluzzi shared a short video demonstrating how this Twitter feature will function. At Dwarf AI, we too want to make this super cool technology easily accessible to other businesses.