unstructured document
- Asia > Singapore (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- Asia > Indonesia > Bali (0.04)
- (5 more...)
- Media (0.46)
- Banking & Finance (0.46)
- Leisure & Entertainment (0.46)
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
Miao, Ziyang; Sun, Qiyu; Wang, Jingyuan; Gong, Yuchen; Zheng, Yaowei; Li, Shiqi; Zhang, Richong
Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents effectively. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to easily configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then leverages a persona-driven prompting approach to generate diverse question-answer pairs using publicly available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at https://github.com/ConardLi/easy-dataset and have garnered over 9,000 GitHub stars.
- Law > Taxation Law (0.46)
- Government > Tax (0.46)
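The persona-driven generation step described in the Easy Dataset abstract can be sketched as follows. This is an illustrative assumption, not the project's actual implementation: the persona list, prompt template, and `llm` callable are all hypothetical stand-ins.

```python
# Illustrative sketch of persona-driven QA-pair synthesis from text chunks.
# PERSONAS, PROMPT_TEMPLATE, and the `llm` callable are assumptions for
# illustration, not the actual Easy Dataset code.

PERSONAS = [
    "a retail-banking compliance officer",
    "a first-year finance student",
]

PROMPT_TEMPLATE = (
    "You are {persona}. Read the passage below and write one question "
    "it answers, then the answer.\n\nPassage:\n{chunk}"
)

def synthesize_qa(chunks, llm):
    """Ask the LLM for one QA pair per (persona, chunk) combination.

    Varying the persona for the same chunk is what drives diversity in
    the generated questions."""
    pairs = []
    for chunk in chunks:
        for persona in PERSONAS:
            prompt = PROMPT_TEMPLATE.format(persona=persona, chunk=chunk)
            pairs.append({"persona": persona, "chunk": chunk,
                          "raw_output": llm(prompt)})
    return pairs
```

Each record keeps the persona and source chunk alongside the raw model output, so a human-in-the-loop review pass can trace every QA pair back to its origin.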
ViBERTgrid BiLSTM-CRF: Multimodal Key Information Extraction from Unstructured Financial Documents
Pala, Furkan; Akpınar, Mehmet Yasin; Deniz, Onur; Eryiğit, Gülşen
Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. This paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid, previously explored on semi-structured documents) to unstructured financial documents by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in the financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > Switzerland > Geneva > Geneva (0.04)
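The CRF layer mentioned in the ViBERTgrid BiLSTM-CRF abstract decodes the most likely tag sequence from per-token emission scores plus learned tag-transition scores. A minimal pure-Python Viterbi decoder is sketched below with toy scores; in the actual model, emissions come from the BiLSTM and the transition matrix is learned during training.

```python
# Minimal Viterbi decoder: the inference step a CRF layer performs on top
# of BiLSTM emission scores. Toy values only; the real model learns both
# the emissions and the transition matrix.

def viterbi_decode(emissions, transitions):
    """emissions: [T][K] per-token tag scores; transitions: [K][K] score
    for moving from tag i to tag j. Returns the best tag index sequence."""
    n_tags = len(emissions[0])
    score = list(emissions[0])  # best path score ending in each tag
    back = []                   # backpointers, one row per later timestep
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_tags):
            cands = [score[i] + transitions[i][j] for i in range(n_tags)]
            best_i = max(range(n_tags), key=lambda i: cands[i])
            new_score.append(cands[best_i] + emissions[t][j])
            ptr.append(best_i)
        score, back = new_score, back + [ptr]
    # Trace back from the best final tag
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(back):
        best = ptr[best]
        path.append(best)
    return list(reversed(path))
```

The transition scores are what let a CRF enforce sequence-level constraints (e.g., an I- tag cannot follow O in BIO tagging), which plain per-token classification cannot.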
UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis
Hui, Yulong; Lu, Yao; Zhang, Huanchen
The use of Retrieval-Augmented Generation (RAG) has improved Large Language Models (LLMs) in collaborating with external data, yet significant challenges exist in real-world scenarios. In areas such as academic literature and finance question answering, data are often found in raw text and tables in HTML or PDF formats, which can be lengthy and highly unstructured. In this paper, we introduce a benchmark suite, namely Unstructured Document Analysis (UDA), that involves 2,965 real-world documents and 29,590 expert-annotated Q&A pairs. We revisit popular LLM- and RAG-based solutions for document analysis and evaluate the design choices and answer qualities across multiple document domains and diverse query types. Our evaluation yields interesting findings and highlights the importance of data parsing and retrieval. We hope our benchmark can shed light on and better serve real-world document analysis applications. The benchmark suite and code can be found at https://github.com/qinchuanhui/UDA-Benchmark.
- Asia > Singapore (0.05)
- South America (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (8 more...)
- Leisure & Entertainment (0.67)
- Banking & Finance (0.46)
- Media > Television (0.46)
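The retrieve-then-read pattern that the UDA benchmark evaluates can be illustrated with a deliberately simple retriever. The overlap scoring below is a toy assumption; production systems use TF-IDF, BM25, or dense embeddings, and UDA's finding is precisely that this retrieval (and parsing) choice matters.

```python
# Toy retriever illustrating the retrieve-then-read step of a RAG
# pipeline. Scoring is plain token overlap; real systems use TF-IDF,
# BM25, or dense embeddings.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, chunks, k=2):
    """Return the k chunks sharing the most tokens with the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)),
                    reverse=True)
    return ranked[:k]
```

The retrieved chunks would then be placed into the LLM prompt as context, so retrieval quality directly bounds answer quality.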
Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models
Sun, Qiang; Luo, Yuanyi; Zhang, Wenxiao; Li, Sirui; Li, Jichunyang; Niu, Kai; Kong, Xiangrui; Liu, Wei
Even by a conservative estimate, 80% of enterprise data reside in unstructured files, stored in data lakes that accommodate heterogeneous formats. Classical search engines can no longer meet information-seeking needs, especially when the task is to browse and explore for insight formulation. In other words, there are no obvious search keywords to use. Knowledge graphs, due to their natural visual appeal that reduces human cognitive load, become the winning candidate for heterogeneous data integration and knowledge representation. In this paper, we introduce Docs2KG, a novel framework designed to extract multimodal information from diverse and heterogeneous unstructured documents, including emails, web pages, PDF files, and Excel files. By dynamically generating a unified knowledge graph that represents the extracted key information, Docs2KG enables efficient querying and exploration of document data lakes. Unlike existing approaches that focus on domain-specific data sources or pre-designed schemas, Docs2KG offers a flexible and extensible solution that can adapt to various document structures and content types. The proposed framework unifies data processing, supporting a multitude of downstream tasks with improved domain interpretability. Docs2KG is publicly accessible at https://docs2kg.ai4wa.com, and a demonstration video is available at https://docs2kg.ai4wa.com/Video.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Oceania > Australia > Western Australia > Perth (0.08)
- Asia > China > Hong Kong (0.05)
- (4 more...)
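The document-to-knowledge-graph step in the Docs2KG abstract can be sketched as a schema-free triple store: extracted facts become (subject, relation, object) triples that can be queried by pattern. The triple format and query helper below are illustrative assumptions, not Docs2KG's actual data model.

```python
# Sketch of the document-to-knowledge-graph idea: extracted facts are
# stored as (subject, relation, object) triples, with no fixed schema,
# and can be queried by partial pattern. Illustrative only.

class TripleGraph:
    def __init__(self):
        self.triples = set()

    def add(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def query(self, subject=None, relation=None, obj=None):
        """Return triples matching every non-None field."""
        return [t for t in self.triples
                if (subject is None or t[0] == subject)
                and (relation is None or t[1] == relation)
                and (obj is None or t[2] == obj)]
```

Because no schema is declared up front, new document types simply contribute new relation names, which is the flexibility the abstract contrasts with pre-designed schemas.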
ABBYY: Fighting Financial Fraud With Artificial Intelligence
Of course, there is no shortage of data in financial services (structured, unstructured, transactional, account-level), but while this data brings benefits, in the hands of nefarious actors it also makes fraud more pervasive. Neil Murphy, Global VP at ABBYY, believes AI is the way forward to tackle the ever-rising cases of fraud. Among other benefits, Murphy explains that by using AI, financial organisations can reduce the manual steps required at the onboarding stage and process both structured and unstructured documents. In doing so, financial organisations gain a bird's-eye view and can filter out suspicious and fraudulent actors. Technological advancements, increased investment in security systems, and fraud-prevention initiatives have been widely adopted by the finance industry in an effort to curb scams and crises.
- Information Technology > Security & Privacy (1.00)
- Banking & Finance (1.00)
Codelitt and Box transform unstructured documents into actionable data
Codelitt uses technology and user-centric design to solve corporate problems with start-up speed and innovation. They focus on the build side, developing scalable solutions in areas such as web, mobile, AR/VR, AI/ML, robotics, and IoT for large enterprises, and offer a full stack of services spanning idea validation/ideation, design, and development. Codelitt partners with Box to create new opportunities in enterprise digital content management and has developed Ada, a custom application which utilizes machine learning and Box Skills to intelligently extract actionable data from documents. Manual processes for data retrieval and data entry are still prevalent in many large enterprises across a variety of industries. For most, a vast amount of data and information remains unharnessed because it lives inside unstructured documents.
Using Amazon Textract with Amazon Augmented AI for processing critical documents Amazon Web Services
Documents are a primary tool for record keeping, communication, collaboration, and transactions across many industries, including financial, medical, legal, and real estate. For example, millions of mortgage applications and hundreds of millions of tax forms are processed each year. Documents are often unstructured, which means the content's location or format may vary between two otherwise similar forms. Unstructured documents require time-consuming and complex processes to enable search and discovery, business process automation, and compliance control. When using machine learning (ML) to automate processing of these unstructured documents, you can now build in human reviews to aid in managing sensitive workflows that require human judgment.
- Banking & Finance (0.55)
- Retail > Online (0.40)
- Information Technology > Services (0.40)
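The human-review workflow described in the Amazon Textract piece hinges on confidence-based routing: extracted fields above a threshold pass through automatically, while low-confidence fields go to a human queue. The sketch below is an assumption about the shape of that logic; with Textract this check would run on each block's `Confidence` value before starting an Augmented AI (A2I) human loop.

```python
# Sketch of confidence-based routing for human-in-the-loop document
# processing. The {field: (value, confidence)} format and the threshold
# are illustrative assumptions, not the Textract/A2I API.

REVIEW_THRESHOLD = 0.90

def route_fields(extracted):
    """Split extracted fields into auto-accepted vs. needs-human-review."""
    accepted, needs_review = {}, {}
    for field, (value, conf) in extracted.items():
        target = accepted if conf >= REVIEW_THRESHOLD else needs_review
        target[field] = value
    return accepted, needs_review
```

Tuning the threshold trades automation rate against the risk of letting low-quality extractions into a sensitive workflow, which is exactly the judgment call the article says humans should stay in the loop for.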
Tag-Weighted Topic Model For Large-scale Semi-Structured Documents
Li, Shuangyin; Li, Jiefei; Huang, Guan; Tan, Ruiyang; Pan, Rong
To date, massive numbers of Semi-Structured Documents (SSDs) have accumulated during the evolution of the Internet. These SSDs contain both unstructured features (e.g., plain text) and metadata (e.g., tags). Most previous works focused on modeling the unstructured text, and recently, some other methods have been proposed to model the unstructured text with specific tags. Building a general model for SSDs remains an important problem in terms of both model fitness and efficiency. We propose a novel method to model SSDs via a so-called Tag-Weighted Topic Model (TWTM). TWTM is a framework that leverages both tag and word information, not only to learn the document-topic and topic-word distributions, but also to infer the tag-topic distributions for text mining tasks. We present an efficient variational inference method with an EM algorithm for estimating the model parameters. Meanwhile, we propose three large-scale solutions for our model under the MapReduce distributed computing platform for modeling large-scale SSDs. The experimental results show the effectiveness, efficiency, and robustness of our model compared with state-of-the-art methods in document modeling, tag prediction, and text classification. We also show the performance of the three distributed solutions in terms of time and accuracy on document modeling.
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- (2 more...)
- Research Report > Promising Solution (0.54)
- Research Report > New Finding (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.87)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
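The core intuition behind the tag-topic distributions in the TWTM abstract is that a document's tags can inform its topic mixture. A minimal sketch of that idea is shown below: a document's topic prior is computed as a tag-weighted mixture of per-tag topic distributions. This is illustrative only; the actual TWTM generative process and its variational EM updates are considerably more involved.

```python
# Illustrative computation of a document's topic prior as a tag-weighted
# mixture of per-tag topic distributions. A simplification of the idea
# behind TWTM, not its actual model.

def tag_weighted_topic_prior(tag_weights, tag_topics):
    """tag_weights: {tag: weight}; tag_topics: {tag: [p(topic | tag), ...]}.
    Returns the weight-normalized mixture over topics."""
    n_topics = len(next(iter(tag_topics.values())))
    mix = [0.0] * n_topics
    total = sum(tag_weights.values())
    for tag, w in tag_weights.items():
        for k, p in enumerate(tag_topics[tag]):
            mix[k] += (w / total) * p
    return mix
```

Because the mixture is normalized by total tag weight, heavily weighted tags dominate the prior, which is the "tag-weighted" aspect the model's name refers to.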