unstructured document
- Asia > Singapore (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- Asia > Indonesia > Bali (0.04)
- (5 more...)
- Media (0.46)
- Banking & Finance (0.46)
- Leisure & Entertainment (0.46)
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
Miao, Ziyang; Sun, Qiyu; Wang, Jingyuan; Gong, Yuchen; Zheng, Yaowei; Li, Shiqi; Zhang, Richong
Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents effectively. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to easily configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then leverages a persona-driven prompting approach to generate diverse question-answer pairs using publicly available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at https://github.com/ConardLi/easy-dataset and have garnered over 9,000 GitHub stars.
- Law > Taxation Law (0.46)
- Government > Tax (0.46)
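The persona-driven generation step described in the Easy Dataset abstract can be sketched as follows. This is an illustrative assumption, not the project's actual implementation: the persona list, prompt template, and `llm` callable are all hypothetical stand-ins.

```python
# Illustrative sketch of persona-driven QA-pair synthesis from text chunks.
# PERSONAS, PROMPT_TEMPLATE, and the `llm` callable are assumptions for
# illustration, not the actual Easy Dataset code.

PERSONAS = [
    "a retail-banking compliance officer",
    "a first-year finance student",
]

PROMPT_TEMPLATE = (
    "You are {persona}. Read the passage below and write one question "
    "it answers, then the answer.\n\nPassage:\n{chunk}"
)

def synthesize_qa(chunks, llm):
    """Ask the LLM for one QA pair per (persona, chunk) combination.

    Varying the persona for the same chunk is what drives diversity in
    the generated questions."""
    pairs = []
    for chunk in chunks:
        for persona in PERSONAS:
            prompt = PROMPT_TEMPLATE.format(persona=persona, chunk=chunk)
            pairs.append({"persona": persona, "chunk": chunk,
                          "raw_output": llm(prompt)})
    return pairs
```

Each record keeps the persona and source chunk alongside the raw model output, so a human-in-the-loop review pass can trace every QA pair back to its origin.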
ViBERTgrid BiLSTM-CRF: Multimodal Key Information Extraction from Unstructured Financial Documents
Pala, Furkan; Akpınar, Mehmet Yasin; Deniz, Onur; Eryiğit, Gülşen
Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. This paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid, previously explored on semi-structured documents) to unstructured financial documents by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in the financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > Switzerland > Geneva > Geneva (0.04)
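The CRF layer mentioned in the ViBERTgrid BiLSTM-CRF abstract decodes the most likely tag sequence from per-token emission scores plus learned tag-transition scores. A minimal pure-Python Viterbi decoder is sketched below with toy scores; in the actual model, emissions come from the BiLSTM and the transition matrix is learned during training.

```python
# Minimal Viterbi decoder: the inference step a CRF layer performs on top
# of BiLSTM emission scores. Toy values only; the real model learns both
# the emissions and the transition matrix.

def viterbi_decode(emissions, transitions):
    """emissions: [T][K] per-token tag scores; transitions: [K][K] score
    for moving from tag i to tag j. Returns the best tag index sequence."""
    n_tags = len(emissions[0])
    score = list(emissions[0])  # best path score ending in each tag
    back = []                   # backpointers, one row per later timestep
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_tags):
            cands = [score[i] + transitions[i][j] for i in range(n_tags)]
            best_i = max(range(n_tags), key=lambda i: cands[i])
            new_score.append(cands[best_i] + emissions[t][j])
            ptr.append(best_i)
        score, back = new_score, back + [ptr]
    # Trace back from the best final tag
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(back):
        best = ptr[best]
        path.append(best)
    return list(reversed(path))
```

The transition scores are what let a CRF enforce sequence-level constraints (e.g., an I- tag cannot follow O in BIO tagging), which plain per-token classification cannot.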
UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis
Hui, Yulong; Lu, Yao; Zhang, Huanchen
The use of Retrieval-Augmented Generation (RAG) has improved Large Language Models (LLMs) in collaborating with external data, yet significant challenges exist in real-world scenarios. In areas such as academic literature and finance question answering, data are often found in raw text and tables in HTML or PDF formats, which can be lengthy and highly unstructured. In this paper, we introduce a benchmark suite, namely Unstructured Document Analysis (UDA), that involves 2,965 real-world documents and 29,590 expert-annotated Q&A pairs. We revisit popular LLM- and RAG-based solutions for document analysis and evaluate the design choices and answer qualities across multiple document domains and diverse query types. Our evaluation yields interesting findings and highlights the importance of data parsing and retrieval. We hope our benchmark can shed light on and better serve real-world document analysis applications. The benchmark suite and code can be found at https://github.com/qinchuanhui/UDA-Benchmark.
- Asia > Singapore (0.05)
- South America (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (8 more...)
- Leisure & Entertainment (0.67)
- Banking & Finance (0.46)
- Media > Television (0.46)
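The retrieve-then-read pattern that the UDA benchmark evaluates can be illustrated with a deliberately simple retriever. The overlap scoring below is a toy assumption; production systems use TF-IDF, BM25, or dense embeddings, and UDA's finding is precisely that this retrieval (and parsing) choice matters.

```python
# Toy retriever illustrating the retrieve-then-read step of a RAG
# pipeline. Scoring is plain token overlap; real systems use TF-IDF,
# BM25, or dense embeddings.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, chunks, k=2):
    """Return the k chunks sharing the most tokens with the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)),
                    reverse=True)
    return ranked[:k]
```

The retrieved chunks would then be placed into the LLM prompt as context, so retrieval quality directly bounds answer quality.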
Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models
Sun, Qiang; Luo, Yuanyi; Zhang, Wenxiao; Li, Sirui; Li, Jichunyang; Niu, Kai; Kong, Xiangrui; Liu, Wei
Even by a conservative estimate, 80% of enterprise data reside in unstructured files, stored in data lakes that accommodate heterogeneous formats. Classical search engines can no longer meet information-seeking needs, especially when the task is to browse and explore for insight formulation. In other words, there are no obvious search keywords to use. Knowledge graphs, due to their natural visual appeal that reduces human cognitive load, become the winning candidate for heterogeneous data integration and knowledge representation. In this paper, we introduce Docs2KG, a novel framework designed to extract multimodal information from diverse and heterogeneous unstructured documents, including emails, web pages, PDF files, and Excel files. By dynamically generating a unified knowledge graph that represents the extracted key information, Docs2KG enables efficient querying and exploration of document data lakes. Unlike existing approaches that focus on domain-specific data sources or pre-designed schemas, Docs2KG offers a flexible and extensible solution that can adapt to various document structures and content types. The proposed framework unifies data processing, supporting a multitude of downstream tasks with improved domain interpretability. Docs2KG is publicly accessible at https://docs2kg.ai4wa.com, and a demonstration video is available at https://docs2kg.ai4wa.com/Video.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Oceania > Australia > Western Australia > Perth (0.08)
- Asia > China > Hong Kong (0.05)
- (4 more...)
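The document-to-knowledge-graph step in the Docs2KG abstract can be sketched as a schema-free triple store: extracted facts become (subject, relation, object) triples that can be queried by pattern. The triple format and query helper below are illustrative assumptions, not Docs2KG's actual data model.

```python
# Sketch of the document-to-knowledge-graph idea: extracted facts are
# stored as (subject, relation, object) triples, with no fixed schema,
# and can be queried by partial pattern. Illustrative only.

class TripleGraph:
    def __init__(self):
        self.triples = set()

    def add(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def query(self, subject=None, relation=None, obj=None):
        """Return triples matching every non-None field."""
        return [t for t in self.triples
                if (subject is None or t[0] == subject)
                and (relation is None or t[1] == relation)
                and (obj is None or t[2] == obj)]
```

Because no schema is declared up front, new document types simply contribute new relation names, which is the flexibility the abstract contrasts with pre-designed schemas.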
ABBYY: Fighting Financial Fraud With Artificial Intelligence
Of course, there is no shortage of data in financial services (structured, unstructured, transactional, account-level), but while this data brings benefits, in the hands of nefarious actors it also makes fraud more pervasive. Neil Murphy, Global VP at ABBYY, believes AI is the way forward to tackle the ever-rising cases of fraud. Among other benefits, Murphy explains that by using AI, financial organisations can reduce the manual steps required at the onboarding stage and process both structured and unstructured documents. In doing so, financial organisations gain a bird's-eye view and can filter out suspicious and fraudulent actors. Technological advancements, increased investment in security systems, and fraud-prevention initiatives have been widely adopted by the finance industry in an effort to curb scams and crises.
- Information Technology > Security & Privacy (1.00)
- Banking & Finance (1.00)
Codelitt and Box transform unstructured documents into actionable data
Codelitt uses technology and user-centric design to solve corporate problems with start-up speed and innovation. They focus on the build side, developing scalable solutions in areas such as web, mobile, AR/VR, AI/ML, robotics, and IoT for large enterprises, and offer a full stack of services spanning idea validation/ideation, design, and development. Codelitt partners with Box to create new opportunities in enterprise digital content management and has developed Ada, a custom application which utilizes machine learning and Box Skills to intelligently extract actionable data from documents. Manual processes for data retrieval and data entry are still prevalent in many large enterprises across a variety of industries. For most, a vast amount of data and information remains unharnessed because it lives inside unstructured documents.
Using Amazon Textract with Amazon Augmented AI for processing critical documents Amazon Web Services
Documents are a primary tool for record keeping, communication, collaboration, and transactions across many industries, including financial, medical, legal, and real estate. For example, millions of mortgage applications and hundreds of millions of tax forms are processed each year. Documents are often unstructured, which means the content's location or format may vary between two otherwise similar forms. Unstructured documents require time-consuming and complex processes to enable search and discovery, business process automation, and compliance control. When using machine learning (ML) to automate processing of these unstructured documents, you can now build in human reviews to aid in managing sensitive workflows that require human judgment.
- Banking & Finance (0.55)
- Retail > Online (0.40)
- Information Technology > Services (0.40)
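The human-review workflow described in the Amazon Textract piece hinges on confidence-based routing: extracted fields above a threshold pass through automatically, while low-confidence fields go to a human queue. The sketch below is an assumption about the shape of that logic; with Textract this check would run on each block's `Confidence` value before starting an Augmented AI (A2I) human loop.

```python
# Sketch of confidence-based routing for human-in-the-loop document
# processing. The {field: (value, confidence)} format and the threshold
# are illustrative assumptions, not the Textract/A2I API.

REVIEW_THRESHOLD = 0.90

def route_fields(extracted):
    """Split extracted fields into auto-accepted vs. needs-human-review."""
    accepted, needs_review = {}, {}
    for field, (value, conf) in extracted.items():
        target = accepted if conf >= REVIEW_THRESHOLD else needs_review
        target[field] = value
    return accepted, needs_review
```

Tuning the threshold trades automation rate against the risk of letting low-quality extractions into a sensitive workflow, which is exactly the judgment call the article says humans should stay in the loop for.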
Tag-Weighted Topic Model For Large-scale Semi-Structured Documents
Li, Shuangyin; Li, Jiefei; Huang, Guan; Tan, Ruiyang; Pan, Rong
To date, massive numbers of Semi-Structured Documents (SSDs) have accumulated during the evolution of the Internet. These SSDs contain both unstructured features (e.g., plain text) and metadata (e.g., tags). Most previous works focused on modeling the unstructured text, and recently, some other methods have been proposed to model the unstructured text with specific tags. Building a general model for SSDs remains an important problem in terms of both model fitness and efficiency. We propose a novel method to model SSDs via a so-called Tag-Weighted Topic Model (TWTM). TWTM is a framework that leverages both tag and word information, not only to learn the document-topic and topic-word distributions, but also to infer the tag-topic distributions for text mining tasks. We present an efficient variational inference method with an EM algorithm for estimating the model parameters. Meanwhile, we propose three large-scale solutions for our model under the MapReduce distributed computing platform for modeling large-scale SSDs. The experimental results show the effectiveness, efficiency, and robustness of our model compared with state-of-the-art methods in document modeling, tag prediction, and text classification. We also show the performance of the three distributed solutions in terms of time and accuracy on document modeling.
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- (2 more...)
- Research Report > Promising Solution (0.54)
- Research Report > New Finding (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.87)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
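The core intuition behind the tag-topic distributions in the TWTM abstract is that a document's tags can inform its topic mixture. A minimal sketch of that idea is shown below: a document's topic prior is computed as a tag-weighted mixture of per-tag topic distributions. This is illustrative only; the actual TWTM generative process and its variational EM updates are considerably more involved.

```python
# Illustrative computation of a document's topic prior as a tag-weighted
# mixture of per-tag topic distributions. A simplification of the idea
# behind TWTM, not its actual model.

def tag_weighted_topic_prior(tag_weights, tag_topics):
    """tag_weights: {tag: weight}; tag_topics: {tag: [p(topic | tag), ...]}.
    Returns the weight-normalized mixture over topics."""
    n_topics = len(next(iter(tag_topics.values())))
    mix = [0.0] * n_topics
    total = sum(tag_weights.values())
    for tag, w in tag_weights.items():
        for k, p in enumerate(tag_topics[tag]):
            mix[k] += (w / total) * p
    return mix
```

Because the mixture is normalized by total tag weight, heavily weighted tags dominate the prior, which is the "tag-weighted" aspect the model's name refers to.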