pdf file
Information Extraction From Fiscal Documents Using LLMs
Aggarwal, Vikram, Kulkarni, Jay, Mascarenhas, Aditi, Narang, Aakriti, Raman, Siddarth, Shah, Ajay, Thomas, Susan
Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data from multi-page government fiscal documents using LLM-based techniques. Applied to annual fiscal documents from the State of Karnataka in India (200+ pages), our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. A large challenge with traditional OCR methods is the inability to verify the accurate extraction of numbers. When applied to fiscal data, the inherent structure of fiscal tables, with totals at each level of the hierarchy, allows for robust internal validation of the extracted data. We use these hierarchical relationships to create multi-level validation checks. We demonstrate that LLMs can read tables and also process document-specific structural hierarchies, offering a scalable process for converting PDF-based fiscal disclosures into research-ready databases. Our implementation shows promise for broader applications across developing country contexts.
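The validation idea in this abstract, checking that each stated total matches the sum of the items beneath it, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Node` structure, the `validate_totals` function, the tolerance value, and the sample figures are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One row of a hierarchical fiscal table (hypothetical structure)."""
    label: str
    amount: float
    children: list["Node"] = field(default_factory=list)


def validate_totals(node: Node, tolerance: float = 0.5) -> list[str]:
    """Return labels of nodes whose stated amount differs from the sum of their children."""
    failures = []
    if node.children:
        child_sum = sum(child.amount for child in node.children)
        # A mismatch suggests a digit was misread somewhere at this level.
        if abs(child_sum - node.amount) > tolerance:
            failures.append(node.label)
        # Repeat the check at every lower level of the hierarchy.
        for child in node.children:
            failures.extend(validate_totals(child, tolerance))
    return failures


# Made-up figures: "Salaries" is consistent, "Department" is not (100 + 40 != 150).
department = Node("Department", 150.0, [
    Node("Salaries", 100.0, [Node("Pay", 70.0), Node("Allowances", 30.0)]),
    Node("Grants", 40.0),
])
failed_labels = validate_totals(department)
```

A failed check only localizes the problem to one level of the table; deciding which number was misread would still need a second extraction pass or manual review.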
- Asia > India > Karnataka (0.27)
- Asia > Singapore (0.07)
- Asia > India > Maharashtra > Mumbai (0.05)
- Research Report (1.00)
- Overview > Innovation (0.34)
- Transportation > Passenger (1.00)
- Materials > Metals & Mining (1.00)
- Law > Environmental Law (1.00)
Meet Your New Client: Writing Reports for AI -- Benchmarking Information Loss in Market Research Deliverables
Simmering, Paul F., Schulz, Benedikt, Tabino, Oliver, Wittenburg, Georg
As organizations adopt retrieval-augmented generation (RAG) for their knowledge management systems (KMS), traditional market research deliverables face new functional demands. While PDF reports and slides have long served human readers, they are now also "read" by AI systems to answer user questions. To future-proof reports being delivered today, this study evaluates information loss during their ingestion into RAG systems. It compares how well PDF and PowerPoint (PPTX) documents converted to Markdown can be used by an LLM to answer factual questions in an end-to-end benchmark. Findings show that while text is reliably extracted, significant information is lost from complex objects like charts and diagrams. This suggests a need for specialized, AI-native deliverables to ensure research insights are not lost in translation.
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.40)
- North America > United States > New York > New York County > New York City (0.04)
- Transportation > Passenger (1.00)
- Materials > Metals & Mining (1.00)
- Industrial Conglomerates (1.00)
NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering
Cao, Ruisheng, Zhang, Hanchong, Huang, Tiancheng, Kang, Zhangyi, Zhang, Yuxin, Sun, Liangtai, Li, Hanqi, Miao, Yuxun, Fan, Shuai, Chen, Lu, Yu, Kai
The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both a relational database and a vectorstore, enabling LLM agents to iteratively gather context until it is sufficient to generate answers. Experiments on three full PDF-based QA datasets, including the self-annotated AIRQA-REAL, show that NeuSym-RAG consistently outperforms both the vector-based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views. Code and data are publicly available at https://github.com/X-LANCE/NeuSym-RAG.
- Europe > Austria > Vienna (0.14)
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Texas > Taylor County (0.04)
Adobe Acrobat Pro review: Still the gold standard
Acrobat Pro's comprehensive PDF features show why it's still the editor against which all others are judged. Editor's note: This review was updated December 9, 2024 to reflect the addition of AI Assistant and current pricing. Adobe created the PDF three decades ago and its PDF editor has continued to rule the category, despite what many users felt was its exorbitant price. But a couple of years back, Acrobat adopted a cloud subscription model that now makes it more affordable for folks without an enterprise budget. Acrobat Pro is composed of three components: Acrobat, which allows you to perform a variety of editing functions on your PDFs on desktop and mobile devices; Adobe Document Cloud, which lets you create and export PDF files, as well as store and send files and collect electronic signatures; and Acrobat Reader, which enables you to read, print, and sign PDFs.
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Communications > Mobile (0.35)
Polish Medical Exams: A new dataset for cross-lingual medical knowledge transfer assessment
Grzybowski, Łukasz, Pokrywka, Jakub, Ciesiółka, Michał, Kaczmarek, Jeremi I., Kubis, Marek
Large Language Models (LLMs) have demonstrated significant potential in handling specialized tasks, including medical problem-solving. However, most studies predominantly focus on English-language contexts. This study introduces a novel benchmark dataset based on Polish medical licensing and specialization exams (LEK, LDEK, PES) taken by medical doctor candidates and practicing doctors pursuing specialization. The dataset was web-scraped from publicly available resources provided by the Medical Examination Center and the Chief Medical Chamber. It comprises over 24,000 exam questions, including a subset of parallel Polish-English corpora, where the English portion was professionally translated by the examination center for foreign candidates. By creating a structured benchmark from these existing exam questions, we systematically evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against human medical students. Our analysis reveals that while models like GPT-4o achieve near-human performance, significant challenges persist in cross-lingual translation and domain-specific understanding. These findings underscore disparities in model performance across languages and medical specialties, highlighting the limitations and ethical considerations of deploying LLMs in clinical practice.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Poland > Greater Poland Province > Poznań (0.04)
PDFs are now even easier to work with thanks to the new AI features in PDFelement
PDFelement is already a well-established name when it comes to working with PDFs, thanks to its impressive range of features and affordable price. But the developers at Wondershare haven't rested on their laurels, as a new upgrade brings a host of AI tools and enhancements that will make it even easier to edit, annotate, extract information and share the results. If you regularly deal with PDF files, the updated PDFelement version 11 release could be about to make your life a whole lot simpler. The AI revolution is well underway, and the updated PDFelement brings AI-powered tools that are focussed on improving how users interact with PDF files. With these new abilities you can get work done in the least amount of time and with a minimum of fuss.
Reviews: Dimensionality Reduction has Quantifiable Imperfections: Two Geometric Bounds
This paper investigates Dimensionality Reduction (DR) maps in an information retrieval setting. In particular, the authors show that no DR map can attain both perfect precision and perfect recall. Further, they derive theoretical bounds for the precision and the Wasserstein distance of a continuous DR map. They also run simulations in various settings. Quality: The authors establish theoretical equivalences of precision and recall (Proposition 1) and show that a perfect map does not exist (Theorem 1).
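To make the precision/recall framing concrete, here is a small sketch that treats the neighbors of a query point in the original space as "relevant" and its neighbors in the reduced space as "retrieved". The radius-based neighborhoods, the `precision_recall` function, and the toy points are assumptions made for illustration; the paper's exact definitions may differ.

```python
import math


def precision_recall(high, low, query, r_high, r_low):
    """Precision and recall of low-dimensional neighbors against high-dimensional ones."""
    # Relevant: points within r_high of the query in the original space.
    relevant = {i for i, p in enumerate(high)
                if i != query and math.dist(p, high[query]) <= r_high}
    # Retrieved: points within r_low of the query after the DR map.
    retrieved = {i for i, p in enumerate(low)
                 if i != query and math.dist(p, low[query]) <= r_low}
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Toy DR map: project 2-D points onto the x-axis.
high = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 3.0), (3.0, 0.0)]
low = [(x,) for x, _ in high]
precision, recall = precision_recall(high, low, query=0, r_high=1.5, r_low=1.5)
```

In this toy case the projection collapses the distant point (0, 3) onto the query, so recall is perfect while precision drops to 2/3.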
KaPQA: Knowledge-Augmented Product Question-Answering
Eppalapally, Swetha, Dangi, Daksh, Bhat, Chaithra, Gupta, Ankita, Zhang, Ruiyi, Agarwal, Shubham, Bagga, Karishma, Yoon, Seunghyun, Lipka, Nedim, Rossi, Ryan A., Dernoncourt, Franck
Question-answering for domain-specific applications has recently attracted much interest due to the latest advancements in large language models (LLMs). However, accurately assessing the performance of these applications remains a challenge, mainly due to the lack of suitable benchmarks that effectively simulate real-world scenarios. To address this challenge, we introduce two product question-answering (QA) datasets focused on Adobe Acrobat and Photoshop products to help evaluate the performance of existing models on domain-specific product QA tasks. Additionally, we propose a novel knowledge-driven RAG-QA framework to enhance the performance of the models in the product QA task. Our experiments demonstrated that inducing domain knowledge through query reformulation allowed for increased retrieval and generative performance when compared to standard RAG-QA methods. This improvement, however, is slight, and thus illustrates the challenge posed by the datasets introduced.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- North America > Dominican Republic (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Information Technology (0.46)
- Banking & Finance (0.34)