AITopics

Country: Europe > Russia (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)
Materials > Chemicals > Commodity Chemicals > Petrochemicals (0.68)
Energy (0.67)

Technology:

Information Technology > Data Science > Data Mining > Text Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing SystemsJun-13-2026, 13:13:29 GMT

ChemX: A Collection of Chemistry Datasets for Benchmarking Automated Information Extraction

information extraction, large language model, natural language, (8 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.81)

Neural Information Processing SystemsFeb-11-2026, 14:47:53 GMT

3d7025dc9bd4c8b6fb1eef80cc618008-Paper-Datasets_and_Benchmarks_Track.pdf

computational linguistic, large language model, machine learning, (18 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
Asia > Singapore (0.05)
Asia > Indonesia > Bali (0.04)
(19 more...)

Genre:

Research Report > New Finding (0.93)
Overview (0.68)

Industry:

Banking & Finance > Trading (0.47)
Education > Curriculum > Subject-Specific Education (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Neural Information Processing SystemsFeb-9-2026, 10:33:44 GMT

63943ee9fe347f3d95892cf87d9a42e6-Paper-Conference.pdf

computational linguistic, extraction, proceedings, (13 more...)

Country:

Asia > Singapore (0.04)
Asia > China > Hubei Province > Wuhan (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

arXiv.org Artificial IntelligenceDec-12-2025

HunyuanOCR Technical Report

Hunyuan Vision Team, null, Lyu, Pengyuan, Wan, Xingyu, Li, Gengluo, Peng, Shangpin, Wang, Weinong, Wu, Liang, Shen, Huawen, Zhou, Yu, Tang, Canhui, Yang, Qi, Peng, Qiming, Luo, Bin, Yang, Hower, Zhang, Xinsong, Zhang, Jinnian, Peng, Houwen, Yang, Hongming, Xie, Senhao, Zhou, Longsha, Pei, Ge, Wu, Binghong, Yan, Rui, Wu, Kan, Yang, Jieneng, Wang, Bochao, Liu, Kai, Zhu, Jianchen, Jiang, Jie, Linus, null, Hu, Han, Zhang, Chengquan

This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.

large language model, machine learning, natural language, (18 more...)

2511.19575

Country: Asia (0.28)

Genre:

Workflow (1.00)
Research Report (1.00)

Industry:

Education (1.00)
Media (0.67)
Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Hemmer, Arthur, Coustaty, Mickaël, Bartolo, Nicola, Ogier, Jean-Marc

Neurosymbolic Information Extraction from Transactional Documents

arXiv.org Artificial IntelligenceDec-11-2025

This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in $F_1$-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.

constraint, large language model, machine learning, (17 more...)

doi: 10.1007/s10032-025-00530-0

2512.09666

Country:

Europe (0.28)
North America > United States (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Data Science > Data Mining > Text Mining (0.72)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Nonesung, Surapon, Jaknamon, Teetouch, Chaiophat, Sirinya, Nitarach, Natapong, Wittayasakpan, Chanakan, Sirichotedumrong, Warit, Na-Thalang, Adisai, Pipatanakul, Kunat

ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai

arXiv.org Artificial IntelligenceDec-5-2025

We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.

large language model, machine learning, natural language, (22 more...)

2511.04479

Country: North America > United States (0.93)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

arXiv.org Artificial IntelligenceNov-25-2025

Information Extraction From Fiscal Documents Using LLMs

Aggarwal, Vikram, Kulkarni, Jay, Mascarenhas, Aditi, Narang, Aakriti, Raman, Siddarth, Shah, Ajay, Thomas, Susan

Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data from multi-page government fiscal documents using LLM-based techniques. Applied to annual fiscal documents from the State of Karnataka in India (200+ pages), our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. A large challenge with traditional OCR methods is the inability to verify the accurate extraction of numbers. When applied to fiscal data, the inherent structure of fiscal tables, with totals at each level of the hierarchy, allows for robust internal validation of the extracted data. We use these hierarchical relationships to create multi-level validation checks. We demonstrate that LLMs can read tables and also process document-specific structural hierarchies, offering a scalable process for converting PDF-based fiscal disclosures into research-ready databases. Our implementation shows promise for broader applications across developing country contexts.

information, large language model, natural language, (17 more...)

2511.10659

Country: Asia > India > Karnataka (0.27)

Genre:

Research Report (1.00)
Overview > Innovation (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial IntelligenceNov-24-2025

Supervised Fine Tuning of Large Language Models for Domain Specific Knowledge Graph Construction:A Case Study on Hunan's Historical Celebrities

Hao, Junjie, Wang, Chun, Qiao, Ying, Zuo, Qiuyue, Song, Qiya, Ma, Hua, Gao, Xieping

Large language models and knowledge graphs hold broad application potential in the field of historical culture, facilitating the excavation, research, and comprehension of cultural heritage. Taking Hunan's historical celebrities emerging from modern Huxiang culture as a case, pre-trained large models can assist researchers in rapidly extracting specific historical figure information from literature--including basic details, life events, and social relationships--and constructing structured knowledge graphs, thereby supporting related research. Currently, systematic data collection on Hunan's historical celebrities remains scarce. Moreover, general-purpose large language models often exhibit insufficient domain knowledge extraction accuracy and weak structured output capabilities in such low-resource scenarios. Therefore, this paper proposes a supervised fine-tuning approach for domain-specific large models to enhance the quality and efficiency of information extraction regarding Hunan's historical celebrities. Specifically, this paper first designs a fine-grained schema-guided instruction fine-tuning template for the Hunan's historical celebrities domain. Using this template, we construct an instruction fine-tuning dataset, addressing the current lack of instruction datasets in domain-specific model fine-tuning. Second,we conducted parameter-efficient instruction fine-tuning on four publicly available large language models--Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct--using the proposed instruction dataset, and established evaluation criteria for assessing their performance in character information extraction. Experimental results demonstrate that the performance of all four base models significantly improved after domain-specific fine-tuning. Among them, Qwen3-8B achieved the best performance after training with 100 samples and 50 fine-tuning iterations, scoring 89.3866 on the evaluation metrics. This research offers new insights for fine-tuning vertical large models tailored to regional historical and cultural domains, holding significant implications for promoting the cost-effective application of large models and knowledge graphs in the field of historical and cultural heritage. Introduction With the rapid advancement of large language models (LLMs), unprecedented opportunities have emerged for the in-depth exploration, systematic research, and widespread dissemination of Huxiang culture. Simultaneously, this presents new challenges for the digital transformation of traditional cultural resources[1].

large language model, machine learning, natural language, (17 more...)

2511.17012

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.87)

Industry:

Health & Medicine (0.46)
Government (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Tworek, Paulina, Bargieł, Miłosz, Khan, Yousef, Pełech-Pilichowski, Tomasz, Mikołajczyk, Marek, Lewandowski, Roman, Sousa, Jose

Balancing Natural Language Processing Accuracy and Normalisation in Extracting Medical Insights

arXiv.org Artificial IntelligenceNov-21-2025

Extracting structured medical insights from unstructured clinical text using Natural Language Processing (NLP) remains an open challenge in healthcare, particularly in non-English contexts where resources are scarce. This study presents a comparative analysis of NLP low-compute rule-based methods and Large Language Models (LLMs) for information extraction from electronic health records (EHR) obtained from the Voivodeship Rehabilitation Hospital for Children in Ameryka, Poland. We evaluate both approaches by extracting patient demographics, clinical findings, and prescribed medications while examining the effects of lack of text normalisation and translation-induced information loss. Results demonstrate that rule-based methods provide higher accuracy in information retrieval tasks, particularly for age and sex extraction. However, LLMs offer greater adaptability and scalability, excelling in drug name recognition. The effectiveness of the LLMs was compared with texts originally in Polish and those translated into English, assessing the impact of translation. These findings highlight the trade-offs between accuracy, normalisation, and computational cost when deploying NLP in healthcare settings. We argue for hybrid approaches that combine the precision of rule-based systems with the adaptability of LLMs, offering a practical path toward more reliable and resource-efficient clinical NLP in real-world hospitals.

information, large language model, machine learning, (20 more...)

2511.15778

Country: Europe > Poland (0.50)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.87)
Health & Medicine > Health Care Providers & Services (0.69)
Health & Medicine > Diagnostic Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)