extraction task
ClinBench: A Standardized Multi-Domain Framework for Evaluating Large Language Models in Clinical Information Extraction
Large Language Models (LLMs) offer substantial promise for clinical natural language processing (NLP); however, a lack of standardized benchmarking methodologies limits their objective evaluation and practical translation. To address this gap, we introduce ClinBench, an open-source, multi-model, multi-domain benchmarking framework. ClinBench is designed for the rigorous evaluation of LLMs on important structured information extraction tasks (e.g., tumor staging, histologic diagnoses, atrial fibrillation, and social determinants of health) from unstructured clinical notes. The framework standardizes the evaluation pipeline by: (i) operating on consistently structured input datasets; (ii) employing dynamic, YAML-based prompting for uniform task definition; and (iii) enforcing output validation via JSON schemas, supporting robust comparison across diverse LLM architectures. We demonstrate ClinBench through a large-scale study of 11 prominent LLMs (e.g., GPT-4o series, LLaMA3 variants, Mixtral) across three clinical domains using configurations of public datasets (TCGA for lung cancer, MIMIC-IV-ECG for atrial fibrillation, and MIMIC notes for SDOH). Our results reveal significant performance-efficiency trade-offs. For example, when averaged across the four benchmarked clinical extraction tasks, GPT-3.5-turbo
Supervised Fine Tuning of Large Language Models for Domain Specific Knowledge Graph Construction:A Case Study on Hunan's Historical Celebrities
Hao, Junjie, Wang, Chun, Qiao, Ying, Zuo, Qiuyue, Song, Qiya, Ma, Hua, Gao, Xieping
Large language models and knowledge graphs hold broad application potential in the field of historical culture, facilitating the excavation, research, and comprehension of cultural heritage. Taking Hunan's historical celebrities emerging from modern Huxiang culture as a case, pre-trained large models can assist researchers in rapidly extracting specific historical figure information from literature--including basic details, life events, and social relationships--and constructing structured knowledge graphs, thereby supporting related research. Currently, systematic data collection on Hunan's historical celebrities remains scarce. Moreover, general-purpose large language models often exhibit insufficient domain knowledge extraction accuracy and weak structured output capabilities in such low-resource scenarios. Therefore, this paper proposes a supervised fine-tuning approach for domain-specific large models to enhance the quality and efficiency of information extraction regarding Hunan's historical celebrities. Specifically, this paper first designs a fine-grained schema-guided instruction fine-tuning template for the Hunan's historical celebrities domain. Using this template, we construct an instruction fine-tuning dataset, addressing the current lack of instruction datasets in domain-specific model fine-tuning. Second,we conducted parameter-efficient instruction fine-tuning on four publicly available large language models--Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct--using the proposed instruction dataset, and established evaluation criteria for assessing their performance in character information extraction. Experimental results demonstrate that the performance of all four base models significantly improved after domain-specific fine-tuning. Among them, Qwen3-8B achieved the best performance after training with 100 samples and 50 fine-tuning iterations, scoring 89.3866 on the evaluation metrics. This research offers new insights for fine-tuning vertical large models tailored to regional historical and cultural domains, holding significant implications for promoting the cost-effective application of large models and knowledge graphs in the field of historical and cultural heritage. Introduction With the rapid advancement of large language models (LLMs), unprecedented opportunities have emerged for the in-depth exploration, systematic research, and widespread dissemination of Huxiang culture. Simultaneously, this presents new challenges for the digital transformation of traditional cultural resources[1].
Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages
Spaanderman, Douwe J., Prathaban, Karthik, Zelina, Petr, Mouheb, Kaouther, Hejtmรกnek, Lukรกลก, Marzetti, Matthew, Schurink, Antonius W., Chan, Damian, Niemantsverdriet, Ruben, Hartmann, Frederik, Qian, Zhen, Thomeer, Maarten G. J., Holub, Petr, Akram, Farhan, Wolters, Frank J., Vernooij, Meike W., Verhoef, Cornelis, Bron, Esther E., Novรกฤek, Vรญt, Grรผnhagen, Dirk J., Niessen, Wiro J., Starmans, Martijn P. A., Klein, Stefan
Large language models (LLMs) are increasingly used to extract structured information from free-text clinical records, but prior work often focuses on single tasks, limited models, and English-language reports. We evaluated 15 open-weight LLMs on pathology and radiology reports across six use cases, colorectal liver metastases, liver tumours, neurodegenerative diseases, soft-tissue tumours, melanomas, and sarcomas, at three institutes in the Netherlands, UK, and Czech Republic. Models included general-purpose and medical-specialised LLMs of various sizes, and six prompting strategies were compared: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance was assessed using task-appropriate metrics, with consensus rank aggregation and linear mixed-effects models quantifying variance. Top-ranked models achieved macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models performed comparably to large models, while tiny and specialised models performed worse. Prompt graph and few-shot prompting improved performance by ~13%. Task-specific factors, including variable complexity and annotation variability, influenced results more than model size or prompting strategy. These findings show that open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach for clinical data curation.
Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation
Wu, Zonghan, Zou, Congyuan, Wang, Junlin, Wang, Chenhan, Yang, Hangjing, Shao, Yilei
Generative AI, particularly large language models (LLMs), is beginning to transform the financial industry by automating tasks and helping to make sense of complex financial information. One especially promising use case is the automatic creation of fundamental analysis reports, which are essential for making informed investment decisions, evaluating credit risks, guiding corporate mergers, etc. While LLMs attempt to generate these reports from a single prompt, the risks of inaccuracy are significant. Poor analysis can lead to misguided investments, regulatory issues, and loss of trust. Existing financial benchmarks mainly evaluate how well LLMs answer financial questions but do not reflect performance in real-world tasks like generating financial analysis reports. In this paper, we propose FinAR-Bench, a solid benchmark dataset focusing on financial statement analysis, a core competence of fundamental analysis. To make the evaluation more precise and reliable, we break this task into three measurable steps: extracting key information, calculating financial indicators, and applying logical reasoning. This structured approach allows us to objectively assess how well LLMs perform each step of the process. Our findings offer a clear understanding of LLMs current strengths and limitations in fundamental analysis and provide a more practical way to benchmark their performance in real-world financial settings.
Metadata Extraction Leveraging Large Language Models
The advent of Large Language Models has revolutionized tasks across domains, including the automation of legal document analysis, a critical component of modern contract management systems. This paper presents a comprehensive implementation of LLM-enhanced metadata extraction for contract review, focusing on the automatic detection and annotation of salient legal clauses. Leveraging both the publicly available Contract Understanding Atticus Dataset (CUAD) and proprietary contract datasets, our work demonstrates the integration of advanced LLM methodologies with practical applications. We identify three pivotal elements for optimizing metadata extraction: robust text conversion, strategic chunk selection, and advanced LLM-specific techniques, including Chain of Thought (CoT) prompting and structured tool calling. The results from our experiments highlight the substantial improvements in clause identification accuracy and efficiency. Our approach shows promise in reducing the time and cost associated with contract review while maintaining high accuracy in legal clause identification. The results suggest that carefully optimized LLM systems could serve as valuable tools for legal professionals, potentially increasing access to efficient contract review services for organizations of all sizes.
PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction
Shrimal, Anubhav, Jain, Aryan, Chowdhury, Soumyajit, Yenigalla, Promod
Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constraint decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas themselves are a form of natural language understanding contract that encodes rules, relationships, and expectations about data structure contracts that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection-based extraction with combined static and LLM-based guardrails. We evaluate PARSE qualitatively and quantitatively on three datasets including Schema-Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and and maintaining practical latency.
Benchmarking Agentic Systems in Automated Scientific Information Extraction with ChemX
Vepreva, Anastasia, Razlivina, Julia, Eremeeva, Maria, Gubina, Nina, Orlova, Anastasia, Dmitrenko, Aleksei, Kapranova, Ksenya, Jyakhwo, Susan, Vasilev, Nikita, Sarkisyan, Arsen, Chernyshov, Ivan Yu., Vinogradov, Vladimir, Dmitrenko, Andrei
The emergence of agent-based systems represents a significant advancement in artificial intelligence, with growing applications in automated data extraction. However, chemical information extraction remains a formidable challenge due to the inherent heterogeneity of chemical data. Current agent-based approaches, both general-purpose and domain-specific, exhibit limited performance in this domain. To address this gap, we present ChemX, a comprehensive collection of 10 manually curated and domain-expert-validated datasets focusing on nanomaterials and small molecules. These datasets are designed to rigorously evaluate and enhance automated extraction methodologies in chemistry. To demonstrate their utility, we conduct an extensive benchmarking study comparing existing state-of-the-art agentic systems such as ChatGPT Agent and chemical-specific data extraction agents. Additionally, we introduce our own single-agent approach that enables precise control over document preprocessing prior to extraction. We further evaluate the performance of modern baselines, such as GPT-5 and GPT-5 Thinking, to compare their capabilities with agentic approaches. Our empirical findings reveal persistent challenges in chemical information extraction, particularly in processing domain-specific terminology, complex tabular and schematic representations, and context-dependent ambiguities. The ChemX benchmark serves as a critical resource for advancing automated information extraction in chemistry, challenging the generalization capabilities of existing methods, and providing valuable insights into effective evaluation strategies.
Extract-0: A Specialized Language Model for Document Information Extraction
This paper presents Extract-0, a 7-billion parameter language model specifically optimized for document information extraction that achieves performance exceeding models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameterefficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resource.