Benchmarking Foundation Models with Language-Model-as-an-Examiner
Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets; however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility, as various LMs can be adopted as the examiner and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise three strategies: (1) We instruct the LM examiner to generate questions across a multitude of domains to probe for broad knowledge acquisition, and to raise follow-up questions to engage in a more in-depth assessment.
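The examiner protocol described in this abstract lends itself to a simple dialogue loop. Below is a minimal sketch of such a loop, assuming a hypothetical chat(model, messages) helper that wraps whatever LLM endpoint is available; the prompts and the 1-10 scoring rubric are illustrative, not the paper's exact setup.

```python
# Minimal sketch of an LM-as-an-Examiner loop (illustrative, not the paper's code).
# `chat` is a hypothetical wrapper; plug in any actual LLM client.

def chat(model: str, messages: list[dict]) -> str:
    """Hypothetical helper around an LLM endpoint."""
    raise NotImplementedError("connect an actual LLM client here")

def examine(examiner: str, candidate: str, topic: str, n_followups: int = 1) -> float:
    # 1) Examiner formulates an open-ended question from its own knowledge.
    question = chat(examiner, [{"role": "user",
        "content": f"Ask one open-ended question that tests knowledge of: {topic}"}])

    history = [{"role": "user", "content": question}]
    for _ in range(n_followups + 1):
        # 2) Candidate model answers the current (follow-up) question.
        answer = chat(candidate, history)
        history.append({"role": "assistant", "content": answer})
        # 3) Examiner raises a follow-up that probes the answer more deeply.
        followup = chat(examiner, [{"role": "user",
            "content": f"Given this answer, ask a deeper follow-up question:\n{answer}"}])
        history.append({"role": "user", "content": followup})

    # 4) Reference-free grading: the examiner scores the whole exchange.
    transcript = "\n".join(m["content"] for m in history)
    verdict = chat(examiner, [{"role": "user",
        "content": f"Rate the answers in this exam transcript from 1 to 10. Reply with the number only.\n{transcript}"}])
    return float(verdict.strip().split()[0])
```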
PANORAMA: A Dataset and Benchmarks Capturing Decision Trails and Rationales in Patent Examination
Lim, Hyunseung, Nam, Sooyohn, Na, Sungmin, Cho, Ji Yong, Yang, June Yong, Shin, Hyungyu, Lee, Yoonjoo, Kim, Juho, Lee, Moontae, Hong, Hwajung
Patent examination remains an ongoing challenge in the NLP literature even after the advent of large language models (LLMs), as it requires extensive yet nuanced human judgment on whether a submitted claim meets the statutory standards of novelty and non-obviousness against previously granted claims -- prior art -- in expert domains. Previous NLP studies have approached this challenge as a prediction task (e.g., forecasting grant outcomes) with high-level proxies such as similarity metrics or classifiers trained on historical labels. However, this approach often overlooks the step-by-step evaluations that examiners must make with detailed information, including the rationales for decisions provided in office action documents, which also makes it harder to measure the current state of techniques in patent review processes. To fill this gap, we construct PANORAMA, a dataset of 8,143 U.S. patent examination records that preserves the full decision trails, including original applications, all cited references, Non-Final Rejections, and Notices of Allowance. PANORAMA also decomposes these trails into sequential benchmarks that emulate patent professionals' review processes and allow researchers to examine large language models' capabilities at each step. Our findings indicate that, although LLMs are relatively effective at retrieving relevant prior art and pinpointing the pertinent paragraphs, they struggle to assess the novelty and non-obviousness of patent claims. We discuss these results and argue that advancing NLP, including LLMs, in the patent domain requires a deeper understanding of real-world patent examination. Our dataset is openly available at https://huggingface.co/datasets/LG-AI-Research/PANORAMA.
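Since the abstract gives the Hugging Face dataset ID, a minimal loading sketch follows; the split names and record fields are assumptions that should be checked against the dataset card on the Hub.

```python
# Sketch: inspect PANORAMA from the Hugging Face Hub.
# The dataset ID comes from the abstract; splits/columns below are assumptions.
from datasets import load_dataset

panorama = load_dataset("LG-AI-Research/PANORAMA")
print(panorama)  # list the available splits and their sizes

first_split = next(iter(panorama))          # whatever split the dataset card defines
record = panorama[first_split][0]           # one examination record
print(record.keys())                        # e.g., application text, cited references,
                                            # office actions (field names assumed)
```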
Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models
Chiu, Christopher, Pitis, Silviu, van der Schaar, Mihaela
Clinical reasoning in medicine is a hypothesis-driven process where physicians refine diagnoses from limited information through targeted history, physical examination, and diagnostic investigations. In contrast, current medical benchmarks for large language models (LLMs) primarily assess knowledge recall through single-turn questions, where complete clinical information is provided upfront. To address this gap, we introduce VivaBench, a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents. Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training, requiring agents to actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. While current LLMs demonstrate competence in diagnosing conditions from well-described clinical presentations, their performance degrades significantly when required to navigate iterative diagnostic reasoning under uncertainty in our evaluation. Our analysis identifies several failure modes that mirror common cognitive errors in clinical practice, including: (1) fixation on initial hypotheses, (2) inappropriate investigation ordering, (3) premature diagnostic closure, and (4) failure to screen for critical conditions. These patterns reveal fundamental limitations in how current LLMs reason and make decisions under uncertainty. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems for real-world clinical decision support. Beyond medical applications, we contribute to the larger corpus of research on agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex decision-making environments.
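To make the interactive setup concrete, here is an illustrative sketch of a viva-style evaluation loop. Vignette and ask_agent are hypothetical stand-ins rather than the VivaBench API, and the ASK/ORDER/DIAGNOSE action vocabulary is a simplification of the probing, investigation, and diagnosis steps the abstract describes.

```python
# Illustrative multi-turn diagnostic loop, under stated assumptions
# (not the VivaBench implementation).
from dataclasses import dataclass

@dataclass
class Vignette:
    presenting_complaint: str
    findings: dict[str, str]        # history/exam findings, revealed only when probed
    investigations: dict[str, str]  # results, returned only when ordered
    diagnosis: str

def ask_agent(transcript: list[str]) -> str:
    """Hypothetical agent call returning 'ASK <finding>', 'ORDER <test>', or 'DIAGNOSE <dx>'."""
    raise NotImplementedError("connect an LLM agent here")

def run_case(case: Vignette, max_turns: int = 10) -> bool:
    transcript = [f"Patient presents with: {case.presenting_complaint}"]
    for _ in range(max_turns):
        action = ask_agent(transcript)
        verb, _, arg = action.partition(" ")
        if verb == "ASK":
            # Information is revealed only when the agent actively probes for it.
            transcript.append(case.findings.get(arg, "No relevant finding."))
        elif verb == "ORDER":
            transcript.append(case.investigations.get(arg, "Result unremarkable."))
        elif verb == "DIAGNOSE":
            return arg.strip().lower() == case.diagnosis.lower()
    return False  # ran out of turns without committing to a diagnosis
```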