AITopics

Neural Information Processing Systems, Jun-14-2026, 06:41:30 GMT

Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees

Selecting artificial intelligence (AI) models, such as large language models (LLMs), from multiple candidates requires accurate performance estimation. This is ideally achieved through empirical evaluations involving abundant real-world data. However, such evaluations are costly and impractical at scale. To address this challenge, autoevaluation methods leverage synthetic data produced by automated evaluators, such as LLMs-as-judges, reducing variance but potentially introducing bias. Recent approaches have employed semi-supervised prediction-powered inference ($\texttt{PPI}$) to correct for the bias of autoevaluators. However, the use of autoevaluators may lead in practice to a degradation in sample efficiency compared to conventional methods using only real-world data. In this paper, we propose $\texttt{R-AutoEval+}$, a novel framework that provides finite-sample reliability guarantees on the model evaluation, while also ensuring an enhanced (or at least no worse) sample efficiency compared to conventional methods. The key innovation of $\texttt{R-AutoEval+}$ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data, reverting to conventional methods when the autoevaluator is insufficiently accurate. Experiments on the use of LLMs-as-judges for the optimization of quantization settings for the weights of an LLM, for prompt design in LLMs, and for test-time reasoning budget allocation in LLMs confirm the reliability and efficiency of $\texttt{R-AutoEval+}$.
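The bias correction that prediction-powered inference applies to an autoevaluator can be illustrated with the basic PPI mean estimator. The following is a minimal sketch of that general idea only, not the adaptive construction of R-AutoEval+; the function name `ppi_mean` and the toy scores are hypothetical and chosen for illustration.

```python
from statistics import fmean


def ppi_mean(real_scores, judge_on_real, judge_on_unlabeled):
    """Basic prediction-powered estimate of a mean score.

    real_scores: scores from real-world (human-labeled) data.
    judge_on_real: autoevaluator scores on the same labeled examples.
    judge_on_unlabeled: autoevaluator scores on additional unlabeled examples.
    """
    if len(real_scores) != len(judge_on_real):
        raise ValueError("each real score needs a matching judge score")
    # Rectifier: the average gap between real and judge scores on labeled
    # data, which estimates (and removes) the bias of the autoevaluator.
    rectifier = fmean([y - f for y, f in zip(real_scores, judge_on_real)])
    return fmean(judge_on_unlabeled) + rectifier


# Toy data (hypothetical): the judge is too generous on one labeled example.
real = [1, 0, 1, 1]
judge_real = [1, 1, 1, 1]
judge_unlabeled = [1, 1, 1, 0, 1, 1, 1, 1]

estimate = ppi_mean(real, judge_real, judge_unlabeled)
print(estimate)  # 0.625
```

In this toy case the judge alone would report 0.875; the rectifier of -0.25 pulls the estimate down to 0.625. Whether the combined estimate is more sample-efficient than using the real scores alone depends on how accurate the judge is, which is the trade-off the abstract describes.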

artificial intelligence, large language model, natural language, (12 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing Systems, Jun-14-2026, 05:18:01 GMT

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at this link.

artificial intelligence, large language model, natural language, (7 more...)

Industry: Health & Medicine > Health Care Technology > Medical Record (0.59)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing Systems, Feb-10-2026, 20:47:54 GMT

e430ad64df3de73e6be33bcb7f6d0dac-AuthorFeedback.pdf

conventional method, dimension, reviewer, (8 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.32)

Neural Information Processing Systems, Aug-17-2025, 00:36:34 GMT

e430ad64df3de73e6be33bcb7f6d0dac-AuthorFeedback.pdf

conventional method, dimension, reviewer, (8 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.32)

Hanyu, Tatsuro, Katagiri, Takahiro, Mukunoki, Daichi, Hoshino, Tetsuya

Towards Generalized Parameter Tuning in Coherent Ising Machines: A Portfolio-Based Approach

arXiv.org Artificial Intelligence, Jul-29-2025

Coherent Ising Machines (CIMs) have recently gained attention as a promising computing model for solving combinatorial optimization problems. In particular, the Chaotic Amplitude Control (CAC) algorithm has demonstrated high solution quality, but its performance is highly sensitive to a large number of hyperparameters, making efficient tuning essential. In this study, we present an algorithm portfolio approach for hyperparameter tuning in CIMs employing the Chaotic Amplitude Control with momentum (CACm) algorithm. Our method incorporates multiple search strategies, enabling flexible and effective adaptation to the characteristics of the hyperparameter space. Specifically, we propose two representative tuning methods, Method A and Method B. Method A optimizes each hyperparameter sequentially with a fixed total number of trials, while Method B prioritizes hyperparameters based on initial evaluations before applying Method A in order. Performance evaluations were conducted on the Supercomputer "Flow" at Nagoya University, using planted Wishart instances and Time to Solution (TTS) as the evaluation metric. Compared to the baseline performance with best-known hyperparameters, Method A achieved up to a 1.47x improvement, and Method B achieved up to a 1.65x improvement. These results demonstrate the effectiveness of the algorithm portfolio approach in enhancing the tuning process for CIMs.

A. Background: As conventional computing approaches face limitations in solving large-scale combinatorial optimization problems, alternative models, such as quantum annealers and hybrid analog-digital systems, have garnered significant interest [1].
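The sequential strategy attributed to Method A in the abstract (tune one hyperparameter at a time under a fixed total trial budget) can be sketched roughly as follows. This is an illustrative sketch under assumptions, not the authors' implementation: the even split of the budget across hyperparameters, the uniform random sampling, the function name `tune_sequentially`, and the toy objective standing in for Time to Solution are all hypothetical.

```python
import random


def tune_sequentially(objective, defaults, ranges, total_trials, seed=0):
    """Tune one hyperparameter at a time while holding the others fixed.

    objective: maps a dict of hyperparameters to a cost (lower is better).
    defaults: starting values, e.g. best-known hyperparameters.
    ranges: dict of name -> (low, high) search interval.
    total_trials: fixed budget, split evenly across hyperparameters (assumption).
    """
    rng = random.Random(seed)
    best = dict(defaults)
    # Baseline evaluation; not counted against the trial budget in this sketch.
    best_cost = objective(best)
    trials_per_param = total_trials // len(ranges)
    for name, (low, high) in ranges.items():
        for _ in range(trials_per_param):
            candidate = dict(best)
            candidate[name] = rng.uniform(low, high)
            cost = objective(candidate)
            if cost < best_cost:
                # Keep the improvement before moving to the next trial.
                best, best_cost = candidate, cost
    return best, best_cost


def toy_cost(params):
    """Toy stand-in for Time to Solution, minimized at a=1, b=-2."""
    return (params["a"] - 1.0) ** 2 + (params["b"] + 2.0) ** 2


start = {"a": 0.0, "b": 0.0}
search_space = {"a": (-5.0, 5.0), "b": (-5.0, 5.0)}
tuned, tuned_cost = tune_sequentially(toy_cost, start, search_space, total_trials=40)
print(tuned, tuned_cost)
```

A prioritized variant in the spirit of Method B would reorder the entries of `ranges` according to some initial evaluation before running the same loop; the ranking criterion is not specified in the abstract, so it is left out here.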

algorithm, artificial intelligence, optimization problem, (14 more...)

2507.20295

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)

arXiv.org Artificial Intelligence, Jun-26-2025

Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests

Uto, Masaki, Ito, Yuma

Evaluating the abilities of learners is a fundamental objective in the field of education. In particular, there is an increasing need to assess higher-order abilities such as expressive skills and logical thinking. Constructed-response tests such as short-answer and essay-based questions have become widely used as a method to meet this demand. Although these tests are effective, they require substantial manual grading, making them both labor-intensive and costly. Item response theory (IRT) provides a promising solution by enabling the estimation of ability from incomplete score data, where human raters grade only a subset of answers provided by learners across multiple test items. However, the accuracy of ability estimation declines as the proportion of missing scores increases. Although data augmentation techniques for imputing missing scores have been explored in order to address this limitation, they often struggle with inaccuracy for sparse or heterogeneous data. To overcome these challenges, this study proposes a novel method for imputing missing scores by leveraging automated scoring technologies for accurate IRT-based ability estimation. The proposed method achieves high accuracy in ability estimation while markedly reducing manual grading workload.
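How IRT can estimate ability from incomplete score data can be illustrated with a small sketch. It assumes the one-parameter (Rasch) model with binary scores and known item difficulties, which is a simplification chosen for illustration and not necessarily the model used in the study; the function name `estimate_ability` and the item values are hypothetical. Ungraded items are simply left out of the likelihood.

```python
import math


def estimate_ability(graded_scores, difficulties, iterations=25):
    """Maximum-likelihood ability under a Rasch model, using graded items only.

    graded_scores: dict of item id -> 0 or 1, covering only the graded items.
    difficulties: dict of item id -> difficulty, covering all items.
    """
    total = sum(graded_scores.values())
    if total == 0 or total == len(graded_scores):
        # All-wrong or all-correct patterns have no finite maximum.
        raise ValueError("need at least one correct and one incorrect score")
    theta = 0.0
    for _ in range(iterations):
        gradient = 0.0
        information = 0.0
        for item, score in graded_scores.items():
            # Probability of a correct answer under the Rasch model.
            p = 1.0 / (1.0 + math.exp(-(theta - difficulties[item])))
            gradient += score - p
            information += p * (1.0 - p)
        # Newton-Raphson step on the log-likelihood.
        theta += gradient / information
    return theta


item_difficulty = {"q1": -1.0, "q2": 0.0, "q3": 1.0, "q4": 2.0}

# Learner with q2 and q4 ungraded (missing scores).
sparse = estimate_ability({"q1": 1, "q3": 0}, item_difficulty)
# Two learners graded on the same three items.
stronger = estimate_ability({"q1": 1, "q2": 1, "q3": 0}, item_difficulty)
weaker = estimate_ability({"q1": 1, "q2": 0, "q3": 0}, item_difficulty)
print(sparse, stronger, weaker)
```

With fewer graded items the likelihood is flatter and the estimate less precise, which is the accuracy loss the abstract attributes to a growing proportion of missing scores. Imputing the missing entries, for example with scores from an automated grader, would add terms to the sums in the loop.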

large language model, machine learning, natural language, (18 more...)

2506.20119

Country: Asia > Japan > Honshū (0.28)

Genre:

Research Report > New Finding (0.67)
Research Report > Promising Solution (0.54)

Industry:

Education > Assessment & Standards (0.97)
Education > Educational Technology > Educational Software > Computer Based Training (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Bhattacharya, Aditya, Verbert, Katrien

Let's Get You Hired: A Job Seeker's Perspective on Multi-Agent Recruitment Systems for Explaining Hiring Decisions

arXiv.org Artificial Intelligence, May-28-2025

During job recruitment, traditional applicant selection methods often lack transparency. Candidates are rarely given sufficient justifications for recruiting decisions, whether they are made manually by human recruiters or through the use of black-box Applicant Tracking Systems (ATS). To address this problem, our work introduces a multi-agent AI system that uses Large Language Models (LLMs) to guide job seekers during the recruitment process. Using an iterative user-centric design approach, we first conducted a two-phased exploratory study with four active job seekers to inform the design and development of the system. Subsequently, we conducted an in-depth, qualitative user study with 20 active job seekers through individual one-to-one interviews to evaluate the developed prototype. The results of our evaluation demonstrate that participants perceived our multi-agent recruitment system as significantly more actionable, trustworthy, and fair compared to traditional methods. Our study further helped us uncover in-depth insights into factors contributing to these perceived user experiences. Drawing from these insights, we offer broader design implications for building user-aligned, multi-agent explainable AI systems across diverse domains.

large language model, machine learning, natural language, (19 more...)

2505.20312

Country:

Europe (1.00)
Asia (1.00)
North America > United States > New York > New York County > New York City (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Education (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Xu, Anran, Heagy, Lindsey J.

Towards Understanding the Benefits of Neural Network Parameterizations in Geophysical Inversions: A Study With Neural Fields

arXiv.org Machine Learning, Mar-21-2025

In this work, we employ neural fields, which use neural networks to map a coordinate to the corresponding physical property value at that coordinate, in a test-time learning manner. For a test-time learning method, the weights are learned during the inversion, in contrast to traditional approaches, which require a network to be trained using a training data set. Results for synthetic examples in seismic tomography and direct current resistivity inversions are shown first. We then perform a singular value decomposition analysis on the Jacobian of the weights of the neural network (SVD analysis) for both cases to explore the effects of neural networks on the recovered model. The results show that the test-time learning approach can eliminate unwanted artifacts in the recovered subsurface physical property model caused by the sensitivity of the survey and physics. Therefore, the neural-field-based inversion (NFs-Inv) improves the inversion results compared to the conventional inversion in some cases, such as the recovery of the dip angle or the prediction of the boundaries of the main target. In the SVD analysis, we observe patterns in the left-singular vectors similar to those observed in some diffusion models, trained in a supervised manner, for generative tasks in computer vision. This observation provides evidence that there is an implicit bias, inherent in neural network structures, that is useful in both supervised learning and test-time learning models. This implicit bias has the potential to be useful for recovering models in geophysical inversions.

artificial intelligence, inversion, machine learning, (17 more...)

2503.17503

Country: North America > Canada (1.00)

Genre: Research Report > New Finding (0.88)

Industry: Energy > Oil & Gas > Upstream (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Khant, Kyi Shin, Lin, Hong Yi, Thongtanunam, Patanamon

Should Code Models Learn Pedagogically? A Preliminary Evaluation of Curriculum Learning for Real-World Software Engineering Tasks

arXiv.org Artificial Intelligence, Feb-6-2025

Learning-based techniques, especially advanced pre-trained models for code, have demonstrated capabilities in code understanding and generation, solving diverse software engineering (SE) tasks. Despite the promising results, current training approaches may not fully optimize model performance, as they typically involve learning from randomly shuffled training data. Recent work shows that Curriculum Learning (CL) can improve performance on code-related tasks through incremental learning based on the difficulty of synthetic code. Yet, the effectiveness of CL with conventional difficulty measures in SE tasks remains largely unexplored. In this study, we explore two conventional code metrics, code length and cyclomatic complexity, to determine the difficulty levels. We investigate how the pre-trained code model (CodeT5) learns under CL, through the tasks of code clone detection and code summarization. Our empirical study on the CodeXGLUE benchmark showed results that contrast with prior studies: the model exhibited signs of catastrophic forgetting and shortcut learning. Surprisingly, model performance saturates after only the first quartile of training, potentially indicating a limit in the model's representation capacity and/or the task's inherent difficulty. Future work should further explore various CL strategies with different code models across a wider range of SE tasks for a more holistic understanding.
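The idea of ordering training data by a conventional difficulty measure can be sketched as below. This is a rough illustration, not the study's pipeline: it uses code length in characters as the difficulty measure, splits the ordered data into four equal stages, and the function name `curriculum_stages` and the sample snippets are hypothetical.

```python
def curriculum_stages(samples, difficulty=len, num_stages=4):
    """Order samples from easy to hard and split them into training stages.

    difficulty: function mapping a sample to a number; defaults to code
    length in characters (an assumption made for this sketch).
    """
    if not samples:
        return []
    ordered = sorted(samples, key=difficulty)
    # Ceiling division so every sample lands in some stage.
    stage_size = -(-len(ordered) // num_stages)
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


snippets = [
    "def add(a, b):\n    return a + b",
    "x = 1",
    "for i in range(10):\n    if i % 2:\n        print(i)",
    "pass",
    "def f(xs):\n    return [x * x for x in xs if x > 0]",
    "import os",
    "y = x + 1",
    "while True:\n    break",
]

stages = curriculum_stages(snippets)
for index, stage in enumerate(stages, start=1):
    print(index, [len(code) for code in stage])
```

A curriculum trainer would then present stage 1 first and add harder stages over time. Swapping `difficulty` for a cyclomatic complexity function would give the second measure named in the abstract.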

artificial intelligence, machine learning, natural language, (18 more...)

2502.03806

Country: North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)