AITopics | evaluation agent

Collaborating Authors

evaluation agent

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Towards Automatic Evaluation and Selection of PHI De-identification Models via Multi-Agent Collaboration

Wu, Guanchen, Chen, Zuhui, Xie, Yuzhang, Yang, Carl

arXiv.org Artificial IntelligenceNov-19-2025

Protected health information (PHI) de-identification is critical for enabling the safe reuse of clinical notes, yet evaluating and comparing PHI de-identification models typically depends on costly, small-scale expert annotations. We present TEAM-PHI, a multi-agent evaluation and selection framework that uses large language models (LLMs) to automatically measure de-identification quality and select the best-performing model without heavy reliance on gold labels. TEAM-PHI deploys multiple Evaluation Agents, each independently judging the correctness of PHI extractions and outputting structured metrics. Their results are then consolidated through an LLM-based majority voting mechanism that integrates diverse evaluator perspectives into a single, stable, and reproducible ranking. Experiments on a real-world clinical note corpus demonstrate that TEAM-PHI produces consistent and accurate rankings: despite variation across individual evaluators, LLM-based voting reliably converges on the same top-performing systems. Further comparison with ground-truth annotations and human evaluation confirms that the framework's automated rankings closely match supervised evaluation. By combining independent evaluation agents with LLM majority voting, TEAM-PHI offers a practical, secure, and cost-effective solution for automatic evaluation and best-model selection in PHI de-identification, even when ground-truth labels are limited.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.16194

Country:

Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > Indonesia > Bali (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry:

Health & Medicine > Health Care Technology > Medical Record (0.89)
Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ArchPilot: A Proxy-Guided Multi-Agent Approach for Machine Learning Engineering

Yuan, Zhuowen, Liu, Tao, Yang, Yang, Wang, Yang, Qi, Feng, Rangadurai, Kaushik, Li, Bo, Yang, Shuang

arXiv.org Artificial IntelligenceNov-7-2025

Recent LLM-based agents have demonstrated strong capabilities in automated ML engineering. However, they heavily rely on repeated full training runs to evaluate candidate solutions, resulting in significant computational overhead, limited scalability to large search spaces, and slow iteration cycles. To address these challenges, we introduce ArchPilot, a multi-agent system that integrates architecture generation, proxy-based evaluation, and adaptive search into a unified framework. ArchPilot consists of three specialized agents: an orchestration agent that coordinates the search process using a Monte Carlo Tree Search (MCTS)-inspired novel algorithm with a restart mechanism and manages memory of previous candidates; a generation agent that iteratively generates, improves, and debugs candidate architectures; and an evaluation agent that executes proxy training runs, generates and optimizes proxy functions, and aggregates the proxy scores into a fidelity-aware performance metric. This multi-agent collaboration allows ArchPilot to prioritize high-potential candidates with minimal reliance on expensive full training runs, facilitating efficient ML engineering under limited budgets. Experiments on MLE-Bench demonstrate that ArchPilot outperforms SOTA baselines such as AIDE and ML-Master, validating the effectiveness of our multi-agent system.

archpilot, artificial intelligence, evaluation, (15 more...)

arXiv.org Artificial Intelligence

2511.03985

Country: North America > United States > Illinois (0.04)

Genre: Research Report (0.82)

Industry: Education > Curriculum > Subject-Specific Education (0.40)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Add feedback

MAVUL: Multi-Agent Vulnerability Detection via Contextual Reasoning and Interactive Refinement

Li, Youpeng, Joshi, Kartik, Wang, Xinda, Wong, Eric

arXiv.org Artificial IntelligenceOct-2-2025

Most vulnerability detection (VD) methods are limited by inadequate contextual understanding, restrictive single-round interactions, and coarse-grained evaluations, resulting in undesired model performance and biased evaluation results. T o address these challenges, we propose MA VUL, a novel multi-agent VD system that integrates contextual reasoning and interactive refinement. Specifically, a vulnerability analyst agent is designed to flexibly leverage tool-using capabilities and contextual reasoning to achieve cross-procedural code understanding and effectively mine vulnerability patterns. Through iterative feedback and refined decision-making within cross-role agent interactions, the system achieves reliable reasoning and vulnerability prediction. Furthermore, MA VUL introduces multi-dimensional ground truth information for fine-grained evaluation, thereby enhancing evaluation accuracy and reliability. Extensive experiments conducted on a pairwise vulnerability dataset demonstrate MA VUL's superior performance. Our findings indicate that MA VUL significantly outperforms existing multi-agent systems with over 62% higher pairwise accuracy and single-agent systems with over 600% higher average performance. The system's effectiveness is markedly improved with increased communication rounds between the vulnerability analyst agent and the security architect agent, underscoring the importance of contextual reasoning in tracing vulnerability flows and the crucial feedback role. Additionally, the integrated evaluation agent serves as a critical, unbiased judge, ensuring a more accurate and reliable estimation of the system's real-world applicability by preventing misleading binary comparisons.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2510.00317

Country:

North America > United States > California > San Francisco County > San Francisco (0.28)
Europe > Austria > Vienna (0.14)
North America > United States > Texas (0.04)
(10 more...)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Configurable multi-agent framework for scalable and realistic testing of llm-based agents

Wang, Sai, Subramanian, Senthilnathan, Sahni, Mudit, Gone, Praneeth, Meng, Lingjie, Wang, Xiaochen, Bertoli, Nicolas Ferradas, Cheng, Tingxian, Xu, Jun

arXiv.org Artificial IntelligenceJul-22-2025

Large-language-model (LLM) agents exhibit complex, context-sensitive behaviour that quickly renders static benchmarks and ad-hoc manual testing obsolete. We present Neo, a configurable, multi-agent framework that automates realistic, multi-turn evaluation of LLM-based systems. Neo couples a Question Generation Agent and an Evaluation Agent through a shared context-hub, allowing domain prompts, scenario controls and dynamic feedback to be composed modularly. Test inputs are sampled from a probabilistic state model spanning dialogue flow, user intent and emotional tone, enabling diverse, human-like conversations that adapt after every turn. Applied to a production-grade Seller Financial Assistant chatbot, Neo (i) uncovered edge-case failures across five attack categories with a 3.3% break rate close to the 5.8% achieved by expert human red-teamers, and (ii) delivered 10-12X higher throughput, generating 180 coherent test questions in around 45 mins versus 16h of human effort. Beyond security probing, Neo's stochastic policies balanced topic coverage and conversational depth, yielding broader behavioural exploration than manually crafted scripts. Neo therefore lays a foundation for scalable, self-evolving LLM QA: its agent interfaces, state controller and feedback loops are model-agnostic and extensible to richer factual-grounding and policy-compliance checks. We release the framework to facilitate reproducible, high-fidelity testing of emerging agentic systems.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.14705

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews

Xu, Huimin, Yi, Seungjun, Lim, Terence, Xu, Jiawei, Well, Andrew, Mery, Carlos, Zhang, Aidong, Zhang, Yuji, Ji, Heng, Pingali, Keshav, Leng, Yan, Ding, Ying

arXiv.org Artificial IntelligenceMar-26-2025

Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. TA provides valuable insights in healthcare but is resource-intensive. Large Language Models (LLMs) have been introduced to perform TA, yet their applications in healthcare remain unexplored. Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews. We leverage the scalability and coherence of multi-agent systems through structured conversations between agents and coordinate the expertise of cardiac experts in TA. Using interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness. TAMA demonstrates strong potential for automated TA in clinical settings by leveraging multi-agent LLM systems with human-in-the-loop integration by enhancing quality while significantly reducing manual workload.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2503.20666

Country:

North America > United States > Texas > Travis County > Austin (0.05)
North America > United States > Tennessee > Davidson County > Nashville (0.04)
Asia > Singapore (0.04)
(3 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Armstrong, Stuart, Franklin, Matija, Stevens, Connor, Gorman, Rebecca

arXiv.org Artificial IntelligenceFeb-1-2025

Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc) is effective against all major large language models (LLMs). We have found that $100\%$ of the BoN paper's successful jailbreaks (confidence interval $[99.65\%, 100.00\%]$) and $99.8\%$ of successful jailbreaks in our replication (confidence interval $[99.28\%, 99.98\%]$) were blocked with our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly utilizing an evaluation LLM to evaluate a prompt for dangerous or manipulative behaviors--unlike some other approaches, DATDP also explicitly looks for jailbreaking attempts--until a robust safety rating is generated. This success persisted even when utilizing smaller LLMs to power the evaluation (Claude and LLaMa-3-8B-instruct proved almost equally capable). These results show that, though language models are sensitive to seemingly innocuous changes to inputs, they seem also capable of successfully evaluating the dangers of these inputs. Versions of DATDP can therefore be added cheaply to generative AI systems to produce an immediate significant increase in safety.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2502.0058

Country:

North America > United States (0.14)
Africa > South Africa > Gauteng > Johannesburg (0.05)
North America > Canada (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)

Add feedback

Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models

Zhang, Fan, Tian, Shulin, Huang, Ziqi, Qiao, Yu, Liu, Ziwei

arXiv.org Artificial IntelligenceDec-15-2024

Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model's capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation.

evaluation, evaluation agent, query, (12 more...)

arXiv.org Artificial Intelligence

2412.09645

Country: Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)

Add feedback

Agentic AI for Improving Precision in Identifying Contributions to Sustainable Development Goals

Ingram, William A., Banerjee, Bipasha, Fox, Edward A.

arXiv.org Artificial IntelligenceNov-26-2024

As research institutions increasingly commit to supporting the United Nations' Sustainable Development Goals (SDGs), there is a pressing need to accurately assess their research output against these goals. Current approaches, primarily reliant on keyword-based Boolean search queries, conflate incidental keyword matches with genuine contributions, reducing retrieval precision and complicating benchmarking efforts. This study investigates the application of autoregressive Large Language Models (LLMs) as evaluation agents to identify relevant scholarly contributions to SDG targets in scholarly publications. Using a dataset of academic abstracts retrieved via SDG-specific keyword queries, we demonstrate that small, locally-hosted LLMs can differentiate semantically relevant contributions to SDG targets from documents retrieved due to incidental keyword matches, addressing the limitations of traditional methods. By leveraging the contextual understanding of LLMs, this approach provides a scalable framework for improving SDG-related research metrics and informing institutional reporting.

classification, contribution, sdg target, (13 more...)

arXiv.org Artificial Intelligence

2411.17598

Country:

North America > United States > Virginia > Montgomery County > Blacksburg (0.05)
Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report > New Finding (0.47)

Industry: Social Sector (0.74)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Add feedback

From Motor Control to Team Play in Simulated Humanoid Football

Liu, Siqi, Lever, Guy, Wang, Zhe, Merel, Josh, Eslami, S. M. Ali, Hennes, Daniel, Czarnecki, Wojciech M., Tassa, Yuval, Omidshafiei, Shayegan, Abdolmaleki, Abbas, Siegel, Noah Y., Hasenclever, Leonard, Marris, Luke, Tunyasuvunakool, Saran, Song, H. Francis, Wulfmeier, Markus, Muller, Paul, Haarnoja, Tuomas, Tracey, Brendan D., Tuyls, Karl, Graepel, Thore, Heess, Nicolas

arXiv.org Artificial IntelligenceMay-25-2021

Intelligent behaviour in the physical world exhibits structure at multiple spatial and temporal scales. Although movements are ultimately executed at the level of instantaneous muscle tensions or joint torques, they must be selected to serve goals defined on much longer timescales, and in terms of relations that extend far beyond the body itself, ultimately involving coordination with other agents. Recent research in artificial intelligence has shown the promise of learning-based approaches to the respective problems of complex movement, longer-term planning and multi-agent coordination. However, there is limited research aimed at their integration. We study this problem by training teams of physically simulated humanoid avatars to play football in a realistic virtual environment. We develop a method that combines imitation learning, single- and multi-agent reinforcement learning and population-based training, and makes use of transferable representations of behaviour for decision making at different levels of abstraction. In a sequence of stages, players first learn to control a fully articulated body to perform realistic, human-like movements such as running and turning; they then acquire mid-level football skills such as dribbling and shooting; finally, they develop awareness of others and play as a team, bridging the gap between low-level motor control at a timescale of milliseconds, and coordinated goal-directed behaviour as a team at the timescale of tens of seconds. We investigate the emergence of behaviours at different levels of abstraction, as well as the representations that underlie these behaviours using several analysis techniques, including statistics from real-world sports analytics. Our work constitutes a complete demonstration of integrated decision-making at multiple scales in a physically embodied multi-agent setting. See project video at https://youtu.be/KHMwq9pv7mg.

agent, motor control, team play, (14 more...)

arXiv.org Artificial Intelligence

2105.12196

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > California > Santa Clara County > Stanford (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.14)
(17 more...)

Genre: Research Report > New Finding (0.92)

Industry:

Leisure & Entertainment > Sports > Soccer (1.00)
Leisure & Entertainment > Sports > Football (1.00)
Leisure & Entertainment > Games (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)

Add feedback