Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Smith, Chandler, Abdulhai, Marwa, Diaz, Manfred, Tesic, Marko, Trivedi, Rakshit S., Vezhnevets, Alexander Sasha, Hammond, Lewis, Clifton, Jesse, Chang, Minsuk, Duéñez-Guzmán, Edgar A., Agapiou, John P., Matyas, Jayd, Karmon, Danny, Kundu, Akash, Korshuk, Aliaksei, Ananya, Ananya, Rahman, Arrasy, Kulandaivel, Avinaash Anand, McHale, Bain, Zhang, Beining, Alexander, Buyantuev, Rojas, Carlos Saith Rodriguez, Wang, Caroline, Talele, Chetan, Liu, Chenao, Lin, Chichen, Riazi, Diana, Shi, Di Yang, Tewolde, Emanuel, Tennant, Elizaveta, Zhong, Fangwei, Cui, Fuyang, Zhao, Gang, Piqueras, Gema Parreño, Yun, Hyeonggeun, Makarov, Ilya, Cui, Jiaxun, Purbey, Jebish, Dilkes, Jim, Nguyen, Jord, Xiao, Lingyun, Giraldo, Luis Felipe, Chacon-Chamorro, Manuela, Beltran, Manuel Sebastian Rios, Segura, Marta Emili García, Wang, Mengmeng, Alim, Mogtaba, Quijano, Nicanor, Schiavone, Nico, Macmillan-Scott, Olivia, Peña, Oswaldo, Stone, Peter, Kadiyala, Ram Mohan Rao, Fernandez, Rolando, Manrique, Ruben, Lu, Sunjia, McIlraith, Sheila A., Dhuri, Shamika, Shi, Shuqing, Gupta, Siddhant, Sarangi, Sneheel, Subramanian, Sriram Ganapathi, Cha, Taehun, Klassen, Toryn Q., Tu, Wenming, Fan, Weijian, Ruiyang, Wu, Feng, Xue, Du, Yali, Liu, Yang, Wang, Yiding, Kang, Yipeng, Sung, Yoonchang, Chen, Yuxuan, Zhang, Zhaowei, Wang, Zhihan, Wu, Zhiqiang, Chen, Ziang, Zheng, Zilong, Jia, Zixia, Wang, Ziyan, Hadfield-Menell, Dylan, Jaques, Natasha, Baarslag, Tim, Hernandez-Orallo, Jose, Leibo, Joel Z.

arXiv.org Artificial Intelligence

Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.
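
To make the evaluation setup concrete, here is a minimal sketch of a zero-shot cross-play loop in the spirit of the contest: each focal agent is scored against combinations of unseen background partners across scenarios, and the mean payoff summarizes its general cooperative performance. The function and agent names are hypothetical stand-ins, not the real Concordia API.

```python
# Hypothetical sketch of a cross-play evaluation loop; `run_scenario`
# stands in for a Concordia simulation and is NOT the real Concordia API.
import itertools
import statistics


def run_scenario(scenario: str, focal_agent: str, partners: list[str]) -> float:
    """Stub: run one episode and return the focal agent's payoff in [0, 1]."""
    # In the real setting this would launch a Concordia simulation with
    # LLM-backed agents and score the episode.
    return hash((scenario, focal_agent, tuple(partners))) % 100 / 100.0


def evaluate(focal_agents, background_agents, scenarios, partners_per_episode=2):
    """Average each focal agent's payoff over unseen partners and scenarios."""
    scores = {}
    for agent in focal_agents:
        episode_scores = []
        for scenario in scenarios:
            for partners in itertools.combinations(background_agents, partners_per_episode):
                episode_scores.append(run_scenario(scenario, agent, list(partners)))
        scores[agent] = statistics.mean(episode_scores)
    return scores


if __name__ == "__main__":
    print(evaluate(["agent_a", "agent_b"],
                   ["bg_1", "bg_2", "bg_3"],
                   ["negotiation", "collective_action"]))
```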


Effects of Wrist-Worn Haptic Feedback on Force Accuracy and Task Speed during a Teleoperated Robotic Surgery Task

Vuong, Brian B., Davidson, Josie, Cheon, Sangheui, Cho, Kyujin, Okamura, Allison M.

arXiv.org Artificial Intelligence

Previous work has shown that the addition of haptic feedback to the hands can improve awareness of tool-tissue interactions and enhance performance of teleoperated tasks in robot-assisted minimally invasive surgery. However, hand-based haptic feedback occludes direct interaction with the manipulanda of the surgeon console in teleoperated surgical robots. We propose relocating haptic feedback to the wrist using a wearable haptic device so that haptic feedback mechanisms do not need to be integrated into the manipulanda. However, it is unknown whether such feedback will be effective, given that it is not co-located with the finger movements used for manipulation. To test whether relocated haptic feedback improves force application during teleoperated tasks using the da Vinci Research Kit (dVRK) surgical robot, participants learned to palpate a phantom tissue to desired forces. Participants performed the palpation task with and without wrist-worn haptic feedback and were evaluated for the accuracy of applied forces. Participants demonstrated statistically significantly lower force error when wrist-worn haptic feedback was provided. Participants also performed the palpation task with longer movement times when provided wrist-worn haptic feedback, indicating that the haptic feedback may have caused participants to operate at a different point on the speed-accuracy tradeoff curve.


POCAII: Parameter Optimization with Conscious Allocation using Iterative Intelligence

Inman, Joshua, Khandait, Tanmay, Sankar, Lalitha, Pedrielli, Giulia

arXiv.org Machine Learning

In this paper we propose for the first time the hyperparameter optimization (HPO) algorithm POCAII. POCAII differs from the Hyperband and Successive Halving literature by explicitly separating the search and evaluation phases and by using principled approaches to exploration and exploitation during both. This distinction yields a highly flexible scheme for managing a hyperparameter optimization budget: effort is focused on search (i.e., generating competing configurations) toward the start of the HPO process, while evaluation effort increases as the HPO run comes to an end. POCAII was compared to the state-of-the-art approaches SMAC, BOHB, and DEHB. Our algorithm shows superior performance in low-budget hyperparameter optimization regimes; since many practitioners do not have exhaustive resources to assign to HPO, it has wide applicability to real-world problems. Moreover, the empirical evidence shows that POCAII achieves higher robustness and lower variance in its results, which is especially important in realistic scenarios with extremely expensive models to train.
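
As a rough illustration of the search-versus-evaluation split, the sketch below shifts budget from sampling new configurations early to re-evaluating incumbents late, using a simple linear schedule. POCAII's actual allocation rule is more principled; everything here (the schedule, the toy objective, the parameter names) is an assumption for illustration only.

```python
# Illustrative sketch only: a linear schedule shifts budget from generating
# new configurations (search) to re-evaluating incumbents (evaluation).
import random


def sample_config():
    return {"lr": 10 ** random.uniform(-5, -1), "dropout": random.uniform(0.0, 0.5)}


def train_and_score(config, epochs):
    # Stub for an expensive training run; lower is better.
    return abs(config["lr"] - 1e-3) + config["dropout"] / 10 + random.gauss(0, 0.01) / epochs


def hpo(total_rounds=50):
    pool = []  # [score, config] pairs
    for t in range(total_rounds):
        p_search = 1.0 - t / total_rounds  # search early, evaluate late
        if not pool or random.random() < p_search:
            cfg = sample_config()
            pool.append([train_and_score(cfg, epochs=1), cfg])
        else:
            best = min(pool, key=lambda x: x[0])  # re-evaluate the incumbent
            best[0] = min(best[0], train_and_score(best[1], epochs=4))
    return min(pool, key=lambda x: x[0])


if __name__ == "__main__":
    score, cfg = hpo()
    print(f"best score {score:.4f} with config {cfg}")
```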


Technical Report: Evaluating Goal Drift in Language Model Agents

Arike, Rauno, Donoway, Elizabeth, Bartsch, Henning, Hobbhahn, Marius

arXiv.org Artificial Intelligence

As language models (LMs) are increasingly deployed as autonomous agents, their robust adherence to human-assigned objectives becomes crucial for safe operation. When these agents operate independently for extended periods without human oversight, even initially well-specified goals may gradually shift. Detecting and measuring goal drift - an agent's tendency to deviate from its original objective over time - presents significant challenges, as goals can shift gradually, causing only subtle behavioral changes. This paper proposes a novel approach to analyzing goal drift in LM agents. In our experiments, agents are first explicitly given a goal through their system prompt, then exposed to competing objectives through environmental pressures. We demonstrate that while the best-performing agent (a scaffolded version of Claude 3.5 Sonnet) maintains nearly perfect goal adherence for more than 100,000 tokens in our most difficult evaluation setting, all evaluated models exhibit some degree of goal drift. We also find that goal drift correlates with models' increasing susceptibility to pattern-matching behaviors as the context length grows.
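
A minimal harness for this kind of measurement might track the fraction of goal-consistent actions over a sliding window of the interaction, producing a curve that reveals gradual drift. The judge below is a toy stand-in, not the paper's scoring method, and all names are illustrative.

```python
# Hypothetical harness for tracking goal adherence over a long interaction;
# `judge_adherence` is a stand-in for whatever scoring the authors use.
def judge_adherence(goal: str, action: str) -> bool:
    """Stub: return True if the action is consistent with the goal."""
    return goal.split()[0] in action  # toy heuristic, not the paper's method


def goal_drift_curve(goal, actions, window=10):
    """Fraction of goal-consistent actions in successive windows of the run."""
    flags = [judge_adherence(goal, a) for a in actions]
    return [sum(flags[i:i + window]) / window
            for i in range(0, len(flags) - window + 1, window)]


if __name__ == "__main__":
    goal = "maximize recycling throughput"
    trace = ["maximize sorting speed"] * 30 + ["minimize costs"] * 20
    print(goal_drift_curve(goal, trace))  # adherence drops in later windows
```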


SemEval-2025 Task 9: The Food Hazard Detection Challenge

Randl, Korbinian, Pavlopoulos, John, Henriksson, Aron, Lindgren, Tony, Bakagianni, Juli

arXiv.org Artificial Intelligence

In this challenge, we explored text-based food hazard prediction with long-tailed class distributions. The task was divided into two subtasks: (1) predicting whether a web text implies one of ten food-hazard categories and identifying the associated food category, and (2) providing a more fine-grained classification by assigning a specific label to both the hazard and the product. Our findings highlight that synthetic data generated by large language models can be highly effective for oversampling long-tail distributions. Furthermore, we find that fine-tuned encoder-only, encoder-decoder, and decoder-only systems achieve comparable maximum performance across both subtasks. During this challenge, we gradually released (under CC BY-NC-SA 4.0) a novel set of 6,644 manually labeled food-incident reports.
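
One way to realize the synthetic oversampling idea is to paraphrase examples of rare classes with an LLM until each class reaches a target count. The sketch below assumes a placeholder `llm_paraphrase` call and a target count chosen by the user; the challenge's actual generation prompts are not reproduced here.

```python
# Sketch of LLM-based oversampling for long-tail classes; `llm_paraphrase`
# is a placeholder, not an API from the shared task.
from collections import defaultdict
import random


def llm_paraphrase(text: str) -> str:
    """Stub for an LLM call that rewrites an incident report."""
    return text + " (paraphrased)"


def oversample(dataset, target_per_class=100):
    """dataset: list of (text, label) pairs; pad rare labels with paraphrases."""
    by_class = defaultdict(list)
    for text, label in dataset:
        by_class[label].append(text)
    augmented = list(dataset)
    for label, texts in by_class.items():
        for _ in range(max(0, target_per_class - len(texts))):
            augmented.append((llm_paraphrase(random.choice(texts)), label))
    return augmented
```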


Continual Skill and Task Learning via Dialogue

Gu, Weiwei, Kondepudi, Suresh, Huang, Lixiao, Gopalan, Nakul

arXiv.org Artificial Intelligence

Continual and interactive robot learning is a challenging problem, as the robot works with human users who expect it to learn novel skills to solve novel tasks perpetually and with sample efficiency. In this work we present a framework for robots to query and learn visuo-motor robot skills and task-relevant information via natural language dialogue interactions with human users. Previous approaches either focus on improving the performance of instruction-following agents, or passively learn novel skills or concepts. Instead, we use dialogue combined with a language-skill grounding embedding to query or confirm skills and/or tasks requested by a user. To achieve this goal, we developed and integrated three components into our agent. First, we propose a novel visuo-motor control policy, ACT with Low Rank Adaptation (ACT-LoRA), which enables the existing SoTA ACT model to perform few-shot continual learning. Second, we develop an alignment model that projects demonstrations across skill embodiments into a shared embedding, allowing us to know when to ask questions and/or request demonstrations from users. Finally, we integrate an existing LLM to interact with a human user to perform grounded interactive continual skill learning to solve a task. Our ACT-LoRA model learns novel fine-tuned skills with 100% accuracy when trained with only five demonstrations per novel skill, while still maintaining 74.75% accuracy on pre-trained skills in the RLBench dataset, where other models fall significantly short. We also performed a human-subjects study with 8 participants to demonstrate the continual learning capabilities of our combined framework. We achieve a success rate of 75% on the task of sandwich making with the real robot learning from participant data, demonstrating that robots can learn novel skills or task knowledge from dialogue with non-expert users using our approach.
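
The "when to ask" decision can be pictured as a nearest-neighbor lookup in the shared embedding space: if the requested skill's embedding is close enough to a known skill, execute it; otherwise ask the user for a demonstration. The threshold and all names below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the "know when to ask" step via embedding similarity.
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def decide(request_embedding, skill_library, threshold=0.8):
    """skill_library: dict mapping skill name -> embedding vector.

    Returns ("execute", name) for a confident match, otherwise
    ("ask_for_demonstration", closest_name) to trigger a dialogue query.
    """
    best_name, best_sim = None, -1.0
    for name, emb in skill_library.items():
        sim = cosine(request_embedding, emb)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim >= threshold:
        return ("execute", best_name)
    return ("ask_for_demonstration", best_name)
```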


Automating Patch Set Generation from Code Review Comments Using Large Language Models

Rahman, Tajmilur, Singh, Rahul, Sultan, Mir Yousuf

arXiv.org Artificial Intelligence

The advent of Large Language Models (LLMs) has revolutionized various domains of artificial intelligence, including software engineering. In this research, we evaluate the efficacy of pre-trained LLMs in replicating the tasks traditionally performed by developers in response to code review comments. We provide code contexts to five popular LLMs and obtain the suggested code changes (patch sets) derived from real-world code review comments. The performance of each model is assessed by comparing its generated patch sets against the historical data of human-generated patch sets from the same repositories. This comparative analysis aims to determine the accuracy, relevance, and depth of the LLMs' feedback, thereby evaluating their readiness to support developers in responding to code review comments. Novelty: This research area is still immature, and a substantial amount of study remains to be done; no prior research has compared the performance of existing LLMs in responding to code review comments. This in-progress study assesses current LLMs in code review and paves the way for future advancements in automated code quality assurance, reducing the context-switching overhead caused by interruptions from code change requests.
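
A minimal version of this evaluation pipeline might prompt a model with the code context and the review comment, then compare the generated patch to the historical human patch. The `query_llm` stub and the difflib similarity metric below are assumptions for illustration, not the study's actual setup.

```python
# Sketch of the compare-against-human-patch evaluation; `query_llm` is a
# placeholder, and difflib similarity is an assumed metric.
import difflib


def query_llm(prompt: str) -> str:
    """Stub for a call to one of the evaluated LLMs."""
    return "int counter = 0;  // renamed per review"


def build_prompt(code_context: str, review_comment: str) -> str:
    return (f"Code under review:\n{code_context}\n\n"
            f"Reviewer comment:\n{review_comment}\n\n"
            "Rewrite the code to address the comment.")


def score_patch(generated: str, human_patch: str) -> float:
    """Similarity in [0, 1] between generated and historical patches."""
    return difflib.SequenceMatcher(None, generated, human_patch).ratio()


if __name__ == "__main__":
    patch = query_llm(build_prompt("int a = 0;", "Use a descriptive name."))
    print(score_patch(patch, "int counter = 0;"))
```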


MasonTigers at SemEval-2024 Task 1: An Ensemble Approach for Semantic Textual Relatedness

Goswami, Dhiman, Puspo, Sadiya Sayara Chowdhury, Raihan, Md Nishat, Emran, Al Nahian Bin, Ganguly, Amrita, Zampieri, Marcos

arXiv.org Artificial Intelligence

This paper presents the MasonTigers entry to SemEval-2024 Task 1 - Semantic Textual Relatedness. The task encompasses supervised (Track A), unsupervised (Track B), and cross-lingual (Track C) approaches across 14 different languages. MasonTigers stands out as one of the two teams that participated in all languages across the three tracks. Our approaches achieved rankings ranging from 11th to 21st in Track A, from 1st to 8th in Track B, and from 5th to 12th in Track C. Adhering to the task-specific constraints, our best-performing approaches use an ensemble of statistical machine learning methods combined with language-specific BERT-based models and sentence transformers.
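
As a sketch of such an ensemble (not the team's exact configuration), one can average a TF-IDF cosine score with a sentence-transformer cosine score per sentence pair. The model choice and equal weighting below are assumptions, and the snippet requires the scikit-learn and sentence-transformers packages.

```python
# Illustrative statistical + neural ensemble for sentence-pair relatedness.
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def relatedness(pairs):
    """pairs: list of (sentence_a, sentence_b); returns one score per pair."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    texts = [t for pair in pairs for t in pair]
    tfidf = TfidfVectorizer().fit_transform(texts)
    dense = model.encode(texts)
    scores = []
    for i in range(0, len(texts), 2):
        s_tfidf = cosine_similarity(tfidf[i], tfidf[i + 1])[0, 0]
        s_dense = cosine_similarity([dense[i]], [dense[i + 1]])[0, 0]
        scores.append((s_tfidf + s_dense) / 2)  # simple unweighted ensemble
    return scores


if __name__ == "__main__":
    print(relatedness([("A man is eating food.", "A man is eating a meal.")]))
```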


BTGenBot: Behavior Tree Generation for Robotic Tasks with Lightweight LLMs

Izzo, Riccardo Andrea, Bardaro, Gianluca, Matteucci, Matteo

arXiv.org Artificial Intelligence

This paper presents a novel approach to generating behavior trees for robots using lightweight large language models (LLMs) with a maximum of 7 billion parameters. The study demonstrates that it is possible to achieve satisfactory results with compact LLMs when they are fine-tuned on a specific dataset. The key contributions of this research include the creation of a fine-tuning dataset based on existing behavior trees using GPT-3.5 and a comprehensive comparison of multiple LLMs (namely llama2, llama-chat, and code-llama) across nine distinct tasks. To be thorough, we evaluated the generated behavior trees using static syntactical analysis, a validation system, a simulated environment, and a real robot. Furthermore, this work opens the possibility of deploying such solutions directly on the robot, enhancing its practical applicability. Findings from this study demonstrate the potential of LLMs with a limited number of parameters in generating effective and efficient robot behaviors.
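
The static syntactic analysis step can be approximated by checking that a generated tree is well-formed XML and uses only known node types, as in the sketch below. The accepted node set is an illustrative assumption, and the paper's validation system is richer than this check.

```python
# Sketch of a static syntactic check for generated behavior trees in
# BehaviorTree.CPP-style XML; KNOWN_NODES is an illustrative assumption.
import xml.etree.ElementTree as ET

KNOWN_NODES = {"root", "BehaviorTree", "Sequence", "Fallback", "Action", "Condition"}


def validate_bt(xml_text: str) -> list[str]:
    """Return a list of problems; an empty list means the tree passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    errors = [f"unknown node type: {node.tag}"
              for node in root.iter() if node.tag not in KNOWN_NODES]
    if root.find(".//BehaviorTree") is None:
        errors.append("missing <BehaviorTree> element")
    return errors


if __name__ == "__main__":
    sample = "<root><BehaviorTree><Sequence><Action/></Sequence></BehaviorTree></root>"
    print(validate_bt(sample) or "valid")
```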


Hyperparameters in Continual Learning: a Reality Check

Cha, Sungmin, Cho, Kyunghyun

arXiv.org Artificial Intelligence

Various algorithms for continual learning (CL) have been designed with the goal of effectively alleviating the trade-off between stability and plasticity during the CL process. To achieve this goal, tuning appropriate hyperparameters for each algorithm is essential. As an evaluation protocol, it has been common practice to train a CL algorithm using diverse hyperparameter values on a CL scenario constructed with a benchmark dataset; the best performance attained with the optimal hyperparameter value then serves as the criterion for evaluating the algorithm. In this paper, we contend that this evaluation protocol is not only impractical but also incapable of effectively assessing the CL capability of a CL algorithm. Returning to the fundamental principles of model evaluation in machine learning, we propose an evaluation protocol that involves a Hyperparameter Tuning phase and an Evaluation phase. The two phases use different datasets but share the same CL scenario. In the Hyperparameter Tuning phase, each algorithm is iteratively trained with different hyperparameter values to find the optimal ones. Subsequently, in the Evaluation phase, the optimal hyperparameter values are directly applied to train each algorithm, and its performance in this phase serves as the criterion for evaluation. Through experiments on CIFAR-100 and ImageNet-100 based on the proposed protocol in class-incremental learning, we not only observe that the existing evaluation method fails to properly assess the CL capability of each algorithm, but also find that some recently proposed state-of-the-art algorithms, which reported superior performance, actually exhibit inferior performance compared to earlier algorithms.
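
A minimal sketch of the proposed two-phase protocol, with a stubbed training routine: hyperparameters are tuned on one dataset's CL scenario and then applied unchanged to a second dataset with the same scenario. All names and the toy objective below are placeholders, not the paper's code.

```python
# Sketch of the two-phase evaluation protocol; `run_cl` stands in for
# training a CL algorithm through a task sequence and reporting accuracy.
import random


def run_cl(dataset: str, hp: dict) -> float:
    """Stub for class-incremental training; returns final average accuracy."""
    return random.random() * hp.get("reg_strength", 1.0)


def two_phase_eval(tuning_dataset, eval_dataset, hp_grid):
    # Phase 1: hyperparameter tuning on the tuning dataset.
    best_hp = max(hp_grid, key=lambda hp: run_cl(tuning_dataset, hp))
    # Phase 2: apply the tuned values unchanged on the evaluation dataset;
    # this score is the criterion for comparing CL algorithms.
    return best_hp, run_cl(eval_dataset, best_hp)


if __name__ == "__main__":
    grid = [{"reg_strength": s} for s in (0.1, 0.5, 1.0)]
    print(two_phase_eval("CIFAR-100", "ImageNet-100", grid))
```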