Interactive Evaluation
Mind the Gap! Static and Interactive Evaluations of Large Audio Models
Minzhi Li, William Barr Held, Michael J. Ryan, Kunat Pipatanakul, Potsawee Manakul, Hao Zhu, Diyi Yang
As AI chatbots become ubiquitous, voice interaction presents a compelling way to enable rapid, high-bandwidth communication of both semantic and social signals. This has driven research into Large Audio Models (LAMs) to power voice-native experiences. However, aligning LAM development with user goals requires a clear understanding of user needs and preferences to establish reliable progress metrics. This study addresses these challenges by introducing an interactive approach to evaluate LAMs and collecting 7,500 LAM interactions from 484 participants. Through topic modeling of user queries, we identify primary use cases for audio interfaces. We then analyze user preference rankings and qualitative feedback to determine which models best align with user needs. Finally, we evaluate how well static benchmarks predict interactive performance: our analysis reveals that no individual benchmark strongly correlates with interactive results ($\tau \leq 0.33$ for all benchmarks). While combining multiple coarse-grained features yields modest predictive power ($R^2 = 0.30$), only two of the twenty datasets (one on spoken question answering and one on age prediction) show significantly positive correlations. This suggests a clear need to develop LAM evaluations that better correlate with user preferences.
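The rank-correlation check described in this abstract can be sketched in a few lines of pure Python; the model scores below are hypothetical illustrations, not the paper's data, and the function is a plain Kendall tau-a without tie handling.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: rank correlation between two equal-length score lists."""
    assert len(x) == len(y)
    concordant = discordant = 0
    # Compare every pair of items: do the two rankings order them the same way?
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for five LAMs: a static benchmark vs. an
# interactive preference score (illustrative numbers only).
static_scores      = [0.82, 0.75, 0.71, 0.66, 0.60]
interactive_scores = [0.40, 0.55, 0.35, 0.50, 0.30]
print(kendall_tau(static_scores, interactive_scores))  # → 0.4
```

A weak tau like this, despite a decent static score spread, is the kind of mismatch between static and interactive results the abstract reports.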
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, Maarten Sap
Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-Eval. With SOTOPIA, we find significant differences between these models in terms of their social intelligence, and we identify a subset of SOTOPIA scenarios, SOTOPIA-hard, that is generally challenging for all models. We find that on this subset, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. These findings demonstrate SOTOPIA's promise as a general platform for research on evaluating and improving social intelligence in artificial agents.
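The selection of a generally hard scenario subset, in the spirit of SOTOPIA-hard, can be illustrated with a small sketch; the scenario names, model names, scores, and threshold below are all hypothetical, not taken from the paper.

```python
def hard_subset(scores, threshold=0.5):
    """Return scenarios on which every model's goal completion falls below threshold.

    scores: dict mapping scenario name -> {model name: goal completion in [0, 1]}.
    """
    return sorted(
        scenario
        for scenario, per_model in scores.items()
        if all(v < threshold for v in per_model.values())
    )

# Hypothetical per-scenario goal-completion rates for two models.
scores = {
    "negotiate_rent":   {"model_a": 0.30, "model_b": 0.20},
    "plan_party":       {"model_a": 0.80, "model_b": 0.60},
    "resolve_conflict": {"model_a": 0.40, "model_b": 0.35},
}
print(hard_subset(scores))  # → ['negotiate_rent', 'resolve_conflict']
```

Filtering on "hard for all models" rather than "hard on average" keeps the subset challenging regardless of which model is evaluated on it.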
EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria
Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, Juho Kim
By simply composing prompts, developers can prototype novel generative applications with Large Language Models (LLMs). To refine prototypes into products, however, developers must iteratively revise prompts by evaluating outputs to diagnose weaknesses. Formative interviews (N=8) revealed that developers invest significant effort in manually evaluating outputs as they assess context-specific and subjective criteria. We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria. By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail, and improve these based on the evaluator's feedback. A comparative study (N=12) showed that EvalLM, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions. Beyond prompts, our work can be extended to augment model evaluation and alignment in specific application contexts.
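The core idea of evaluating outputs against user-defined, natural-language criteria can be sketched as prompt assembly for an LLM-based evaluator; the template and criteria below are hypothetical illustrations, not EvalLM's actual implementation.

```python
def build_eval_prompt(criteria, output_a, output_b):
    """Assemble a pairwise-comparison prompt for an LLM-based evaluator.

    criteria: dict mapping a criterion name to its natural-language description.
    """
    criteria_text = "\n".join(
        f"- {name}: {description}" for name, description in criteria.items()
    )
    return (
        "For each criterion below, judge which output satisfies it better "
        "and explain why.\n\n"
        f"Criteria:\n{criteria_text}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}"
    )

# Hypothetical user-defined criteria for a customer-support prompt.
criteria = {
    "Tone": "Responses should sound friendly but professional.",
    "Brevity": "Responses should stay under three sentences.",
}
prompt = build_eval_prompt(criteria, "Hi! Happy to help.", "Greetings, user.")
print(prompt.splitlines()[0])
```

The evaluator's per-criterion judgments would then be aggregated across many outputs to show where a prompt excels or fails, which is the overview the system provides.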
Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator
Qinyuan Cheng, Linyang Li, Guofeng Quan, Feng Gao, Xiaofeng Mou, Xipeng Qiu
Task-Oriented Dialogue (TOD) systems have drawn increasing attention in recent studies. Current methods focus on constructing pre-trained models or fine-tuning strategies, while the evaluation of TOD is limited by a policy mismatch problem: during evaluation, user utterances come from the annotated dataset, whereas they should instead respond to the system's previous responses, which can have many alternatives beyond the annotated texts. In this work, we therefore propose an interactive evaluation framework for TOD. We first build a goal-oriented user simulator based on pre-trained models and then use it to interact with the dialogue system to generate dialogues. In addition, we introduce a sentence-level and a session-level score to measure sentence fluency and session coherence in the interactive evaluation. Experimental results show that RL-based TOD systems trained with our proposed user simulator achieve nearly 98% inform and success rates in interactive evaluation on the MultiWOZ dataset, and that the proposed scores capture response quality beyond the inform and success rates. We hope our work will encourage simulator-based interactive evaluation in the TOD task.
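The session-level bookkeeping behind inform and success rates can be sketched as follows; the dialogue records are hypothetical, and the two flags stand in for the standard MultiWOZ checks (correct entity provided; all requested attributes answered).

```python
def rates(dialogues):
    """Compute inform and success rates over a set of simulated dialogues.

    A dialogue counts toward 'inform' if the system provided a correct
    entity, and toward 'success' if it additionally answered every
    attribute the user requested (so success implies inform).
    """
    n = len(dialogues)
    inform = sum(d["correct_entity"] for d in dialogues)
    success = sum(
        d["correct_entity"] and d["all_requests_answered"] for d in dialogues
    )
    return inform / n, success / n

# Hypothetical outcomes of four simulator-driven sessions.
dialogues = [
    {"correct_entity": True,  "all_requests_answered": True},
    {"correct_entity": True,  "all_requests_answered": False},
    {"correct_entity": False, "all_requests_answered": False},
    {"correct_entity": True,  "all_requests_answered": True},
]
print(rates(dialogues))  # → (0.75, 0.5)
```

In an interactive setup these flags are computed on dialogues generated against the user simulator rather than on the static annotated corpus, which is what distinguishes this evaluation from corpus-based scoring.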
Interactive Evaluation of Dialog Track at DSTC9
Shikib Mehri, Yulan Feng, Carla Gordon, Seyed Hossein Alavi, David Traum, Maxine Eskenazi
The ultimate goal of dialog research is to develop systems that can be effectively used in interactive settings by real users. To this end, we introduced the Interactive Evaluation of Dialog Track at the 9th Dialog System Technology Challenge. This track consisted of two sub-tasks. The first sub-task involved building knowledge-grounded response generation models. The second sub-task aimed to extend dialog models beyond static datasets by assessing them in an interactive setting with real users. Our track challenged participants to develop strong response generation models and to explore strategies that extend them to back-and-forth interactions with real users. The progression from static corpora to interactive evaluation introduces unique challenges and facilitates a more thorough assessment of open-domain dialog systems. This paper provides an overview of the track, including the methodology and results. Furthermore, it provides insights into how best to evaluate open-domain dialog models.
Evaluating Multimodal Interactive Agents
Josh Abramson, Arun Ahuja, Federico Carnevale, Petko Georgiev, Alex Goldin, Alden Hung, Jessica Landon, Timothy Lillicrap, Alistair Muldal, Blake Richards, Adam Santoro, Tamara von Glehn, Greg Wayne, Nathaniel Wong, Chen Yan
Human behaviour is complex and nuanced. Consider how an act as simple as purchasing a cup of coffee involves an intricate spatio-temporal sequence of actions and perception: instructions, clarifications, and feedback weave across language, touch, and visual communicative cues, with the precise timing of each providing yet more information to our interactive partners. If we ever hope to create artificial agents that can participate in similar interactions, we must develop effective ways to evaluate their behaviour in naturalistic settings with humans. One obvious approach to evaluating interactive agent behaviour is to leverage a human's judgement during the course of their interaction with an agent. However, this incurs a high human cost, both in the number of participants required and in the total number of human hours spent, and it has no straightforward mechanism to control for human behavioural diversity. The latter problem in particular can result in highly variable metrics if human behaviour is too noisy, or imprecise metrics if human behaviour is not diverse enough. Human behaviour is also non-stationary over time, as it can be subtly impacted by agent performance, causing drift. Thus, despite being a "gold standard", the opacity of the online human-agent evaluation setting makes any generated metrics difficult to interpret and communicate, and hence difficult to optimize for. Researchers therefore typically rely on other methods of evaluation, such as validation performance of the agent's optimized objective (e.g.