Interactive Evaluation
Mind the Gap! Static and Interactive Evaluations of Large Audio Models
Minzhi Li, William Barr Held, Michael J. Ryan, Kunat Pipatanakul, Potsawee Manakul, Hao Zhu, Diyi Yang
As AI chatbots become ubiquitous, voice interaction presents a compelling way to enable rapid, high-bandwidth communication of both semantic and social signals. This has driven research into Large Audio Models (LAMs) to power voice-native experiences. However, aligning LAM development with user goals requires a clear understanding of user needs and preferences to establish reliable progress metrics. This study addresses these challenges by introducing an interactive approach to evaluate LAMs and collecting 7,500 LAM interactions from 484 participants. Through topic modeling of user queries, we identify primary use cases for audio interfaces. We then analyze user preference rankings and qualitative feedback to determine which models best align with user needs. Finally, we evaluate how well static benchmarks predict interactive performance: our analysis reveals that no individual benchmark strongly correlates with interactive results ($\tau \leq 0.33$ for all benchmarks). While combining multiple coarse-grained features yields modest predictive power ($R^2 = 0.30$), only two of the twenty datasets (one on spoken question answering and one on age prediction) show significantly positive correlations. This suggests a clear need to develop LAM evaluations that better correlate with user preferences.
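The rank-correlation check described in this abstract can be sketched in a few lines of pure Python; the model scores below are hypothetical illustrations, not the paper's data, and the function is a plain Kendall tau-a without tie handling.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: rank correlation between two equal-length score lists."""
    assert len(x) == len(y)
    concordant = discordant = 0
    # Compare every pair of items: do the two rankings order them the same way?
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for five LAMs: a static benchmark vs. an
# interactive preference score (illustrative numbers only).
static_scores      = [0.82, 0.75, 0.71, 0.66, 0.60]
interactive_scores = [0.40, 0.55, 0.35, 0.50, 0.30]
print(kendall_tau(static_scores, interactive_scores))  # → 0.4
```

A weak tau like this, despite a decent static score spread, is the kind of mismatch between static and interactive results the abstract reports.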
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, Maarten Sap
Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-Eval. With SOTOPIA, we find significant differences between these models in terms of their social intelligence, and we identify a subset of SOTOPIA scenarios, SOTOPIA-hard, that is generally challenging for all models. We find that on this subset, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. These findings demonstrate SOTOPIA's promise as a general platform for research on evaluating and improving social intelligence in artificial agents.
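The selection of a generally hard scenario subset, in the spirit of SOTOPIA-hard, can be illustrated with a small sketch; the scenario names, model names, scores, and threshold below are all hypothetical, not taken from the paper.

```python
def hard_subset(scores, threshold=0.5):
    """Return scenarios on which every model's goal completion falls below threshold.

    scores: dict mapping scenario name -> {model name: goal completion in [0, 1]}.
    """
    return sorted(
        scenario
        for scenario, per_model in scores.items()
        if all(v < threshold for v in per_model.values())
    )

# Hypothetical per-scenario goal-completion rates for two models.
scores = {
    "negotiate_rent":   {"model_a": 0.30, "model_b": 0.20},
    "plan_party":       {"model_a": 0.80, "model_b": 0.60},
    "resolve_conflict": {"model_a": 0.40, "model_b": 0.35},
}
print(hard_subset(scores))  # → ['negotiate_rent', 'resolve_conflict']
```

Filtering on "hard for all models" rather than "hard on average" keeps the subset challenging regardless of which model is evaluated on it.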
EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria
Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, Juho Kim
By simply composing prompts, developers can prototype novel generative applications with Large Language Models (LLMs). To refine prototypes into products, however, developers must iteratively revise prompts by evaluating outputs to diagnose weaknesses. Formative interviews (N=8) revealed that developers invest significant effort in manually evaluating outputs as they assess context-specific and subjective criteria. We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria. By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail, and improve these based on the evaluator's feedback. A comparative study (N=12) showed that EvalLM, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions. Beyond prompts, our work can be extended to augment model evaluation and alignment in specific application contexts.
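The core idea of evaluating outputs against user-defined, natural-language criteria can be sketched as prompt assembly for an LLM-based evaluator; the template and criteria below are hypothetical illustrations, not EvalLM's actual implementation.

```python
def build_eval_prompt(criteria, output_a, output_b):
    """Assemble a pairwise-comparison prompt for an LLM-based evaluator.

    criteria: dict mapping a criterion name to its natural-language description.
    """
    criteria_text = "\n".join(
        f"- {name}: {description}" for name, description in criteria.items()
    )
    return (
        "For each criterion below, judge which output satisfies it better "
        "and explain why.\n\n"
        f"Criteria:\n{criteria_text}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}"
    )

# Hypothetical user-defined criteria for a customer-support prompt.
criteria = {
    "Tone": "Responses should sound friendly but professional.",
    "Brevity": "Responses should stay under three sentences.",
}
prompt = build_eval_prompt(criteria, "Hi! Happy to help.", "Greetings, user.")
print(prompt.splitlines()[0])
```

The evaluator's per-criterion judgments would then be aggregated across many outputs to show where a prompt excels or fails, which is the overview the system provides.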
Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator
Qinyuan Cheng, Linyang Li, Guofeng Quan, Feng Gao, Xiaofeng Mou, Xipeng Qiu
Task-Oriented Dialogue (TOD) systems have drawn increasing attention in recent studies. Current methods focus on constructing pre-trained models or fine-tuning strategies, while the evaluation of TOD is limited by a policy mismatch problem: during evaluation, user utterances come from the annotated dataset, whereas they should instead respond to the system's previous responses, which can have many alternatives beyond the annotated texts. In this work, we therefore propose an interactive evaluation framework for TOD. We first build a goal-oriented user simulator based on pre-trained models and then use it to interact with the dialogue system to generate dialogues. In addition, we introduce a sentence-level and a session-level score to measure sentence fluency and session coherence in the interactive evaluation. Experimental results show that RL-based TOD systems trained with our proposed user simulator achieve nearly 98% inform and success rates in interactive evaluation on the MultiWOZ dataset, and that the proposed scores capture response quality beyond the inform and success rates. We hope our work will encourage simulator-based interactive evaluation in the TOD task.
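The session-level bookkeeping behind inform and success rates can be sketched as follows; the dialogue records are hypothetical, and the two flags stand in for the standard MultiWOZ checks (correct entity provided; all requested attributes answered).

```python
def rates(dialogues):
    """Compute inform and success rates over a set of simulated dialogues.

    A dialogue counts toward 'inform' if the system provided a correct
    entity, and toward 'success' if it additionally answered every
    attribute the user requested (so success implies inform).
    """
    n = len(dialogues)
    inform = sum(d["correct_entity"] for d in dialogues)
    success = sum(
        d["correct_entity"] and d["all_requests_answered"] for d in dialogues
    )
    return inform / n, success / n

# Hypothetical outcomes of four simulator-driven sessions.
dialogues = [
    {"correct_entity": True,  "all_requests_answered": True},
    {"correct_entity": True,  "all_requests_answered": False},
    {"correct_entity": False, "all_requests_answered": False},
    {"correct_entity": True,  "all_requests_answered": True},
]
print(rates(dialogues))  # → (0.75, 0.5)
```

In an interactive setup these flags are computed on dialogues generated against the user simulator rather than on the static annotated corpus, which is what distinguishes this evaluation from corpus-based scoring.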
Interactive Evaluation of Dialog Track at DSTC9
Shikib Mehri, Yulan Feng, Carla Gordon, Seyed Hossein Alavi, David Traum, Maxine Eskenazi
The ultimate goal of dialog research is to develop systems that can be effectively used in interactive settings by real users. To this end, we introduced the Interactive Evaluation of Dialog Track at the 9th Dialog System Technology Challenge. This track consisted of two sub-tasks. The first sub-task involved building knowledge-grounded response generation models. The second sub-task aimed to extend dialog models beyond static datasets by assessing them in an interactive setting with real users. Our track challenged participants to develop strong response generation models and to explore strategies that extend them to back-and-forth interactions with real users. The progression from static corpora to interactive evaluation introduces unique challenges and facilitates a more thorough assessment of open-domain dialog systems. This paper provides an overview of the track, including the methodology and results. Furthermore, it provides insights into how best to evaluate open-domain dialog models.
Evaluating Multimodal Interactive Agents
Josh Abramson, Arun Ahuja, Federico Carnevale, Petko Georgiev, Alex Goldin, Alden Hung, Jessica Landon, Timothy Lillicrap, Alistair Muldal, Blake Richards, Adam Santoro, Tamara von Glehn, Greg Wayne, Nathaniel Wong, Chen Yan
Human behaviour is complex and nuanced. Consider how an act as simple as purchasing a cup of coffee involves an intricate spatio-temporal sequence of actions and perception: instructions, clarifications, and feedback weave across language, touch, and visual communicative cues, with the precise timing of each providing yet more information to our interactive partners. If we ever hope to create artificial agents that can participate in similar interactions, we must develop effective ways to evaluate their behaviour in naturalistic settings with humans. One obvious approach to evaluating interactive agent behaviour is to leverage a human's judgement during the course of their interaction with an agent. However, this incurs a high human cost, both in the number of participants required and in the total number of human hours spent, and it has no straightforward mechanism to control for human behavioural diversity. The latter problem in particular can result in highly variable metrics if human behaviour is too noisy, or imprecise metrics if human behaviour is not diverse enough. Human behaviour is also non-stationary over time, as it can be subtly impacted by agent performance, causing drift. Thus, despite being a "gold standard", the opacity of the online human-agent evaluation setting makes any generated metrics difficult to interpret and communicate, and hence difficult to optimize for. Researchers therefore typically rely on other methods of evaluation, such as validation performance of the agent's optimized objective (e.g.