Evaluating Multimodal Interactive Agents

Josh Abramson, Arun Ahuja, Federico Carnevale, Petko Georgiev, Alex Goldin, Alden Hung, Jessica Landon, Timothy Lillicrap, Alistair Muldal, Blake Richards, Adam Santoro, Tamara von Glehn, Greg Wayne, Nathaniel Wong, Chen Yan

arXiv.org Artificial Intelligence 

Human behaviour is complex and nuanced. Consider how an act as simple as purchasing a cup of coffee involves an intricate spatio-temporal sequence of actions and perception: instructions, clarifications, and feedback weave across language, touch, and visual communicative cues, with the precise timing of each providing yet more information to our interactive partners. If we ever hope to create artificial agents that can participate in similar interactions, we must develop effective ways to evaluate their behaviour in naturalistic settings with humans. One obvious approach to evaluating interactive agent behaviour is to leverage a human's judgement during the course of their interaction with an agent. However, this incurs a high human cost, both in the number of participants required and in the total number of human hours spent, and offers no straightforward mechanism to control for human behavioural diversity. The latter problem in particular can result in highly variable metrics if human behaviour is too noisy, or imprecise metrics if human behaviour is not diverse enough. Human behaviour is also non-stationary over time, as it can be subtly influenced by agent performance, causing drift. Thus, despite being a "gold standard", the opacity of the online human-agent evaluation setting makes any resulting metrics difficult to interpret and communicate, and hence difficult to optimize for. Researchers therefore typically rely on other methods of evaluation, such as validation performance on the agent's optimized objective (e.g.
