Evaluating Human-Language Model Interaction
Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael Bernstein, Percy Liang
arXiv.org Artificial Intelligence
Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.
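As a concrete illustration of what "capturing the interactive process" could mean in practice, the sketch below shows one possible way to log a full interaction trace and first-person survey responses alongside the final output. All class, field, and metric names here are illustrative assumptions, not the authors' released HALIE implementation.

```python
# Illustrative sketch only: names and fields are assumptions, not the HALIE codebase.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """A single step in a human-LM interaction (e.g., a user query or a model reply)."""
    role: str          # "user" or "model"
    text: str
    timestamp: float   # seconds since session start


@dataclass
class SurveyResponse:
    """First-person, post-task ratings (e.g., on a 1-5 Likert scale)."""
    quality: int
    enjoyment: int
    ownership: int


@dataclass
class InteractionSession:
    """What an interactive evaluation records, beyond the final output alone."""
    task: str                               # e.g., "question answering", "crossword"
    model: str                              # e.g., a GPT-3 variant or Jurassic-1
    turns: List[Turn] = field(default_factory=list)
    final_output: str = ""
    survey: Optional[SurveyResponse] = None

    def process_metrics(self) -> dict:
        """Process-level statistics that a non-interactive benchmark would miss."""
        user_turns = [t for t in self.turns if t.role == "user"]
        return {
            "num_queries": len(user_turns),
            "session_length_s": self.turns[-1].timestamp if self.turns else 0.0,
        }
```

Under this framing, a non-interactive benchmark would score only `final_output`, whereas an interactive evaluation can additionally aggregate `process_metrics()` and the survey fields across participants.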
Jan-5-2024
- Country:
  - North America > United States (1.00)
  - Europe > United Kingdom > England (0.14)
- Genre:
  - Questionnaire & Opinion Survey (1.00)
  - Research Report > Experimental Study (1.00)
  - Research Report > New Finding (1.00)