Williams, Jason D.
DELPHI: Data for Evaluating LLMs' Performance in Handling Controversial Issues
Sun, David Q., Abzaliev, Artem, Kotek, Hadas, Xiu, Zidi, Klein, Christopher, Williams, Jason D.
Controversy is a reflection of our zeitgeist and an important aspect of any discourse. The rise of large language models (LLMs) as conversational systems has increased public reliance on these systems for answers to a wide range of questions. Consequently, it is crucial to systematically examine how these models respond to questions pertaining to ongoing debates. However, few datasets exist that provide human-annotated labels reflecting contemporary discussions. To foster research in this area, we propose a novel construction of a controversial questions dataset, expanding upon the publicly released Quora Question Pairs Dataset. This dataset presents challenges concerning knowledge recency, safety, fairness, and bias. We evaluate different LLMs using a subset of this dataset, illuminating how they handle controversial issues and the stances they adopt. This research ultimately contributes to our understanding of LLMs' interaction with controversial issues, paving the way for improvements in their comprehension and handling of complex societal debates.
Intelligent Assistant Language Understanding On Device
Aas, Cecilia, Abdelsalam, Hisham, Belousova, Irina, Bhargava, Shruti, Cheng, Jianpeng, Daland, Robert, Driesen, Joris, Flego, Federico, Guigue, Tristan, Johannsen, Anders, Lal, Partha, Lu, Jiarui, Moniz, Joel Ruben Antony, Perkins, Nathan, Piraviperumal, Dhivya, Pulman, Stephen, Séaghdha, Diarmuid Ó, Sun, David Q., Torr, John, Del Vecchio, Marco, Wacker, Jay, Williams, Jason D., Yu, Hong
It has recently become feasible to run personal digital assistants on phones and other personal devices. In this paper we describe a design for a natural language understanding system that runs on device. In comparison to a server-based assistant, this system is more private, more reliable, faster, more expressive, and more accurate. We describe what led to key choices about architecture and technologies. For example, some approaches in the dialog systems literature are difficult to maintain over time in a deployment setting. We hope that sharing learnings from our practical experiences may help inform future work in the research community.
Feedback Effect in User Interaction with Intelligent Assistants: Delayed Engagement, Adaption and Drop-out
Xiu, Zidi, Cheng, Kai-Chen, Sun, David Q., Lu, Jiannan, Kotek, Hadas, Zhang, Yuhan, McCarthy, Paul, Klein, Christopher, Pulman, Stephen, Williams, Jason D.
With the growing popularity of intelligent assistants (IAs), evaluating IA quality becomes an increasingly active field of research. This paper identifies and quantifies the feedback effect, a novel component in IA-user interactions - how the capabilities and limitations of the IA influence user behavior over time. First, through an observational study, we demonstrate that unhelpful responses from the IA cause users to delay or reduce subsequent interactions in the short term. Next, we expand the time horizon to examine behavior changes and show that as users discover the limitations of the IA's understanding and functional capabilities, they learn to adjust the scope and wording of their requests to increase the likelihood of receiving a helpful response from the IA. Our findings highlight the impact of the feedback effect at both the micro and meso levels. We further discuss its macro-level consequences: unsatisfactory interactions continuously reduce the likelihood and diversity of future user engagements in a feedback loop.
Active Learning for Domain Classification in a Commercial Spoken Personal Assistant
Chen, Xi C., Sagar, Adithya, Kao, Justine T., Li, Tony Y., Klein, Christopher, Pulman, Stephen, Garg, Ashish, Williams, Jason D.
We describe a method for selecting relevant new training data for the LSTM-based domain selection component of our personal assistant system. Adding more annotated training data for any ML system typically improves accuracy, but only if it provides examples not already adequately covered in the existing data. However, obtaining, selecting, and labeling relevant data is expensive. This work presents a simple technique that automatically identifies new helpful examples suitable for human annotation. Our experimental results show that the proposed method, compared with random-selection and entropy-based methods, leads to higher accuracy improvements given a fixed annotation budget. Although developed and tested in the setting of a commercial intelligent assistant, the technique is of wider applicability.
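The abstract compares the proposed selection technique against random and entropy-based baselines under a fixed annotation budget. As background, the entropy baseline can be sketched as follows; this is a generic illustration, not the paper's method, and the utterances, probabilities, and function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(candidates, budget):
    """Rank unlabeled utterances by predictive entropy and keep the
    top `budget` for human annotation (the entropy-based baseline
    the abstract compares against)."""
    scored = sorted(candidates, key=lambda c: entropy(c["probs"]), reverse=True)
    return [c["text"] for c in scored[:budget]]

# Toy pool of utterances with hypothetical domain-classifier outputs.
pool = [
    {"text": "play some jazz",      "probs": [0.97, 0.02, 0.01]},  # confident
    {"text": "set a timer for tea", "probs": [0.50, 0.30, 0.20]},  # uncertain
    {"text": "what's the weather",  "probs": [0.90, 0.05, 0.05]},
]
print(select_for_annotation(pool, budget=1))  # the most uncertain utterance
```

The idea is that examples the current classifier is least sure about are the ones most likely to add information the existing training data does not already cover.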
NAIL: A General Interactive Fiction Agent
Hausknecht, Matthew, Loynd, Ricky, Yang, Greg, Swaminathan, Adith, Williams, Jason D.
Interactive Fiction (IF) games are complex textual decision making problems. This paper introduces NAIL, an autonomous agent for general parser-based IF games. NAIL won the 2018 Text Adventure AI Competition, where it was evaluated on twenty unseen games. This paper describes the architecture, development, and insights underpinning NAIL's performance.
Sample-efficient Deep Reinforcement Learning for Dialog Control
Asadi, Kavosh, Williams, Jason D.
Representing a dialog policy as a recurrent neural network (RNN) is attractive because it handles partial observability, infers a latent representation of state, and can be optimized with supervised learning (SL) or reinforcement learning (RL). For RL, a policy gradient approach is natural, but sample inefficient. In this paper, we present three methods for reducing the number of dialogs required to optimize an RNN-based dialog policy with RL. The key idea is to maintain a second RNN which predicts the value of the current policy, and to apply experience replay to both networks. On two tasks, these methods reduce the number of dialogs/episodes required by about a third compared with standard policy gradient methods.
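The two ingredients named in the abstract, a shared replay buffer and a learned value baseline, can be sketched in skeletal form; the RNNs themselves are abstracted away, and the class and function names below are illustrative, not from the paper.

```python
import random

class ReplayBuffer:
    """Stores completed episodes so both the policy network and the
    value network can be re-trained on past experience, not only the
    most recent dialog."""
    def __init__(self, capacity=1000):
        self.buffer, self.capacity = [], capacity

    def add(self, episode):
        self.buffer.append(episode)
        self.buffer = self.buffer[-self.capacity:]  # drop oldest past capacity

    def sample(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))

def advantage(observed_return, value_estimate):
    """Policy-gradient update weight: observed return minus the value
    network's baseline prediction, which lowers gradient variance and
    so reduces the number of dialogs needed."""
    return observed_return - value_estimate
```

In a training loop, each finished dialog would be added to the buffer, a minibatch sampled to update the value RNN toward observed returns, and the policy RNN updated with gradients weighted by `advantage`.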
The Dialog State Tracking Challenge Series
Williams, Jason D. (Microsoft Corporation) | Henderson, Matthew (Cambridge University) | Raux, Antoine (Lenovo Labs) | Thomson, Blaise (VocalIQ, Ltd) | Black, Alan (Carnegie Mellon University) | Ramachandran, Deepak (Nuance Communications, Inc.)
Dialog state tracking is difficult because automatic speech recognition (ASR) and spoken language understanding (SLU) errors are common and can cause the system to misunderstand the user. At the same time, state tracking is crucial because the system relies on the estimated dialog state to choose actions -- for example, which restaurants to suggest. Figure 1 shows an illustration of the dialog state tracking task. Historically, dialog state tracking has been done with handcrafted rules. More recently, statistical methods have been found to be superior by effectively overcoming some SLU errors, resulting in better dialogs. Despite this, direct comparisons between methods have not been possible because past studies use different domains, system components, and evaluation measures, hindering progress.