Collaborating Authors

Roller, Stephen


Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System

arXiv.org Artificial Intelligence

End-to-end task-oriented dialogue (TOD) systems have achieved promising performance by leveraging the sophisticated natural language understanding and generation capabilities of pre-trained models. This work adds flexibility to TOD systems through a simple cache, which makes it possible to update the systems dynamically and to handle both existing and unseen dialogue scenarios. Towards this end, we first fine-tune a retrieval module to effectively retrieve the most relevant information entries from the cache. We then train end-to-end TOD models that can refer to and ground on both dialogue history and retrieved information during TOD generation. The cache is straightforward to construct, and the backbone models of TOD systems are compatible with existing pre-trained generative models. Extensive experiments demonstrate the superior performance of our framework, with a notable improvement in non-empty joint goal accuracy of 6.7% over strong baselines.
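The retrieve-then-ground step can be sketched as follows. This is a toy bag-of-words scorer over a hand-written cache, not the paper's fine-tuned dense retriever; the cache entries and query are illustrative.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a
    # fine-tuned dense retriever as described in the abstract.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(cache, history, k=2):
    """Return the k cache entries most similar to the dialogue history."""
    q = embed(history)
    return sorted(cache, key=lambda e: cosine(q, embed(e)), reverse=True)[:k]

# Hypothetical cache of structured entries; retrieved entries would be
# concatenated with the dialogue history before generation.
cache = [
    "restaurant: Curry Garden | food: indian | area: centre",
    "hotel: Acorn House | stars: 4 | area: north",
    "restaurant: Pizza Hut | food: italian | area: south",
]
top = retrieve(cache, "i want an indian restaurant in the centre", k=1)
```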


Leveraging Implicit Feedback from Deployment Data in Dialogue

arXiv.org Artificial Intelligence

We study improving social conversational agents by learning from natural dialogue between users and a deployed model, without extra annotations. To implicitly measure the quality of a machine-generated utterance, we leverage signals such as the length, sentiment, and reaction of future human utterances in the collected dialogue episodes. Our experiments use the publicly released deployment data from BlenderBot (Xu et al., 2023). Human evaluation indicates improvements in our new models over baseline responses; however, we find that some proxy signals can also lead to more generations with undesirable properties. For example, optimizing for conversation length can lead to more controversial or unfriendly generations than the baseline, whereas optimizing for positive sentiment or reaction decreases these behaviors.
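A minimal sketch of turning such implicit signals into a scalar proxy reward for a bot utterance, scored from the user's next turn. The lexicon, weights, and normalization constants here are invented for illustration and are not from the paper.

```python
# Tiny illustrative sentiment lexicons (not the paper's classifiers).
POSITIVE = {"great", "thanks", "love", "awesome", "haha", "nice"}
NEGATIVE = {"boring", "wrong", "stop", "hate", "rude"}

def proxy_reward(next_user_turn, w_len=0.5, w_sent=0.5):
    """Score a bot utterance by implicit signals in the user's reply:
    reply length (a crude engagement signal) and lexicon sentiment."""
    tokens = next_user_turn.lower().split()
    length_signal = min(len(tokens) / 20.0, 1.0)
    sent = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    sentiment_signal = max(min(sent / 3.0, 1.0), -1.0)
    return w_len * length_signal + w_sent * sentiment_signal
```

As the abstract cautions, which signal you optimize matters: a length-only reward (w_sent=0) would happily score a long hostile reply above a short friendly one.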


A Theory on Adam Instability in Large-Scale Machine Learning

arXiv.org Artificial Intelligence

Training instability reported by Chowdhery et al. [2022] is an interesting phenomenon that has only been reported for large language models trained on the order of a trillion tokens, posing a threat to further scaling of AI systems. Chowdhery et al. [2022] observed dozens of spikes in the loss curve throughout training. To mitigate the issue, they restarted training from a checkpoint roughly 100 steps before the spike started, and skipped roughly 200-500 data batches, in order to exclude batches that were seen right before and during the spike. After this intervention, the loss spike did not recur. The spikes were also not observed when the skipped data was fed through the model again after the aforementioned mitigation, which implies that the data itself did not cause the spike; rather, the spike arose from an interference of the data batch with the state of the model training run. The purpose of this work is to rigorously reproduce the experiment with a different hardware and software setup, come up with an explanation for the observed behavior supported by empirical evidence and theoretical arguments, and propose alternative ways of mitigating the issue. Loss spikes are difficult to study because any reproduction of these spikes at a smaller scale is not necessarily caused by, or remediated by, the same factors as at larger scales. We therefore analyze large-scale language modeling experiments, training four models between 7 billion and 546 billion parameters. The models are decoder-only transformers [Brown et al., 2020, Smith et al., 2022] with different depths and embedding dimensions, trained using the AdamW [Loshchilov and Hutter, 2017] algorithm with a linear learning rate schedule.
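The skip-the-spike recipe above can be sketched as a simple detector over a per-batch loss stream: flag batches whose loss jumps far above a rolling baseline and mark them (plus a few following batches) for skipping. The window size, spike factor, and skip count here are illustrative placeholders, not the values used by Chowdhery et al.

```python
def filter_spike_batches(losses, window=5, spike_factor=3.0, skip=2):
    """Return the set of batch indices to skip: each batch whose loss
    exceeds spike_factor * (rolling mean), plus the next skip-1 batches,
    mimicking the restart-and-skip mitigation described above."""
    skipped, recent = set(), []
    i = 0
    while i < len(losses):
        baseline = sum(recent) / len(recent) if recent else losses[i]
        if recent and losses[i] > spike_factor * baseline:
            skipped.update(range(i, min(i + skip, len(losses))))
            i += skip  # jump past the offending batches
            continue
        recent.append(losses[i])
        if len(recent) > window:
            recent.pop(0)
        i += 1
    return skipped
```

In a real run this would be paired with reloading a checkpoint taken ~100 steps before the spike, which this sketch omits.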


Scaling Laws for Generative Mixed-Modal Language Models

arXiv.org Artificial Intelligence

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion parameters, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. We also report four empirical phenomena observed during training, including emergent coordinate-ascent-style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models with unique distributional properties. Generative language models have been developed for a wide range of data modalities, including natural language text (Brown et al., 2020), code (Chen et al., 2021; Fried et al., 2022), images (Ramesh et al., 2021; Yasunaga et al., 2022), and molecules or proteins (Chilingaryan et al., 2022; Hsu et al., 2022). Recent work has also introduced unified models (Aghajanyan et al., 2022; Reed et al., 2022; Wang et al., 2022; Zellers et al., 2022) that can simultaneously model multiple modalities. One advantage of generative modeling in these cases is that the models scale well in practice; adding data, compute, or parameters typically improves model quality.
These scaling trends have been carefully studied for uni-modal models (Kaplan et al., 2020; Hoffmann et al., 2022) and some recent work focuses on pairs of modalities (Droppo & Elibol, 2021; Henighan et al., 2020).
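The uni-modal laws cited above typically take a power-law form such as L(N) = (N_c / N)^alpha (Kaplan et al., 2020), which can be fit by linear regression in log-log space. The sketch below fits that simple uni-modal form to synthetic data; the paper's mixed-modal laws add per-modality and interaction terms that this omits.

```python
import math

def fit_power_law(ns, losses):
    """Least-squares fit of L(N) = (Nc / N)**alpha in log-log space:
    log L = alpha*log Nc - alpha*log N, i.e. a line with slope -alpha."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    alpha = -slope
    log_nc = (my - slope * mx) / alpha  # intercept = alpha * log Nc
    return alpha, math.exp(log_nc)

# Synthetic losses generated from alpha=0.5, Nc=1e6 for illustration.
alpha, nc = fit_power_law([1e6, 4e6, 1.6e7], [1.0, 0.5, 0.25])
```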


Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

arXiv.org Artificial Intelligence

At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv:1603.08023), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that different methods are best depending on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice on when to use each method, and to possible future directions.
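One concrete form the statistical-sensitivity question takes: given n pairwise judgments between two models, how many wins does one model need before the difference is significant? A minimal exact two-sided binomial test (a stand-in for the sensitivity analyses the abstract alludes to, not the paper's actual methodology):

```python
from math import comb

def two_sided_binomial_p(wins, n):
    """Exact p-value under H0: each of the n pairwise judgments is a
    fair coin flip (the two models are tied)."""
    k = max(wins, n - wins)                      # more extreme side
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 9 wins out of 10 judgments is significant at p < 0.05, while 5 out of 10 obviously is not; methods with higher annotator agreement reach such thresholds with fewer annotation hours.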


Not All Memories are Created Equal: Learning to Forget by Expiring

arXiv.org Artificial Intelligence

Attention mechanisms have shown promising results in sequence modeling tasks that require long-term memory. Recent work investigated mechanisms to reduce the computational cost of preserving and storing memories. However, not all content in the past is equally important to remember. We propose Expire-Span, a method that learns to retain the most important information and expire the irrelevant information. This forgetting of memories enables Transformers to scale to attend over tens of thousands of previous timesteps efficiently, as not all states from previous timesteps are preserved. We demonstrate that Expire-Span can help models identify and retain critical information, and show it can achieve strong performance on reinforcement learning tasks specifically designed to challenge this functionality. Next, we show that Expire-Span can scale to memories that are tens of thousands in size, setting a new state of the art on extremely long-context tasks such as character-level language modeling and a frame-by-frame moving-objects task. Finally, we analyze the efficiency of Expire-Span compared to existing approaches and demonstrate that it trains faster and uses less memory.
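The expiration mechanism can be sketched as a soft attention mask: memory i gets a learned span e_i, and its mask stays at 1 until its age exceeds e_i, then ramps linearly down to 0 over R steps so the decision stays differentiable. Here the spans are given as inputs for illustration; in the model they are predicted per hidden state.

```python
def expire_mask(spans, t, ramp=4.0):
    """Soft mask for memories at current time t. Memory i (created at
    time i) with span e_i gets mask clamp(1 + (e_i - age)/ramp, 0, 1):
    1 while age <= e_i, linearly decaying to 0 over `ramp` steps after."""
    masks = []
    for i, e in enumerate(spans):
        age = t - i
        m = min(max((e - age) / ramp + 1.0, 0.0), 1.0)
        masks.append(m)
    return masks
```

States whose mask reaches 0 can be dropped from the cache entirely, which is where the memory and speed savings come from.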


Open-Domain Conversational Agents: Current Progress, Open Problems, and Future Directions

arXiv.org Artificial Intelligence

The breadth of possible conversation topics and lack of a well-defined objective make it challenging to define a roadmap towards training a good conversational agent, or chatbot. Despite recent progress across the board (Adiwardana et al., 2020; Roller et al., 2020), conversational agents are still incapable of carrying an open-domain conversation that remains interesting, consistent, accurate, and reliably well-behaved (e.g., not offensive) while navigating a variety of topics with entertaining wit and knowledge while making others feel heard. Further, we discuss only open academic research with reproducible published results, hence we will not address much of the considerable work that has been put into building commercial systems, where methods, data and results are not in the public domain. Finally, given that we focus on open-domain conversation, we do not focus on specific goal-oriented techniques; we also do not cover spoken dialogue in this work, focusing on text and image input/output only. For more general recent surveys, see Gao et al. (2019); Jurafsky and Martin (2019); Huang, Zhu, and Gao (2020). Traditional task-oriented dialogue systems rely on slot-filling and structured modules (e.g., Young et al. (2013); Gao et al. (2019); Jurafsky and Martin (2019)).


Neural Text Generation with Unlikelihood Training

arXiv.org Machine Learning

Neural text generation is a key tool in natural language applications, but it is well known there are major problems at its core. In particular, standard likelihood training and decoding leads to dull and repetitive responses. While some post-hoc fixes have been proposed, in particular top-k and nucleus sampling, they do not address the fact that the token-level probabilities predicted by the model itself are poor. In this paper we show that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences that contain repeats and frequent words unlike the human training distribution. We propose a new objective, unlikelihood training, which forces unlikely generations to be assigned lower probability by the model. We show that both token and sequence level unlikelihood training give less repetitive, less dull text while maintaining perplexity, giving far superior generations using standard greedy or beam search. Our approach provides a strong alternative to traditional training.
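At the token level, the unlikelihood objective adds a term -log(1 - p(c)) for each negative candidate c (e.g. tokens already repeated from the context) on top of the usual negative log-likelihood of the target. A minimal sketch over a toy probability table standing in for the model's softmax output:

```python
import math

def token_unlikelihood_loss(probs, target, negatives, alpha=1.0):
    """NLL of the target plus an unlikelihood penalty that grows as the
    model puts probability mass on any negative candidate token."""
    nll = -math.log(probs[target])
    ul = -sum(math.log(1.0 - probs[c]) for c in negatives)
    return nll + alpha * ul

# Toy next-token distribution; "the" was just generated, so we penalize
# repeating it by listing it as a negative candidate.
probs = {"the": 0.5, "cat": 0.3, "sat": 0.2}
loss = token_unlikelihood_loss(probs, "cat", negatives=["the"])
```

With an empty negative set this reduces exactly to maximum-likelihood training; the sequence-level variant in the paper instead draws negatives from repeating n-grams in the model's own completions.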


Relations such as Hypernymy: Identifying and Exploiting Hearst Patterns in Distributional Vectors for Lexical Entailment

arXiv.org Artificial Intelligence

We consider the task of predicting lexical entailment using distributional vectors. We perform a novel qualitative analysis of one existing model which was previously shown to only measure the prototypicality of word pairs. We find that the model strongly learns to identify hypernyms using Hearst patterns, which are well known to be predictive of lexical relations. We present a novel model which exploits this behavior as a method of feature extraction in an iterative procedure similar to Principal Component Analysis. Our model combines the extracted features with the strengths of other proposed models in the literature, and matches or outperforms prior work on multiple data sets.
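The Hearst patterns in question are lexico-syntactic templates like "X such as Y", long known to signal hypernymy. A minimal regex illustration of the signal the analysis finds the model exploiting; real extraction handles noun-phrase chunks, plurals, and enumerated lists, which single-word regexes here do not.

```python
import re

# (hypernym, hyponym) order differs per pattern.
HYPER_FIRST = [
    r"(\w+) such as (\w+)",     # "animals such as dogs"
    r"(\w+) including (\w+)",   # "fruits including apples"
]
HYPO_FIRST = r"(\w+) and other (\w+)"  # "cats and other pets"

def hearst_pairs(text):
    """Extract (hyponym, hypernym) pairs via a few classic Hearst patterns."""
    pairs = set()
    for pat in HYPER_FIRST:
        for hyper, hypo in re.findall(pat, text):
            pairs.add((hypo, hyper))
    for hypo, hyper in re.findall(HYPO_FIRST, text):
        pairs.add((hypo, hyper))
    return pairs
```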


Design and Evaluation of Afterthought, A System that Automatically Creates Highlight Cinematics for 3D Games

AAAI Conferences

Online multiplayer gaming has emerged as a popular form of entertainment. Over the course of a multiplayer game, player interactions may result in interesting emergent narratives that go unnoticed. Afterthought is a system that monitors player activity, recognizes instances of story elements in gameplay, and renders cinematic highlights of the story-oriented game play, allowing players to view these emergent narratives after completing their gameplay session. This paper describes Afterthought's implementation as well as an empirical human-subjects evaluation of the effectiveness of the cinematics it creates.