Reinforcement learning (RL) is an effective approach to learn an optimal dialog policy for task-oriented visual dialog systems. A common practice is to apply RL on a neural sequence-to-sequence (seq2seq) framework with the action space being the output vocabulary in the decoder. However, it is difficult to design a reward function that can achieve a balance between learning an effective policy and generating a natural dialog response. This paper proposes a novel framework that alternatively trains a RL policy for image guessing and a supervised seq2seq model to improve dialog generation quality. We evaluate our framework on the GuessWhich task and the framework achieves the state-of-the-art performance in both task completion and dialog quality.
Recent advances in deep learning have resulted in a resurgence in the popularity of natural language generation (NLG). Many deep learning based models, including recurrent neural networks and generative adversarial networks, have been proposed and applied to generating various types of text. Despite the fast development of methods, how to better evaluate the quality of these natural language generators remains a significant challenge. We conduct an in-depth empirical study to evaluate the existing evaluation methods for natural language generation. We compare human-based evaluators with a variety of automated evaluation procedures, including discriminative evaluators that measure how well the generated text can be distinguished from human-written text, as well as text overlap metrics that measure how similar the generated text is to human-written references. We measure to what extent these different evaluators agree on the ranking of a dozen of state-of-the-art generators for online product reviews. We find that human evaluators do not correlate well with discriminative evaluators, leaving a bigger question of whether adversarial accuracy is the correct objective for natural language generation. In general, distinguishing machine-generated text is a challenging task even for human evaluators, and their decisions tend to correlate better with text overlap metrics. We also find that diversity is an intriguing metric that is indicative of the assessments of different evaluators.
Inspired by the free-energy brain theory, which implies that human visual system (HVS) tends to reduce uncertainty and restore perceptual details upon seeing a distorted image, we propose restorative adversarial net (RAN), a GAN-based model for no-reference image quality assessment (NR-IQA). RAN, which mimics the process of HVS, consists of three components: a restorator, a discriminator and an evaluator. The restorator restores and reconstructs input distorted image patches, while the discriminator distinguishes the reconstructed patches from the pristine distortion-free patches. After restoration, we observe that the perceptual distance between the restored and the distorted patches is monotonic with respect to the distortion level. We further define Gain of Restoration (GoR) based on this phenomenon. The evaluator predicts perceptual score by extracting feature representations from the distorted and restored patches to measure GoR. Eventually, the quality score of an input image is estimated by weighted sum of the patch scores. Experimental results on Waterloo Exploration, LIVE and TID2013 show the effectiveness and generalization ability of RAN compared to the state-of-the-art NR-IQA models.
We study response generation for open domain conversation in chatbots. Existing methods assume that words in responses are generated from an identical vocabulary regardless of their inputs, which not only makes them vulnerable to generic patterns and irrelevant noise, but also causes a high cost in decoding. We propose a dynamic vocabulary sequence-to-sequence (DVS2S) model which allows each input to possess their own vocabulary in decoding. In training, vocabulary construction and response generation are jointly learned by maximizing a lower bound of the true objective with a Monte Carlo sampling method. In inference, the model dynamically allocates a small vocabulary for an input with the word prediction model, and conducts decoding only with the small vocabulary. Because of the dynamic vocabulary mechanism, DVS2S eludes many generic patterns and irrelevant words in generation, and enjoys efficient decoding at the same time. Experimental results on both automatic metrics and human annotations show that DVS2S can significantly outperform state-of-the-art methods in terms of response quality, but only requires 60% decoding time compared to the most efficient baseline.
The paper discusses what kind of generation model is suitable for spoken dialogue responses. We describe different existing models of generation, and compare them from the point of view of spoken dialogue systems. We introduce a model of generation based on new information focus, and argue that this addresses the communicative requirements of flexible spoken dialogue systems (incrementality, immediacy and interactivity). We discuss the relationship between the dialogue manager and the generator, and what kind of interface they should share. We describe a flexible shallow generation approach which combines template-based generation with a pipeline of distinct processing levels.