



f14bc21be7eaeed046fed206a492e652-Supplemental.pdf

Neural Information Processing Systems

The major differences are as follows: 1) we use dropout and BN with weight normalization (WN) as regularizers instead of existing techniques such as spectral normalization (SN) and gradient penalties (GP). BN has been proven to function as a regularizer that imposes the Lipschitz constraint [9], which has otherwise been achieved by SN and GP [3, 7]. In addition, dropout and WN have been successfully adopted in the classifier-based model [12]. Following the two time-scale update rule (TTUR) [4], which is adopted in Proj. SNGAN, the learning rates of the discriminator and the generator are set to 0.0004 and 0.0001, respectively, and they are fixed over the course of training.


MAP's not dead yet: Uncovering true language model modes by conditioning away degeneracy

arXiv.org Artificial Intelligence

It has been widely observed that exact or approximate MAP (mode-seeking) decoding from natural language generation (NLG) models consistently leads to degenerate outputs (Stahlberg and Byrne, 2019; Holtzman et al., 2019). This has generally been attributed to either a fundamental inadequacy of modes in models or weaknesses in language modeling. In contrast, in this work we emphasize that degenerate modes can occur even in the absence of any model error, due to contamination of the training data. Specifically, we show that mixing even a tiny amount of low-entropy noise with a population text distribution can cause the data distribution's mode to become degenerate, implying that the mode of any model trained on it will be degenerate as well. As the unconditional mode of NLG models will often be degenerate, we therefore propose to apply MAP decoding to the model's distribution conditional on avoiding specific degeneracies. Using exact search, we empirically verify that the length-conditional modes of machine translation models and language models are indeed more fluent and topical than their unconditional modes. For the first time, we also share many examples of exact modal sequences from these models, and from several variants of the LLaMA-7B model. Notably, the modes of the LLaMA models are still degenerate, showing that improvements in modeling have not fixed this issue. Because of the cost of exact mode-finding algorithms, we develop an approximate mode-finding approach, ACBS, which finds sequences that are both high-likelihood and high-quality. We apply this approach to LLaMA-7B, a model which was not trained for instruction following, and find that we are able to elicit reasonable outputs without any finetuning.
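
The contamination argument can be illustrated with a toy calculation. The sketch below is not taken from the paper: the 1,000 "fluent" strings, the 1% noise rate, the choice of the empty string as the degenerate output, and the helper names `mix` and `mode` are all assumptions made for illustration. It shows that when probability mass is spread thinly over many fluent outputs, a small amount of zero-entropy noise can become the most probable single outcome, and that conditioning on a non-degenerate property (here, non-zero length) recovers a fluent mode.

```python
from fractions import Fraction


def mix(population, noise, eps):
    """Return the mixture (1 - eps) * population + eps * noise as a dict."""
    keys = set(population) | set(noise)
    return {
        k: (1 - eps) * population.get(k, 0) + eps * noise.get(k, 0)
        for k in keys
    }


def mode(dist, keep=lambda s: True):
    """Most probable outcome among those satisfying `keep`.

    Renormalising over the kept outcomes does not change the argmax,
    so this is also the mode of the conditional distribution.
    """
    allowed = {k: p for k, p in dist.items() if keep(k)}
    return max(allowed, key=allowed.get)


# Hypothetical population: 1,000 fluent strings, one slightly more likely.
N = 1000
population = {f"fluent sentence {i}": Fraction(1, N + 1) for i in range(N)}
population["fluent sentence 0"] = Fraction(2, N + 1)

# Hypothetical zero-entropy noise: all mass on the empty string.
noise = {"": Fraction(1)}
eps = Fraction(1, 100)  # 1% contamination (assumed value)

data = mix(population, noise, eps)

unconditional = mode(data)
conditional = mode(data, keep=lambda s: len(s) > 0)

print(repr(unconditional), float(data[unconditional]))
print(repr(conditional), float(data[conditional]))
```

In this toy setting the empty string has probability 0.01, while the most likely fluent string has probability of roughly 0.002, so the unconditional mode is degenerate even though 99% of the mass lies on fluent text. Exact fractions are used so that the comparison does not depend on floating-point rounding.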