 Machine Translation


Cross-lingual Retrieval for Iterative Self-Supervised Training (supplementary materials)

Neural Information Processing Systems

1 Experiment details

In this section, we describe our experimental procedures in more detail, including hyperparameters and intermediate results. For the unsupervised machine translation task, we evaluate BLEU scores using multi-bleu.perl




On the Accuracy of Self-Normalized Log-Linear Models

Neural Information Processing Systems

Calculation of the log-normalizer is a major computational obstacle in applications of log-linear models with large output spaces. The problem of fast normalizer computation has therefore attracted significant attention in the theoretical and applied machine learning literature. In this paper, we analyze a recently proposed technique known as "self-normalization", which introduces a regularization term in training to penalize log normalizers for deviating from zero. This makes it possible to use unnormalized model scores as approximate probabilities. Empirical evidence suggests that self-normalization is extremely effective, but a theoretical understanding of why it should work, and how generally it can be applied, is largely lacking. We prove upper bounds on the loss in accuracy due to self-normalization, describe classes of input distributions that self-normalize easily, and construct explicit examples of high-variance input distributions. Our theoretical results make predictions about the difficulty of fitting self-normalized models to several classes of distributions, and we conclude with empirical validation of these predictions.
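To make the idea of self-normalization concrete, here is a minimal sketch of a training loss that penalizes the log-normalizer for deviating from zero. It is an illustration under stated assumptions, not the paper's exact formulation: the squared form of the penalty, the weight `alpha`, and all function names are hypothetical choices made for this example.

```python
import math

def log_normalizer(scores):
    """Log-sum-exp of the unnormalized scores, computed stably."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def self_normalized_loss(scores, target, alpha=0.1):
    """Negative log-likelihood plus a penalty on the log-normalizer.

    Assumption: the penalty is alpha * (log Z)^2, with alpha a
    hypothetical regularization weight chosen for illustration.
    """
    log_z = log_normalizer(scores)
    nll = log_z - scores[target]       # standard log-linear NLL
    return nll + alpha * log_z ** 2    # pushes log Z toward zero

def approx_log_prob(scores, target):
    """At prediction time, treat the raw score as an approximate
    log-probability, skipping the sum over the output space."""
    return scores[target]

# Scores that are already log-probabilities have log Z = 0,
# so the penalty vanishes and the raw score is exact.
scores = [math.log(0.7), math.log(0.2), math.log(0.1)]
loss = self_normalized_loss(scores, target=0)

# Shifting every score by 2 leaves the distribution unchanged
# but sets log Z = 2, which the penalty term now charges for.
shifted = [s + 2.0 for s in scores]
shifted_loss = self_normalized_loss(shifted, target=0)
```

The two inputs define the same normalized distribution, yet only the first lets unnormalized scores stand in for probabilities; the penalty term is what steers training toward such solutions.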




Language Models are Few-Shot Learners

Neural Information Processing Systems

Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting.


Appendix of Prophet Attention

Neural Information Processing Systems

[...] CIDEr-c40, which is the default ranking score on the leaderboard, and rank 1st. Compared with image captioning, the target of video captioning is the video clip, i.e., an ordered [...] The dataset contains 10,000 video clips, and each video is paired with 20 annotated sentences. We use the official splits to report our results. CIDEr, which is built upon n-gram matching, is used in our tests for performance evaluation. All re-implementations and our experiments were run on V100 GPUs.