Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation
Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, Tie-Yan Liu
Neural Machine Translation (NMT) has achieved remarkable progress with the rapid evolution of model structures. In this paper, we propose the concept of layer-wise coordination for NMT, which explicitly coordinates the learning of hidden representations of the encoder and decoder together layer by layer, gradually from low level to high level.
Explain My Surprise: Learning Efficient Long-Term Memory by Predicting Uncertain Outcomes
In many sequential tasks, a model needs to remember relevant events from the distant past to make correct predictions. Unfortunately, a straightforward application of gradient-based training requires intermediate computations to be stored for every element of a sequence. This means storing a prohibitively large amount of intermediate data if a sequence consists of thousands or even millions of elements and, as a result, makes learning very long-term dependencies infeasible.
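To make the storage cost concrete, the following back-of-the-envelope sketch estimates how much memory is needed when one hidden-state vector is kept per sequence element for backpropagation. The hidden size of 1024, the 4-byte (float32) values, and the function name `bptt_activation_bytes` are illustrative assumptions, not figures from the paper.

```python
def bptt_activation_bytes(seq_len, hidden_size, bytes_per_value=4, num_layers=1):
    """Rough memory needed to keep one hidden-state vector per sequence step.

    All defaults are hypothetical; real models store more than one vector per step.
    """
    return seq_len * hidden_size * bytes_per_value * num_layers


# Compare a sequence of a thousand elements with one of a million elements.
for seq_len in (1_000, 1_000_000):
    gib = bptt_activation_bytes(seq_len, hidden_size=1024) / 2**30
    print(f"{seq_len:>9} steps -> {gib:.4f} GiB per sequence")
```

The cost grows linearly with sequence length, so under these assumptions a batch of even a few million-element sequences would already be hard to fit on a single accelerator.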
Supplementary Material for Flat Seeking Bayesian Neural Networks
Van-Anh Nguyen, Tung-Long Vuong
The proof can be found in Chapter 27 of [6]. For the non-flat version, the update is similar to mini-batch SGD, except that we add small Gaussian noise to the particle models. In Section 4.2 of the main paper, we provide a comprehensive analysis of the performance concerning [...] In the experiments presented in Tables 1 and 2 in the main paper, we train all models for 300 epochs using SGD, with a learning rate of 0.1 and a cosine schedule. For the Deep-Ensemble, SGLD, SGVB and SGVB-LRT baselines, we reproduce results using the same hyper-parameters and procedures as for our flat versions. ImageNet: This is a large and challenging dataset with 1000 classes.
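The training setup described above (SGD, learning rate 0.1, cosine schedule, 300 epochs, and a mini-batch-SGD-like update with small Gaussian noise for the non-flat version) can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes per-epoch annealing to a minimum learning rate of 0 with no warm-up, and `noise_std` is a hypothetical parameter, since the text does not give the noise scale.

```python
import math
import random


def cosine_lr(epoch, total_epochs=300, base_lr=0.1, min_lr=0.0):
    """Cosine-annealed learning rate (assumed: per-epoch steps, no warm-up)."""
    cosine = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cosine


def noisy_sgd_step(params, grads, lr, noise_std=0.0, rng=random):
    """One SGD step with optional Gaussian noise added to each parameter.

    noise_std is hypothetical; the source only says the noise is small.
    """
    updated = []
    for p, g in zip(params, grads):
        noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        updated.append(p - lr * g + noise)
    return updated


# Print the learning rate at the start, middle and end of training.
for epoch in (0, 150, 300):
    print(epoch, round(cosine_lr(epoch), 6))
```

Setting `noise_std=0.0` recovers plain SGD, which makes it easy to compare the noisy and noise-free updates on the same gradients.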