Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation

Neural Information Processing Systems

Neural Machine Translation (NMT) has achieved remarkable progress with the rapid evolution of model structures. In this paper, we propose the concept of layer-wise coordination for NMT, which explicitly coordinates the learning of hidden representations of the encoder and decoder together layer by layer, gradually from low level to high level.



Change-point Detection for Sparse and Dense Functional Data in General Dimensions

Neural Information Processing Systems

We study the problem of change-point detection and localisation for functional data sequentially observed on a general d-dimensional space, where we allow the functional curves to be either sparsely or densely sampled.



Explain My Surprise: Learning Efficient Long-Term Memory by Predicting Uncertain Outcomes

Neural Information Processing Systems

In many sequential tasks, a model needs to remember relevant events from the distant past to make correct predictions. Unfortunately, a straightforward application of gradient-based training requires intermediate computations to be stored for every element of a sequence. This means storing prohibitively large intermediate data if a sequence consists of thousands or even millions of elements and, as a result, makes learning of very long-term dependencies infeasible.






Supplementary Material for Flat Seeking Bayesian Neural Networks

Van-Anh Nguyen, Tung-Long Vuong

Neural Information Processing Systems

The proof can be found in Chapter 27 of [6]. For the non-flat version, the update is similar to mini-batch SGD except that we add small Gaussian noise to the particle models. In Section 4.2 of the main paper, we provide a comprehensive analysis of the performance concerning ... In the experiments presented in Tables 1 and 2 in the main paper, we train all models for 300 epochs using SGD, with a learning rate of 0.1 and a cosine schedule. For the Deep-Ensemble, SGLD, SGVB and SGVB-LRT baselines, we reproduce results using the same hyper-parameters and procedures as for our flat versions. ImageNet: This is a large and challenging dataset with 1000 classes.
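
The training setup quoted above (300 epochs, learning rate 0.1, cosine schedule, and an SGD-like update with added Gaussian noise) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `cosine_lr` and `noisy_sgd_step`, the minimum learning rate of 0, per-epoch (rather than per-step) annealing, and the `noise_std` parameter are all assumptions made for the example, since the snippet does not specify them.

```python
import math
import random


def cosine_lr(epoch, total_epochs=300, base_lr=0.1, min_lr=0.0):
    """Cosine-annealed learning rate for a 0-indexed epoch.

    base_lr=0.1 and total_epochs=300 come from the quoted setup;
    min_lr=0.0 and per-epoch granularity are assumptions.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def noisy_sgd_step(params, grads, lr, noise_std, rng):
    """One SGD step with small Gaussian noise added to each parameter.

    noise_std is a hypothetical knob; the snippet only says the noise is small.
    """
    return [p - lr * g + rng.gauss(0.0, noise_std) for p, g in zip(params, grads)]


# Learning rate for every epoch of a 300-epoch run.
schedule = [cosine_lr(epoch) for epoch in range(300)]

# Toy usage: one noisy update on two scalar parameters with made-up gradients.
rng = random.Random(0)
updated = noisy_sgd_step([1.0, -2.0], [0.5, -0.5], lr=schedule[0], noise_std=1e-3, rng=rng)
```

Setting `noise_std` to 0 reduces `noisy_sgd_step` to a plain SGD update, which mirrors the statement that the non-flat update differs from mini-batch SGD only by the added noise.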