From the above equation, ker h = span{…}. (21)

Neural Information Processing Systems 

The last equation is derived as follows. In addition, we set the observation variance σ_x to 0.25. Logistic(·; µ, s) is the density function of a logistic distribution with location parameter µ and scale parameter s, and σ is the logistic sigmoid function. Before each activation, we apply layer normalization [Ba et al., 2016] to stabilize training. When the model has sufficiently high expressive power, b may diverge to infinity [Rezende and Viola, 2018], so we add a regularization term of (b + 2ζ(−b))/m to the loss function, where m is the number of training examples.
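As a concrete illustration of the quantities above, the following sketch implements the logistic density and the regularization term, assuming ζ denotes the softplus function ζ(x) = log(1 + e^x) (a common convention, e.g. in Glorot et al.'s notation); the function names `logistic_pdf` and `b_regularizer` are illustrative, not from the paper. Note that b + 2ζ(−b) is a smooth approximation of |b|, which is why the term keeps b from diverging in either direction.

```python
import math

def zeta(x):
    """Softplus zeta(x) = log(1 + exp(x)), computed stably."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def b_regularizer(b, m):
    """Regularization term (b + 2*zeta(-b)) / m added to the loss.

    Since b + 2*zeta(-b) smoothly approximates |b|, this penalizes
    b drifting toward +/- infinity; m is the number of training
    examples.
    """
    return (b + 2.0 * zeta(-b)) / m

def logistic_pdf(x, mu, s):
    """Density Logistic(x; mu, s) with location mu and scale s."""
    sig = 1.0 / (1.0 + math.exp(-(x - mu) / s))  # logistic sigmoid
    return sig * (1.0 - sig) / s
```

For example, the logistic density at its location parameter equals 1/(4s), and the penalty evaluates to roughly |b|/m once |b| is large.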