persistent training
Generative Modeling by Minimizing the Wasserstein-2 Loss
Huang, Yu-Jui, Malik, Zachariah
This paper approaches the unsupervised learning problem by minimizing the second-order Wasserstein loss (the $W_2$ loss) through a distribution-dependent ordinary differential equation (ODE), whose dynamics involves the Kantorovich potential associated with the true data distribution and a current estimate of it. A main result shows that the time-marginal laws of the ODE form a gradient flow for the $W_2$ loss, which converges exponentially to the true data distribution. An Euler scheme for the ODE is proposed and it is shown to recover the gradient flow for the $W_2$ loss in the limit. An algorithm is designed by following the scheme and applying persistent training, which naturally fits our gradient-flow approach. In both low- and high-dimensional experiments, our algorithm outperforms Wasserstein generative adversarial networks by increasing the level of persistent training appropriately.
A Differential Equation Approach for Wasserstein GANs and Beyond
Malik, Zachariah, Huang, Yu-Jui
We propose a new theoretical lens to view Wasserstein generative adversarial networks (WGANs). In our framework, we define a discretization inspired by a distribution-dependent ordinary differential equation (ODE). We show that such a discretization is convergent and propose a viable class of adversarial training methods to implement this discretization, which we call W1 Forward Euler (W1-FE). In particular, the ODE framework allows us to implement persistent training, a novel training technique that cannot be applied to typical WGAN algorithms without the ODE interpretation. Remarkably, when we do not implement persistent training, we prove that our algorithms simplify to existing WGAN algorithms; when we increase the level of persistent training appropriately, our algorithms outperform existing WGAN algorithms in both low- and high-dimensional examples.
Persistently Trained, Diffusion-assisted Energy-based Models
Zhang, Xinwei, Tan, Zhiqiang, Ou, Zhijian
Maximum likelihood (ML) learning for energy-based models (EBMs) is challenging, partly due to non-convergence of Markov chain Monte Carlo.Several variations of ML learning have been proposed, but existing methods all fail to achieve both post-training image generation and proper density estimation. We propose to introduce diffusion data and learn a joint EBM, called diffusion assisted-EBMs, through persistent training (i.e., using persistent contrastive divergence) with an enhanced sampling algorithm to properly sample from complex, multimodal distributions. We present results from a 2D illustrative experiment and image experiments and demonstrate that, for the first time for image data, persistently trained EBMs can {\it simultaneously} achieve long-run stability, post-training image generation, and superior out-of-distribution detection.
Persistent Neurons
Most algorithms used in neural networks(NN)-based leaning tasks are strongly affected by the choices of initialization. Good initialization can avoid sub-optimal solutions and alleviate saturation during training. However, designing improved initialization strategies is a difficult task and our understanding of good initialization is still very primitive. Here, we propose persistent neurons, a strategy that optimizes the learning trajectory using information from previous converged solutions. More precisely, we let the parameters explore new landscapes by penalizing the model from converging to the previous solutions under the same initialization. Specifically, we show that persistent neurons, under certain data distribution, is able to converge to more optimal solutions while initializations under popular framework find bad local minima. We further demonstrate that persistent neurons helps improve the model's performance under both good and poor initializations. Moreover, we evaluate full and partial persistent model and show it can be used to boost the performance on a range of NN structures, such as AlexNet and residual neural network. Saturation of activation functions during persistent training is also studied.