Learning the Simplest Neural ODE
Okamoto, Yuji, Takeuchi, Tomoya, Sakemi, Yusuke
Because ODE models allow one to embed inherent properties--for example, Hamiltonian structure [4] or stability guarantees [5-7]--Neural ODEs often achieve accurate long-term forecasting. Moreover, treating the solution map of an ODE as a diffeomorphism has led to applications in generative modeling. Compared to normalizing flows [8], such continuous-time models can yield low-parameter, memory-efficient generators [9]. Gradient-based optimization is the de facto standard for learning NN-defined dynamics, enabled by the adjoint method [1] for efficient gradient computation. Nonetheless, several issues arise: (i) sensitivity of gradients to the sampling interval of time-series data, (ii) variation of convergence speed depending on NN initialization, and (iii) difficulty of tuning hyper-parameters such as the learning rate. These factors often cause training instability or vanishing/exploding gradients. Existing work addresses them heuristically by restricting the search space of dynamics [9] or by tailored initialization [10]. However, the fundamental question-- why is training Neural ODEs hard?
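Issue (i) and the vanishing/exploding gradients noted above can be made concrete with a toy case. The following is a minimal sketch, not the method of this paper: it assumes a scalar linear ODE dx/dt = a*x with a single observation at time h, and it uses the closed-form solution x(h) = x0*exp(a*h) rather than the adjoint method. The function name `loss_grad` and the parameters `x0` and `a_true` are hypothetical and chosen only for illustration.

```python
import math


def loss_grad(a, h, x0=1.0, a_true=-1.0):
    """Gradient of 0.5 * (x(h; a) - y)**2 with respect to a.

    Assumed toy model (illustrative only): dx/dt = a * x, so that
    x(h; a) = x0 * exp(a * h). The target y is generated by a_true.
    """
    pred = x0 * math.exp(a * h)            # model state at time h
    target = x0 * math.exp(a_true * h)     # observed state at time h
    dpred_da = h * pred                    # d/da [x0 * exp(a * h)]
    return (pred - target) * dpred_da


if __name__ == "__main__":
    # Same parameter error, different sampling intervals h.
    for h in (0.01, 0.1, 1.0, 10.0):
        print(f"h={h:5.2f}  unstable guess a=+1: {loss_grad(1.0, h): .3e}"
              f"   over-damped guess a=-3: {loss_grad(-3.0, h): .3e}")
```

Under these assumptions the gradient carries a factor h*exp(a*h), so for a long sampling interval it grows very large when the current estimate of a is positive and shrinks toward zero when the estimate is strongly negative. A single fixed learning rate therefore cannot suit all sampling intervals in this toy setting.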
May-6-2025