Particle Dual Averaging: Optimization of Mean Field Neural Network with Global Convergence Rate Analysis
We propose the particle dual averaging (PDA) method, which generalizes the dual averaging method in convex optimization to optimization over probability distributions with a quantitative runtime guarantee. The algorithm consists of an inner loop and an outer loop: the inner loop utilizes the Langevin algorithm to approximately solve for a stationary distribution, which is then optimized in the outer loop. The method can thus be interpreted as an extension of the Langevin algorithm that naturally handles nonlinear functionals on the probability space. An important application of the proposed method is the optimization of neural networks in the mean field regime, which is theoretically attractive due to the presence of nonlinear feature learning, but for which quantitative convergence rates can be challenging to obtain. By adapting finite-dimensional convex optimization theory to the space of measures, we analyze PDA in regularized empirical / expected risk minimization, and establish quantitative global convergence in learning two-layer mean field neural networks under more general settings. Our theoretical results are supported by numerical simulations on neural networks of reasonable size.
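To make the two-loop structure concrete, the following is a minimal 1-D sketch, not the paper's algorithm: the inner loop runs unadjusted Langevin dynamics to sample from the current potential, and the outer loop dual-averages the first-variation "gradient" of a toy objective (push the particle mean toward a target, with a Gaussian entropic regularizer). The objective, step sizes, and iteration counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_sample(potential_grad, n_particles, n_steps=500, step=1e-2):
    """Inner loop: unadjusted Langevin dynamics targeting exp(-potential)."""
    x = rng.standard_normal(n_particles)
    for _ in range(n_steps):
        x = x - step * potential_grad(x) \
              + np.sqrt(2 * step) * rng.standard_normal(n_particles)
    return x

# Toy objective: 0.5 * (E[x] - target)^2 plus a Gaussian regularizer x^2/2.
# The first variation of the loss at a particle x is (E[x] - target) * x,
# so its linear coefficient g = mean - target is what gets dual-averaged.
target = 2.0

def pda(n_outer=20, n_particles=500):
    avg_coeff = 0.0   # dual-averaged linear coefficient of the potential
    weight_sum = 0.0
    for t in range(1, n_outer + 1):
        # current potential: q(x) = avg_coeff * x + 0.5 * x^2
        grad = lambda x, c=avg_coeff: c + x
        particles = langevin_sample(grad, n_particles)
        # outer loop: weighted dual averaging of the new coefficient
        g = particles.mean() - target
        avg_coeff = (weight_sum * avg_coeff + t * g) / (weight_sum + t)
        weight_sum += t
    return particles

particles = pda()
print(particles.mean())
```

With the Gaussian regularizer, the regularized optimum of this toy problem is a distribution with mean target / 2 = 1, and the dual-averaged coefficient stabilizes at -1, so the particle mean settles near 1 rather than 2; the regularization bias is expected, mirroring how PDA optimizes a regularized objective.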
A The Embeddings
In this section, we briefly introduce the four kinds of embeddings that constitute the fusion embedding. The goal of the position embedding module is to calibrate the position of each time point in the sequence so that the self-attention mechanism can recognize the relative positions between different time points in the input sequence. We design the token embedding module to enrich the features of each time point by fusing features from adjacent time points within a certain interval. The role of the spatial embedding is to locate and encode the spatial locations of different nodes, so that each node at a different location possesses a unique spatial embedding. This enables the model to identify nodes on different spatial and temporal planes after the dimensionality is compressed in later computation.
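As a rough illustration of how three of these embeddings could be computed and fused, here is a NumPy sketch: a sinusoidal position embedding, a token embedding implemented as a 1-D convolution over adjacent time points, and a per-node spatial lookup table, summed into one fused representation. All shapes, kernel sizes, and initializations are hypothetical, not the paper's exact design.

```python
import numpy as np

def position_embedding(seq_len, d):
    """Sinusoidal position embedding so attention can infer relative positions."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))  # (seq_len, d)

def token_embedding(x, d, kernel=3, seed=0):
    """1-D convolution over time: each time point aggregates features of its
    neighbours within the kernel interval."""
    rng = np.random.default_rng(seed)
    seq_len, n_feat = x.shape
    w = rng.standard_normal((kernel, n_feat, d)) * 0.1
    pad = kernel // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((seq_len, d))
    for t in range(seq_len):
        out[t] = np.einsum('kf,kfd->d', xp[t:t + kernel], w)
    return out

def spatial_embedding(node_id, n_nodes, d, seed=1):
    """A unique learned vector per spatial node (lookup table)."""
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((n_nodes, d)) * 0.1
    return table[node_id]  # (d,)

# Fusion: sum the embeddings for one node's time series (shapes must match).
seq_len, n_feat, d, n_nodes = 12, 4, 8, 5
x = np.random.default_rng(2).standard_normal((seq_len, n_feat))
fused = position_embedding(seq_len, d) + token_embedding(x, d) \
        + spatial_embedding(3, n_nodes, d)
print(fused.shape)  # (12, 8)
```

Summation keeps all components in the same d-dimensional space, which is the usual design choice that lets the later computation compress dimensionality while each node retains a unique spatial signature.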
2 Problem Formulation
The Bayes-Adaptive Markov Decision Process (BAMDP) formalism pursues the Bayes-optimal solution to the exploration-exploitation trade-off in reinforcement learning. As the computation of exact solutions to Bayesian reinforcement-learning problems is intractable, much of the literature has focused on developing suitable approximation algorithms. In this work, before diving into algorithm design, we first define, under mild structural assumptions, a complexity measure for BAMDP planning. As efficient exploration in BAMDPs hinges upon the judicious acquisition of information, our complexity measure highlights the worst-case difficulty of gathering information and exhausting epistemic uncertainty. To illustrate its significance, we present a computationally intractable, exact planning algorithm that exploits this measure to plan more efficiently. We then conclude by introducing a specific form of state abstraction with the potential to reduce BAMDP complexity and to give rise to a computationally tractable, approximate planning algorithm.
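The core idea of a BAMDP, namely augmenting the state with the agent's belief and letting actions both earn reward and exhaust epistemic uncertainty, can be illustrated on the simplest instance, a two-armed Bernoulli bandit with Beta posteriors. This toy sketch is not the paper's planning algorithm; Thompson sampling stands in for the intractable Bayes-optimal policy, and all rates and counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.3, 0.7]      # unknown to the agent
belief = np.ones((2, 2))     # per-arm Beta(alpha, beta); rows: arms

def posterior_variance(b):
    """Epistemic uncertainty of each arm's Beta posterior."""
    a, be = b[:, 0], b[:, 1]
    return a * be / ((a + be) ** 2 * (a + be + 1))

for step in range(500):
    # Thompson sampling: act greedily w.r.t. one posterior sample -- a cheap
    # stand-in for the Bayes-optimal exploration-exploitation trade-off.
    samples = rng.beta(belief[:, 0], belief[:, 1])
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    # Bayes-adaptive transition: the belief component of the state updates.
    belief[arm, 0] += reward
    belief[arm, 1] += 1 - reward

# Information gathering shrinks epistemic uncertainty about the good arm.
print(posterior_variance(belief))
```

The "state" here is really (environment state, belief), and the difficulty of exhausting the posterior variance is exactly the kind of quantity a BAMDP complexity measure tries to capture in the worst case.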
Supplementary Material
We provide more details of training the teacher network in Section A, more experimental results on synthetic functions in Section B, and the hyperparameter settings for benchmark datasets in Section C. Here, we omit the iteration subscript t for simplicity. To solve Eq. (10), we obtain the hypergradient with respect to the teacher parameters and backpropagate it. As shown in Algorithm 1, we train the teacher network for one step each time it is called by an underperforming student model, where a step refers to one iteration on the synthetic functions and one epoch over the validation set on the benchmark datasets. In Section 4.1, we presented experimental results of HPM on two popular synthetic functions, the Branin and Hartmann6D functions. In the following, we provide more details about the synthetic functions and the implementation, as well as additional results on the other two functions.
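The teacher-update pattern described above can be sketched as a minimal bilevel loop: the student takes one inner gradient step on a teacher-parameterized loss, and the teacher is trained for one step on the validation loss via the hypergradient (chain rule through the student's update), but only when the student underperforms. The scalar quadratic objective, thresholds, and step sizes below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

w_star = 3.0        # validation optimum (not visible in the student's loss)
theta = 0.0         # teacher parameter
w = 0.0             # student parameter
eta, beta = 0.5, 0.5

for step in range(200):
    # student inner step on L(w; theta) = 0.5 * (w - theta)^2
    w_new = w - eta * (w - theta)
    val_loss = 0.5 * (w_new - w_star) ** 2
    # the teacher is trained one step only when the student underperforms
    if val_loss > 1e-4:
        # hypergradient: dV/dtheta = (w' - w_star) * dw'/dtheta, dw'/dtheta = eta
        hypergrad = (w_new - w_star) * eta
        theta -= beta * hypergrad
    w = w_new

print(theta, w)
```

In this toy, the teacher parameter converges to the validation optimum and then freezes once the student performs well, mirroring the on-demand teacher updates of the algorithm described in the text.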
We thank all reviewers for their time and constructive comments
We first address concerns that were brought up by multiple reviewers. NMODE is more sample efficient than other methods (Appendix C.2, first paragraph). The quantifier for Prop 5.1 should be "for some"; this will be fixed.
Instability and Local Minima in GAN Training with Kernel Discriminators
Generative Adversarial Networks (GANs) are a widely used tool for generative modeling of complex data. Despite their empirical success, the training of GANs is not fully understood due to the min-max optimization of the generator and discriminator. This paper analyzes these joint dynamics when the true samples as well as the generated samples are discrete, finite sets, and the discriminator is kernel-based. A simple yet expressive framework for analyzing training, called the Isolated Points Model, is introduced. In the proposed model, the distance between true samples greatly exceeds the kernel width, so each generated point is influenced by at most one true point. Our model enables precise characterization of the conditions for convergence, both to good and bad minima. In particular, the analysis explains two common failure modes: (i) an approximate mode collapse and (ii) divergence. Numerical simulations are provided that replicate these behaviors as predicted.
ImpatientCapsAndRuns: Approximately Optimal Algorithm Configuration from an Infinite Pool
Algorithm configuration procedures optimize the parameters of a given algorithm to perform well over a distribution of inputs. Recent theoretical work has focused on the case of selecting between a small number of alternatives. In practice, parameter spaces are often very large or infinite, and so successful heuristic procedures discard parameters "impatiently", based on very few observations.
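The "impatient" pattern over an infinite pool can be sketched as follows: draw configurations from a continuous parameter space, discard each after only a handful of observed runs if it looks slow, and evaluate the survivors more fully. The runtime model, thresholds, and run counts below are all hypothetical, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def runtime(theta, n=1):
    """Hypothetical runtime distribution: the best configs sit near theta = 0.5."""
    mean = 1.0 + 5.0 * (theta - 0.5) ** 2
    return rng.exponential(mean, size=n)

def impatient_configure(n_configs=200, peek=3, full=50, cutoff=2.5):
    best_theta, best_mean = None, np.inf
    for _ in range(n_configs):
        theta = rng.random()                   # draw from the infinite pool
        if runtime(theta, peek).mean() > cutoff:
            continue                           # impatient rejection: few runs
        mean_rt = runtime(theta, full).mean()  # full evaluation for survivors
        if mean_rt < best_mean:
            best_theta, best_mean = theta, mean_rt
    return best_theta, best_mean

theta, mean_rt = impatient_configure()
print(theta, mean_rt)
```

The peek budget controls the trade-off at the heart of such procedures: a smaller peek discards bad configurations cheaply but risks rejecting good ones on unlucky early runs, which is exactly the source of the approximation in "approximately optimal" guarantees.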