Supplementary Information for: Fast Matrix Square Roots with Applications to Gaussian Processes and Bayesian Optimization

Neural Information Processing Systems

We note that all methods incur some sampling error, regardless of the subset size (N). In Fig. S6 we plot the learned hyperparameters of the Precipitation SVGP models: 1) σ² (the kernel outputscale), which roughly corresponds to variance explained as "signal" in the data; 2) σ²_obs, which roughly corresponds to variance explained away as observational noise; and 3) ν (degrees of freedom), which controls the tails of the noise model (lower ν corresponds to heavier tails). As M increases, we find that the observational noise parameter decreases by a factor of 4 (down from 0.19 to 0.05), while the kernel outputscale (left) increases. Fig. S7 is a histogram displaying the msMINRES iterations needed to achieve a relative residual of 10⁻³ when training a M = 5,000 SVGP model on the 3droad dataset (subsampled to 30,000 datapoints).


Supplementary Information for: Fast Matrix Square Roots with Applications to Gaussian Processes and Bayesian Optimization

Neural Information Processing Systems

Fig. S1 and Fig. S2 are continuations of Figure 1. This work was conducted while David Eriksson was at Uber AI. Figure S2 plots the relative error of randomized SVD; in all cases, randomized SVD is unable to achieve a relative error better than about 0.25. Fig. S3 further demonstrates the effect of preconditioning on msMINRES-CIQ. To further compare msMINRES-CIQ to randomized methods, Fig. S4 plots the empirical covariance error of each method; msMINRES-CIQ and Cholesky-based sampling tend to have very similar empirical covariance error, while RFF incurs additional error due to the randomness in the RFF approximation.
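The randomized SVD baseline referenced above can be sketched as a Halko-style range finder. This is an assumed reconstruction, not the paper's experimental code: matrix size, spectrum, and rank are illustrative, chosen so that the low-rank error visibly plateaus on a slowly decaying spectrum (the toy numbers will not match the ~0.25 floor in Fig. S2).

```python
import numpy as np

# Assumed Halko-style randomized SVD sketch: random test matrix, QR range
# basis, then an exact SVD of the small projected matrix.
def randomized_svd(A, k, n_oversample=10, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega = rng.standard_normal((n, k + n_oversample))
    Q, _ = np.linalg.qr(A @ Omega)          # orthonormal range basis
    B = Q.T @ A                             # small projected matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

# Symmetric PSD test matrix with a heavy spectral tail, so any rank-k
# method hits an error floor regardless of the sketch quality.
n = 300
s_true = 1.0 / np.arange(1, n + 1) ** 0.5
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (Q * s_true) @ Q.T

U, s, Vt = randomized_svd(A, k=20)
rel_err = np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A)
```

With a slowly decaying spectrum, the relative error remains far from zero no matter how accurate the range finder is, which is the failure mode the figure highlights.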


Review for NeurIPS paper: Fast Matrix Square Roots with Applications to Gaussian Processes and Bayesian Optimization

Neural Information Processing Systems

Summary and Contributions: 3rd EDIT: With the notebook now attached to this submission, I am content with the quality of the empirical evaluation. For double precision, the Cholesky decomposition fails less often, whereas MINRES needs more iterations to converge. Since this aspect is neither explored nor even mentioned, I recommend rejecting this paper. EDIT: In their rebuttal, the authors just brushed over my concern that the number of iterations is limited to J = 200. It appears this parameter is far more crucial than the authors are willing to admit, and reproducing the experiments turned out to be difficult.


A Natural Primal-Dual Hybrid Gradient Method for Adversarial Neural Network Training on Solving Partial Differential Equations

Liu, Shu, Osher, Stanley, Li, Wuchen

arXiv.org Artificial Intelligence

We propose a scalable preconditioned primal-dual hybrid gradient algorithm for solving partial differential equations (PDEs). We multiply the PDE with a dual test function to obtain an inf-sup problem whose loss functional involves lower-order differential operators. The Primal-Dual Hybrid Gradient (PDHG) algorithm is then leveraged for this saddle point problem. By introducing suitable precondition operators to the proximal steps in the PDHG algorithm, we obtain an alternative natural gradient ascent-descent optimization scheme for updating the neural network parameters. We apply the Krylov subspace method (MINRES) to evaluate the natural gradients efficiently. Such treatment readily handles the inversion of precondition matrices via matrix-vector multiplication. A posterior convergence analysis is established for the time-continuous version of the proposed method. The algorithm is tested on various types of PDEs with dimensions ranging from $1$ to $50$, including linear and nonlinear elliptic equations, reaction-diffusion equations, and Monge-Ampère equations stemming from the $L^2$ optimal transport problems. We compare the performance of the proposed method with several commonly used deep learning algorithms, such as physics-informed neural networks (PINNs), the DeepRitz method, and weak adversarial networks (WANs), for solving PDEs using the Adam and L-BFGS optimizers. The numerical results suggest that the proposed method performs efficiently and robustly and converges more stably.
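The matrix-free natural-gradient step described in the abstract can be sketched in a few lines. This is an illustrative assumption, not the authors' code: the "Jacobian" `J`, the damping `eps`, and all shapes are stand-ins for the precondition operator accessed only through matrix-vector products.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

# Illustrative sketch (assumed names and shapes): the natural-gradient
# step solves M g = grad, where M = J^T J + eps*I is never formed
# explicitly and is only available through matrix-vector products --
# exactly the setting where MINRES applies.
rng = np.random.default_rng(0)
J = rng.standard_normal((500, 50))   # stand-in for an output Jacobian
grad = rng.standard_normal(50)       # stand-in for the parameter gradient
eps = 1e-3                           # damping so M is invertible

def matvec(v):
    # Two Jacobian products per solver iteration, no explicit matrix.
    return J.T @ (J @ v) + eps * v

M = LinearOperator((50, 50), matvec=matvec)
g, info = minres(M, grad)            # natural-gradient direction
```

The design point the abstract emphasizes is that inverting the precondition matrix reduces to a handful of such matvecs per PDHG step, so the cost scales with a few extra backpropagations rather than with forming or factoring `M`.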


Efficient first-order predictor-corrector multiple objective optimization for fair misinformation detection

Enouen, Eric, Mathesius, Katja, Wang, Sean, Carr, Arielle, Xie, Sihong

arXiv.org Artificial Intelligence

Multiple-objective optimization (MOO) aims to simultaneously optimize multiple conflicting objectives and has found important applications in machine learning, such as minimizing classification loss and discrepancy in treating different populations for fairness. At optimality, further optimizing one objective will necessarily harm at least one other objective, and decision-makers need to comprehensively explore multiple optima (called the Pareto front) to pinpoint one final solution. We address the efficiency of finding the Pareto front. First, finding the front from scratch using stochastic multi-gradient descent (SMGD) is expensive with large neural networks and datasets. We propose to explore the Pareto front as a manifold from a few initial optima, based on a predictor-corrector method. Second, for each exploration step, the predictor solves a large-scale linear system that scales quadratically in the number of model parameters and requires one backpropagation to evaluate a second-order Hessian-vector product per iteration of the solver. We propose a Gauss-Newton approximation that only scales linearly, and that requires only first-order inner products per iteration. This also allows for a choice between the MINRES and conjugate gradient methods when approximately solving the linear system. These innovations make predictor-corrector possible for large networks. Experiments on multi-objective (fairness and accuracy) misinformation detection tasks show that 1) the predictor-corrector method can find Pareto fronts better than or similar to SMGD in less time; and 2) the proposed first-order method does not harm the quality of the Pareto front identified by the second-order method, while further reducing running time.
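The Gauss-Newton substitution described in the abstract can be sketched as follows. This is a hypothetical toy, not the paper's method in full: the Jacobian, damping, and right-hand side are stand-ins, and the point is only that the Hessian-vector product is replaced by first-order products `J^T (J v)`, after which either CG or MINRES can solve the predictor's linear system.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg, minres

# Hypothetical sketch: a damped Gauss-Newton product J^T (J v) + damping*v
# replaces the Hessian-vector product, requiring only first-order
# (Jacobian-vector) products. All names and shapes are illustrative.
rng = np.random.default_rng(0)
n_data, n_params = 200, 40
J = rng.standard_normal((n_data, n_params))  # stand-in Jacobian
rhs = rng.standard_normal(n_params)          # stand-in predictor RHS
damping = 1e-2

gn = LinearOperator(
    (n_params, n_params),
    matvec=lambda v: J.T @ (J @ v) + damping * v,  # SPD by construction
)

# The Gauss-Newton matrix is SPD, so both CG and MINRES apply and should
# agree on the same system -- the choice the abstract mentions.
x_cg, _ = cg(gn, rhs)
x_minres, _ = minres(gn, rhs)
max_diff = np.abs(x_cg - x_minres).max()
```

Since the matrix is symmetric positive definite here, CG is the natural default; MINRES remains applicable when the system is symmetric but indefinite, which motivates offering both solvers.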