 resnet-50


KOALA++: Efficient Kalman-Based Optimization with Gradient-Covariance Products

Neural Information Processing Systems

We propose KOALA++, a scalable Kalman-based optimization algorithm that explicitly models structured gradient uncertainty in neural network training. Unlike second-order methods, which rely on expensive second-order gradient calculations, our method directly estimates the parameter covariance matrix by recursively updating compact gradient-covariance products. This design improves upon the original KOALA framework, which assumed a diagonal covariance, by implicitly capturing richer uncertainty structure without storing the full covariance matrix and while avoiding large matrix inversions. Across diverse tasks, including image classification and language modeling, KOALA++ achieves accuracy on par with or better than state-of-the-art first- and second-order optimizers while maintaining the efficiency of first-order methods.



Combining equation (4) with equation (5), we have: $L(f_\theta) \ldots \prod^{n} \ldots$

Neural Information Processing Systems

A.1 Theoretical Proof

The following is the proof of Theorems 1 and 2 on the Upper Bound on the Lipschitz Constant of a DNN with Gaussian Distributed Weights, which is inspired by [67-69]. Let $A$ be an $N \times n$ matrix whose elements are independent standard normal random variables. Then $\sqrt{N} - \sqrt{n} \le E[\lambda_{\min}(A)] \le E[\lambda_{\max}(A)] \le \sqrt{N} + \sqrt{n}$, where $\lambda_{\min}$ and $\lambda_{\max}$ denote the minimum and maximum singular values of $A$, respectively, and $E[\cdot]$ represents the expected value. This can be extended to convolutional neural networks (CNNs): using a doubly block circulant matrix, the convolution operation can be represented as a matrix multiplication.
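The singular-value bound above can be checked numerically. The following is a minimal sketch, not code from the paper: it samples Gaussian matrices, estimates the extreme singular values by power iteration on A^T A, and compares their averages with sqrt(N) - sqrt(n) and sqrt(N) + sqrt(n). The function names, the matrix size (40 by 5), the trial count, and the seed are all illustrative assumptions.

```python
import math
import random


def gram(a):
    """Return G = A^T A for a matrix stored as a list of rows."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]


def top_eigenvalue(g, iters=200):
    """Estimate the dominant eigenvalue of a symmetric matrix by power iteration."""
    n = len(g)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * w[i] for i in range(n))  # Rayleigh quotient


def extreme_singular_values(a):
    """Return (min, max) singular values of A via the eigenvalues of A^T A."""
    g = gram(a)
    n = len(g)
    lam_max = top_eigenvalue(g)
    # Shifting by c >= lam_max turns the smallest eigenvalue of G into the
    # dominant eigenvalue of (c*I - G), so power iteration can find it.
    shift = 1.1 * lam_max
    shifted = [[(shift if i == j else 0.0) - g[i][j] for j in range(n)]
               for i in range(n)]
    lam_min = shift - top_eigenvalue(shifted)
    return math.sqrt(max(lam_min, 0.0)), math.sqrt(lam_max)


def mean_extreme_singular_values(rows, cols, trials, seed=0):
    """Average the extreme singular values over random standard normal matrices."""
    rng = random.Random(seed)
    total_min = total_max = 0.0
    for _ in range(trials):
        a = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
        s_min, s_max = extreme_singular_values(a)
        total_min += s_min
        total_max += s_max
    return total_min / trials, total_max / trials


N, n = 40, 5  # illustrative sizes, not taken from the paper
mean_min, mean_max = mean_extreme_singular_values(N, n, trials=200)
lower = math.sqrt(N) - math.sqrt(n)
upper = math.sqrt(N) + math.sqrt(n)
print(f"lower bound {lower:.3f} <= mean s_min {mean_min:.3f}")
print(f"mean s_max {mean_max:.3f} <= upper bound {upper:.3f}")
```

Because the averages are Monte Carlo estimates, they only approximate the expectations in the inequality; increasing the number of trials tightens the estimate.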





Learning Best Combination for Efficient N:M Sparsity

Neural Information Processing Systems

By forcing at most N out of M consecutive weights to be non-zero, the recent N:M network sparsity has received increasing attention for its two attractive advantages: 1) Promising performance at a high sparsity.
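To make the N:M constraint concrete, here is a minimal sketch that keeps the N largest-magnitude weights in every group of M consecutive weights and zeroes the rest. Magnitude-based selection is an assumption chosen for illustration; it is not the learned combination selection that the paper proposes, and the helper name `nm_prune` is hypothetical.

```python
def nm_prune(weights, n, m):
    """Keep the n largest-magnitude entries in each group of m consecutive weights.

    Illustrative magnitude-based rule; the remaining entries are set to zero.
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
sparse = nm_prune(weights, n=2, m=4)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.8, 0.0]
```

With N = 2 and M = 4, every group of four weights ends up with at most two non-zero entries, which corresponds to 50% sparsity.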


Adversarial Style Augmentation for Domain Generalized Urban-Scene Segmentation (Supplementary Material)

Neural Information Processing Systems

For the synthetic-to-real domain generalization (DG), we use one of the synthetic datasets (GTAV [12] or SYNTHIA [13]) as the source domain and evaluate the model performance on three real-world datasets (CityScapes [2], BDD-100K [16], and Mapillary [11]). GTAV [12] contains 24,966 images of size 1914 × 1052. It is split into 12,403, 6,382, and 6,181 images for training, validation, and testing, respectively. SYNTHIA [13] contains 9,400 images of size 960 × 720, of which 6,580 are used for training. We use the validation sets of the three real-world datasets for evaluation.