Appendix
The following lemma establishes the convergence of the SGD framework when the gradient estimator v(x) is unbiased and has bounded variance. In contrast, these papers assumed a bound on the norm of the gradient estimator ‖v(x)‖; one can also refer to the recent survey [6] for more general results on SGD. When the variance of v(x) is of order O(ϵ), one can use stepsizes that are independent of ϵ to guarantee ϵ-optimality or ϵ-stationarity, in which case the algorithm behaves much like gradient descent. We prove the case when F(x) is convex; suppose that the claim holds for iteration t. This section analyzes the bias, variance, and per-iteration cost of the L-SGD and MLMC-based gradient estimators.
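As a concrete illustration of the claim above, the following sketch (our own construction, not code from the paper) runs SGD with an unbiased, small-variance gradient estimator and a constant stepsize on a convex quadratic; when the estimator's variance is small, a fixed stepsize already drives the iterate close to the optimum, much like plain gradient descent:

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's setting): SGD on the convex
# quadratic F(x) = 0.5 * ||x - x_star||^2 with an unbiased gradient
# estimator v(x) = grad F(x) + noise of bounded variance sigma^2.
def sgd(x0, x_star, sigma, steps, lr, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    x_star = np.array(x_star, dtype=float)
    for _ in range(steps):
        grad = x - x_star                                 # exact gradient
        v = grad + sigma * rng.standard_normal(x.shape)   # unbiased estimator
        x -= lr * v
    return x

# With small variance, a stepsize independent of the accuracy target
# already brings the iterate close to the optimum.
x_final = sgd(x0=[5.0, -3.0], x_star=[1.0, 2.0], sigma=0.01, steps=500, lr=0.1)
```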
HYDRA: Pruning Adversarially Robust Neural Networks
In safety-critical but computationally resource-constrained applications, deep learning faces two key challenges: lack of robustness against adversarial attacks and large neural network size (often millions of parameters). While the research community has extensively explored the use of robust training and network pruning independently to address one of these challenges, only a few recent works have studied them jointly. However, these works inherit a heuristic pruning strategy that was developed for benign training, which performs poorly when integrated with robust training techniques, including adversarial training and verifiable robust training. To overcome this challenge, we propose to make pruning techniques aware of the robust training objective and let the training objective guide the search for which connections to prune. We realize this insight by formulating the pruning objective as an empirical risk minimization problem which is solved efficiently using SGD.
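As a rough sketch of this formulation (our own construction; HYDRA itself operates on deep networks with robust training losses), the following assigns a learnable importance score to each weight of a linear model, keeps only the top-k scores in a binary mask, and updates the scores by SGD through a straight-through estimator, so the empirical risk drives the search for which connections to prune:

```python
import numpy as np

# Minimal sketch (assumed toy setup, not the authors' code) of score-based
# pruning as empirical risk minimization: each pretrained weight w_i gets a
# learnable importance score s_i, the mask keeps the top-k scores, and the
# scores are updated by SGD with a straight-through gradient estimate.
def topk_mask(scores, k):
    mask = np.zeros_like(scores)
    mask[np.argsort(-np.abs(scores))[:k]] = 1.0
    return mask

def prune_by_erm(X, y, w, k, steps=200, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    s = 0.01 * rng.standard_normal(w.shape)      # random score initialization
    for _ in range(steps):
        m = topk_mask(s, k)
        resid = X @ (m * w) - y                  # squared-error risk residual
        grad_masked_w = X.T @ resid / len(y)     # d(risk)/d(m * w)
        # Straight-through estimator: treat d(m * w)/d(s) as w.
        s -= lr * grad_masked_w * w
    return topk_mask(s, k)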
Author Feedback
BBBVI took about 3 hours per dataset; the NOMT took less than 5 seconds per dataset. Reviewer 3 noted that the spike-and-slab model does not satisfy the non-overlapping support assumption of Theorem 1. Reviewer 2 pointed out an interesting asymmetry in Theorem 1 with respect to component K; a "symmetric" version of the theorem would be possible, but it would describe a … Reviewer 2 also suggested using reconstruction error as a metric for the sparse PCA application. I will include a discussion of these similarities and differences in the revision. Reviewer 1 asked whether the supports of the mixture distributions must be defined a priori.
Fast Iterative Hard Thresholding Methods with Pruning Gradient Computations
Yasutoshi Ida
We accelerate the iterative hard thresholding (IHT) method, which finds k important elements from a parameter vector in a linear regression model. Although the plain IHT repeatedly updates the parameter vector during the optimization, computing gradients is the main bottleneck. Our method safely prunes unnecessary gradient computations to reduce the processing time. The main idea is to efficiently construct a candidate set, which contains k important elements in the parameter vector, for each iteration. Specifically, before computing the gradients, we prune unnecessary elements in the parameter vector for the candidate set by utilizing upper bounds on absolute values of the parameters. Our method guarantees the same optimization results as the plain IHT because our pruning is safe. Experiments show that our method is up to 73 times faster than the plain IHT without degrading accuracy.
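For reference, the plain IHT baseline can be sketched as follows (a minimal illustration of the baseline under a toy noiseless linear-regression setup, not the authors' accelerated method): each iteration takes a full gradient step, the stated bottleneck, and then keeps only the k largest-magnitude entries of the parameter vector:

```python
import numpy as np

# Minimal sketch (illustrative baseline): plain iterative hard thresholding
# for sparse linear regression, min ||Xw - y||^2 subject to ||w||_0 <= k.
def hard_threshold(w, k):
    out = np.zeros_like(w)
    idx = np.argsort(-np.abs(w))[:k]   # indices of the k largest magnitudes
    out[idx] = w[idx]
    return out

def iht(X, y, k, steps=200, lr=0.1):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # full gradient: the bottleneck
        w = hard_threshold(w - lr * grad, k)
    return w
```

The accelerated method described above avoids computing many of these gradient entries by first pruning elements that provably cannot enter the top-k candidate set.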
Improving Self-Supervised Learning by Characterizing Idealized Representations
Despite the empirical successes of self-supervised learning (SSL) methods, it is unclear what characteristics of their representations lead to high downstream accuracies. In this work, we characterize properties that SSL representations should ideally satisfy. Specifically, we prove necessary and sufficient conditions such that for any task invariant to given data augmentations, desired probes (e.g., linear or MLP) trained on that representation attain perfect accuracy. These requirements lead to a unifying conceptual framework for improving existing SSL methods and deriving new ones. For contrastive learning, our framework prescribes simple but significant improvements to previous methods such as using asymmetric projection heads. For non-contrastive learning, we use our framework to derive a simple and novel objective. Our resulting SSL algorithms outperform baselines on standard benchmarks, including SwAV+multicrops on linear probing of ImageNet.
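To make the asymmetric-projection-head idea concrete, here is a minimal forward-pass sketch (our own construction, not the paper's exact objective) of an InfoNCE-style contrastive loss in which the two augmented views pass through different projection matrices before similarities are computed:

```python
import numpy as np

# Minimal sketch (assumed setup): InfoNCE-style contrastive loss with an
# asymmetric pair of projection heads, W_a for view 1 and W_b for view 2.
def l2_normalize(z, axis=-1):
    return z / np.linalg.norm(z, axis=axis, keepdims=True)

def info_nce_asymmetric(h1, h2, W_a, W_b, temperature=0.1):
    # Asymmetry: the two views use different (here linear) projection heads.
    z1 = l2_normalize(h1 @ W_a)
    z2 = l2_normalize(h2 @ W_b)
    logits = z1 @ z2.T / temperature                 # pairwise similarities
    # Cross-entropy with positive pairs on the diagonal.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

In the paper's full method the heads are nonlinear and trained end-to-end; this fragment only shows where the asymmetry enters the objective.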