
analysis of Algorithm

Neural Information Processing Systems

In this section, we provide a convergence rate analysis for Algorithm 1. Similar to Hazan et al. [36], Algorithm 1 has access to an approximate density oracle and an approximate planner, defined below.

Visitation density oracle: We assume access to an approximate density estimator that takes a policy $\pi$ and a density approximation error $\epsilon_d > 0$ as inputs and returns $\hat{d}_\pi$ such that $\|d_\pi - \hat{d}_\pi\|_\infty \le \epsilon_d$.

Approximate planning oracle: We assume access to an approximate planner that, given any MDP $M$ and error tolerance $\epsilon_p > 0$, returns a policy $\pi$ such that $J_M(\pi) \ge \max_{\pi'} J_M(\pi') - \epsilon_p$.

A.1 Proof of Theorem 1

We first give the following proposition that captures certain properties of the proposed objective. The proof is postponed to the end of this section. Taking the above proposition as given for the moment, we prove Theorem 1 following steps similar to those of Hazan et al. [36, Theorem 4.1]. Since $\pi_k$ returned by the approximate planning oracle is an $\epsilon_p$-optimal policy in $M_k$, we have $(1-\gamma)^{-1}\langle d_{\pi_k}, r_k\rangle \ge (1-\gamma)^{-1}\langle d_\pi, r_k\rangle - \epsilon_p$ for any policy $\pi$, including $\pi^\star$. Therefore, It is straightforward to check that setting 0.1 1, p 0.1, d 0.1 1, 0.1, and the number of iterations K 1 log(10B 1) yields the claim of Theorem 1.

Remark 2. Since the temperature parameter (indexed by $k$) in Proposition 1 goes to zero as $k$ increases, one can show that the expected value of the policy returned by Algorithm 1 converges to the maximum performance $J(\pi^\star)$.


!011Im2Col0 1

Neural Information Processing Systems

We adopt a residual network (ResNet) [23] based feature extractor, with ELU as the activation function. Following [15], we adopt group normalization and instance normalization for better stability of the networks. We adopt the "leave-one-out" training strategy for obtaining the results on each of the categories of MVTec-AD. All experiments are performed with the same settings and hyperparameters. We resize all images to 128 × 128, and do not perform any data augmentation.



Sparse Winning Tickets are Data-Efficient Image Recognizers

Neural Information Processing Systems

Improving the performance of deep networks in data-limited regimes has warranted much attention. In this work, we empirically show that "winning tickets" (small subnetworks) obtained via magnitude pruning based on the lottery ticket hypothesis [1] are not only sparse but also effective recognizers in data-limited regimes. Based on extensive experiments, we find that in low-data regimes (datasets of 50-100 examples per class), sparse winning tickets substantially outperform the original dense networks. This approach, when combined with augmentations or fine-tuning from a self-supervised backbone network, shows further improvements in performance by as much as 16% (absolute) on low-sample datasets and long-tailed classification. Further, sparse winning tickets are more robust to synthetic noise and distribution shifts compared to their dense counterparts. Our analysis of winning tickets on small datasets indicates that, though sparse, the networks retain density in the initial layers and their representations are more generalizable.
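As a rough illustration of the magnitude pruning step mentioned in this abstract, the following sketch zeroes out the smallest-magnitude weights of a flat weight list and returns the resulting binary mask. It is a minimal sketch and not the paper's code: the function name `magnitude_prune` and the `sparsity` parameter are hypothetical, and the full lottery-ticket procedure (iterative pruning and resetting the surviving weights to their initial values) is not shown.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Hypothetical helper for illustration only. Returns (pruned_weights, mask),
    where mask[i] is 1 for a kept weight and 0 for a pruned one.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")

    # Number of weights to remove, rounded down.
    n_prune = int(len(weights) * sparsity)

    # Indices ordered by ascending absolute value; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:n_prune])

    mask = [0 if i in pruned_indices else 1 for i in range(len(weights))]
    pruned_weights = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned_weights, mask


# Example: prune half of four weights.
example_weights, example_mask = magnitude_prune([0.5, -0.1, 2.0, 0.05], 0.5)
```

In practice the mask would be applied per layer or globally across a network's tensors; the flat list here only keeps the example self-contained.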


Dimensionality Reduction of Massive Sparse Datasets Using Coresets

Neural Information Processing Systems

In this paper we present a practical solution with performance guarantees to the problem of dimensionality reduction for very large scale sparse matrices. We show applications of our approach to computing the Principal Component Analysis (PCA) of any n × d matrix, using one pass over the stream of its rows. Our solution uses coresets: a scaled subset of the n rows that approximates their sum of squared distances to every k-dimensional affine subspace. An open theoretical problem has been to compute such a coreset that is independent of both n and d. An open practical problem has been to compute a non-trivial approximation to the PCA of very large but sparse databases such as the Wikipedia document-term matrix in a reasonable time. We answer both of these questions affirmatively. Our main technical result is a new framework for deterministic coreset constructions based on a reduction to the problem of counting items in a stream.


Expanding Sparse Tuning for Low Memory Usage

Neural Information Processing Systems

Parameter-efficient fine-tuning (PEFT) is an effective method for adapting pre-trained vision models to downstream tasks by tuning a small subset of parameters. Among PEFT methods, sparse tuning achieves superior performance by only adjusting the weights most relevant to downstream tasks, rather than densely tuning the whole weight matrix. However, this performance improvement has been accompanied by increases in memory usage, which stem from two factors: the storage of the whole weight matrix as learnable parameters in the optimizer and the additional storage of tunable weight indexes. In this paper, we propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage. To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices, avoiding the costly storage of the whole original matrix. A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes. To maintain the effectiveness of sparse tuning with low-rank matrices, we extend the low-rank decomposition by applying nonlinear kernel functions to the whole-matrix merging. Consequently, we gain an increase in the rank of the merged matrix, enhancing the ability of SNELL to adapt the pre-trained models to downstream tasks. Extensive experiments on multiple downstream tasks show that SNELL achieves state-of-the-art performance with low memory usage, endowing PEFT with sparse tuning to large-scale models.


Select-and-Sample for Spike-and-Slab Sparse Coding

Neural Information Processing Systems

Probabilistic inference serves as a popular model for neural processing. It is still unclear, however, how approximate probabilistic inference can be accurate and scalable to very high-dimensional continuous latent spaces, especially as typical posteriors for sensory data can be expected to exhibit complex latent dependencies including multiple modes. Here, we study an approach that can efficiently be scaled while maintaining a richly structured posterior approximation under these conditions. As an example model we use spike-and-slab sparse coding for V1 processing, and combine latent subspace selection with Gibbs sampling (select-and-sample).


Proximal SCOPE for Distributed Sparse Learning

Neural Information Processing Systems

Distributed sparse learning with a cluster of multiple machines has attracted much attention in machine learning, especially for large-scale applications with high-dimensional data. One popular way to implement sparse learning is to use L1 regularization. In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with L1 regularization.
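To make the role of L1 regularization concrete, the sketch below shows the standard soft-thresholding operator (the proximal operator of the scaled L1 norm) and a single proximal gradient step built on it. This is a generic single-machine illustration under stated assumptions, not the pSCOPE algorithm itself; the names `soft_threshold`, `proximal_gradient_step`, `step_size`, and `l1_strength` are hypothetical.

```python
def soft_threshold(x, threshold):
    """Proximal operator of threshold * |x|: shrink x toward zero by threshold."""
    if x > threshold:
        return x - threshold
    if x < -threshold:
        return x + threshold
    return 0.0  # values inside [-threshold, threshold] become exactly zero


def proximal_gradient_step(weights, gradient, step_size, l1_strength):
    """One proximal gradient step for a smooth loss plus l1_strength * ||w||_1.

    Takes a plain gradient step on the smooth part, then applies
    soft-thresholding with threshold step_size * l1_strength.
    """
    threshold = step_size * l1_strength
    return [
        soft_threshold(w - step_size * g, threshold)
        for w, g in zip(weights, gradient)
    ]


# Example: with a zero gradient, only the shrinkage acts on the weights.
updated = proximal_gradient_step([1.0, -0.2], [0.0, 0.0], step_size=0.5, l1_strength=1.0)
```

The exact zeros produced by the thresholding are what make L1-regularized solutions sparse.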



80f2f15983422987ea30d77bb531be86-Paper.pdf

Neural Information Processing Systems

We then separate the optimization process into two steps, corresponding to weight update and structure parameter update. For the former step, we use the conventional chain rule, which can be sparse via exploiting the sparse structure.