
Neural Information Processing Systems

Author Feedback for Positional Normalization. We thank all reviewers for their insightful and constructive comments. Reviewers #1 and #2: Since the original paper submission, we have explored PONO in the context of other applications and model architectures. Although these results are still preliminary, they are consistently positive and highly encouraging. We apply one PONO-MS layer to AOD-Net [4] using the official codebase and test on the same evaluation datasets provided in the paper. We also add one PONO-MS layer to the visual processing part of the official codebase on the Habitat Platform [6]. Finally, we build an autoencoder consisting of 20 convolutional layers and test on Set5.
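The two operations in question are simple to state: PONO normalizes each spatial position across the channel dimension, and the moment shortcut (MS) re-injects the extracted moments into a later feature map. A minimal NumPy sketch of both, as our own illustration rather than the authors' code:

```python
import numpy as np

def pono(x, eps=1e-5):
    """Positional normalization: normalize each spatial position of an
    (N, C, H, W) tensor across the channel axis, returning the
    normalized tensor plus the extracted moments."""
    mu = x.mean(axis=1, keepdims=True)                  # (N, 1, H, W)
    sigma = np.sqrt(x.var(axis=1, keepdims=True) + eps)
    return (x - mu) / sigma, mu, sigma

def moment_shortcut(y, mu, sigma):
    """MS: re-inject the moments extracted by PONO into a later
    feature map of matching spatial shape."""
    return y * sigma + mu

x = np.random.randn(2, 8, 4, 4)
x_norm, mu, sigma = pono(x)
```

After `pono`, every spatial position has roughly zero mean and unit variance across channels, and `moment_shortcut(x_norm, mu, sigma)` exactly inverts the normalization.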


An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods Tamer Başar

Neural Information Processing Systems

In this paper, we revisit and improve the convergence of policy gradient (PG), natural PG (NPG) methods, and their variance-reduced variants, under general smooth policy parametrizations. More specifically, with the Fisher information matrix of the policy being positive definite: i) we show that a state-of-the-art variance-reduced PG method, which had only been shown to converge to stationary points, converges to the globally optimal value up to some inherent function approximation error due to policy parametrization; ii) we show that NPG enjoys a lower sample complexity; iii) we propose SRVR-NPG, which incorporates variance-reduction into the NPG update. Our improvements follow from an observation that the convergence of (variance-reduced) PG and NPG methods can improve each other: the stationary convergence analysis of PG can be applied to NPG as well, and the global convergence analysis of NPG can help to establish the global convergence of (variance-reduced) PG methods.
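The distinction between the two update rules can be made concrete on a toy problem. The sketch below is a hypothetical two-armed-bandit example of our own, not from the paper: it contrasts a plain PG step with an NPG step that preconditions the gradient by a damped Fisher information matrix.

```python
import numpy as np

rewards = np.array([1.0, 0.0])     # two-armed bandit, arm 0 is best

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_J(theta):
    """Exact policy gradient of J(θ) = Σ_a π_θ(a) r(a):
    ∇_j J = π_j (r_j − π·r)."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

def fisher(theta, damping=1e-3):
    """Fisher information F(θ) = Σ_a π(a) ∇logπ(a) ∇logπ(a)ᵀ,
    damped because the softmax Fisher is singular."""
    pi = softmax(theta)
    F = np.zeros((2, 2))
    for a in range(2):
        g = np.eye(2)[a] - pi              # ∇_θ log π_θ(a)
        F += pi[a] * np.outer(g, g)
    return F + damping * np.eye(2)

theta_pg, theta_npg = np.zeros(2), np.zeros(2)
for _ in range(200):
    theta_pg = theta_pg + 0.5 * grad_J(theta_pg)                 # PG step
    theta_npg = theta_npg + 0.5 * np.linalg.solve(               # NPG step
        fisher(theta_npg), grad_J(theta_npg))
```

Both iterations drive the probability of the rewarding arm toward one; the NPG direction is the PG direction rescaled by the (damped) inverse Fisher matrix.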


Author Response: convergence of several policy gradient methods, whose novelty is summarized in Lines 210-212

Neural Information Processing Systems

R1.1: "...these analysis mainly come from the existing work...the novelty is very limited." Our proposed SRVR-NPG has a better sample complexity than SRVR-PG (Remark 4.13). This paper focuses on laying the theoretical foundation for the global convergence of policy gradient methods, as do [1,15,26,47]. Note that none of [1,15,26,47] has numerical results; we nevertheless include a numerical result in the figure on the left.


A Maximum entropy calculation Under the constraint on the borders, τ and τ

Neural Information Processing Systems

Average pixel displacement magnitude δ. We derive here the large-c asymptotic behavior of δ (Eq. 3). The asymptotic relations for τ reported in the main text are computed in a similar fashion. In Fig. 8, we check the agreement between the asymptotic prediction and empirical measurements. If δ ≳ 1, our results strongly depend on the choice of interpolation method. Condition for diffeomorphism in the (T, c) plane. For a given value of c, there exists a temperature scale beyond which the transformation is no longer injective, affecting the topology of the image and creating spurious boundaries; see Fig. 9a-c for an illustration.
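As a rough illustration of the construction discussed here, a displacement field vanishing on the borders can be sampled as a sum of sine modes with Gaussian coefficients. The variance scaling T/(i²+j²) and the mode cutoff c below are assumptions chosen to match the maximum-entropy description, not the paper's exact code:

```python
import numpy as np

def max_entropy_displacement(n, T, c, rng):
    """Sample one component τ of a displacement field on an n×n grid:
    τ(x, y) = Σ_{i,j ≤ c} C_ij sin(iπx) sin(jπy), C_ij ~ N(0, T/(i²+j²)).
    The sine basis enforces τ = 0 on the image borders, matching the
    boundary constraint in the derivation."""
    u = np.linspace(0.0, 1.0, n)
    X, Y = np.meshgrid(u, u, indexing="ij")
    tau = np.zeros((n, n))
    for i in range(1, c + 1):
        for j in range(1, c + 1):
            C = rng.normal(0.0, np.sqrt(T / (i**2 + j**2)))
            tau += C * np.sin(i * np.pi * X) * np.sin(j * np.pi * Y)
    return tau

rng = np.random.default_rng(0)
tau = max_entropy_displacement(n=32, T=1e-3, c=3, rng=rng)
```

Raising the temperature T scales up the typical displacement, which is the knob behind the injectivity threshold discussed above.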


Relative stability toward diffeomorphisms indicates performance in deep nets

Neural Information Processing Systems

Understanding why deep nets can classify data in large dimensions remains a challenge. It has been proposed that they do so by becoming stable to diffeomorphisms, yet existing empirical measurements suggest that this is often not the case. We revisit this question by defining a maximum-entropy distribution on diffeomorphisms, which allows us to study typical diffeomorphisms of a given norm. We confirm that stability toward diffeomorphisms does not strongly correlate with performance on benchmark data sets of images.
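One way to quantify "stability toward diffeomorphisms" is to compare a network's sensitivity to small deformations against its sensitivity to generic noise of the same magnitude. The sketch below is our own illustration of that kind of ratio measurement; the function names and the matched-norm noise baseline are assumptions, not the paper's implementation:

```python
import numpy as np

def relative_stability(f, xs, diffeo, rng, n_samples=100):
    """Estimate R_f = D_f / G_f: sensitivity of f to small image
    deformations relative to isotropic noise of matched norm.
    `diffeo` is a caller-supplied callable producing a slightly
    deformed copy of its input (hypothetical placeholder)."""
    D = G = 0.0
    for _ in range(n_samples):
        x = xs[rng.integers(len(xs))]
        x_d = diffeo(x)
        delta = np.linalg.norm(x_d - x)
        eta = rng.normal(size=x.shape)
        eta *= delta / np.linalg.norm(eta)   # noise of the same norm
        D += np.sum((f(x_d) - f(x)) ** 2)
        G += np.sum((f(x + eta) - f(x)) ** 2)
    return D / G

# Sanity check: for the identity map both perturbations have equal
# norm by construction, so the ratio is exactly 1.
rng = np.random.default_rng(0)
xs = rng.normal(size=(16, 8, 8))
R = relative_stability(lambda x: x, xs,
                       lambda x: np.roll(x, 1, axis=0),
                       rng, n_samples=20)
```

A value R < 1 would indicate that f is more stable to the deformations than to random noise of equal magnitude.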


Smoothed Online Convex Optimization Based on Discounted-Normal-Predictor National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Neural Information Processing Systems

In this paper, we investigate an online prediction strategy named Discounted-Normal-Predictor [Kapralov and Panigrahy, 2010] for smoothed online convex optimization (SOCO), in which the learner needs to minimize not only the hitting cost but also the switching cost. In the setting of learning with expert advice, Daniely and Mansour [2019] demonstrate that Discounted-Normal-Predictor can be utilized to yield nearly optimal regret bounds over any interval, even in the presence of switching costs. Inspired by their results, we develop a simple algorithm for SOCO: combining online gradient descent (OGD) with different step sizes sequentially via Discounted-Normal-Predictor. Despite its simplicity, we prove that it is able to minimize the adaptive regret with switching cost, i.e., attaining nearly optimal regret with switching cost on every interval. By exploiting the theoretical guarantee of OGD for dynamic regret, we further show that the proposed algorithm can minimize the dynamic regret with switching cost in every interval.
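The OGD building block being combined is standard. The sketch below shows one projected-OGD instance together with the switching cost (total movement of the iterates) that SOCO charges on top of the hitting costs; the expert-combination step via Discounted-Normal-Predictor is omitted, so this is an illustration of the ingredient, not of the full algorithm:

```python
import numpy as np

def ogd_with_switching_cost(grads, eta, x0, radius=1.0):
    """One projected-OGD instance. Besides the iterates, return the
    total movement Σ_t ||x_{t+1} − x_t||, i.e. the switching cost
    accumulated along the trajectory."""
    x = np.asarray(x0, dtype=float)
    xs, switching = [x.copy()], 0.0
    for g in grads:
        x_new = x - eta * np.asarray(g, dtype=float)
        norm = np.linalg.norm(x_new)
        if norm > radius:                    # project onto the ball
            x_new *= radius / norm
        switching += np.linalg.norm(x_new - x)
        x = x_new
        xs.append(x.copy())
    return xs, switching

xs, switching = ogd_with_switching_cost(
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], eta=0.1, x0=[0.0, 0.0])
```

With step size η, each round contributes at most η‖g‖ to the switching cost, which is why the choice of step size trades off hitting cost against movement.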


Searching the Search Space of Vision Transformer -- Supplementary Material -- Minghao Chen

Neural Information Processing Systems

This supplementary material contains additional details for Sections 2.4, 3, and 4.4, and a discussion of the broader impacts of this paper. The details include searching in the searched space, where we describe the two steps of vision transformer search: (1) supernet training without resource constraints; (2) evolution search under a resource constraint. The Q-K-V dimension can be smaller than the embedding dimension; we suppose the underlying reason might be that the feature maps of the different heads are similar in deeper layers.
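Step (2) is a standard evolutionary loop. The sketch below is a generic illustration of evolution search under a resource budget; `score`, `cost`, and `mutate` are placeholders standing in for the supernet accuracy proxy, the FLOPs counter, and the search-space mutation, not the paper's implementation:

```python
import random

def evolution_search(score, cost, mutate, init_pop, budget,
                     generations=10, parents=4, children=8, seed=0):
    """Elitist evolutionary search: keep the best candidates, mutate
    them, and reject children whose resource cost exceeds the budget."""
    rng = random.Random(seed)
    pop = [a for a in init_pop if cost(a) <= budget]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        top = pop[:parents]
        kids = []
        while len(kids) < children:
            child = mutate(rng.choice(top), rng)
            if cost(child) <= budget:        # resource constraint
                kids.append(child)
        pop = top + kids
    return max(pop, key=score)

# Toy search space: architectures are (depth, width) pairs.
score = lambda a: a[0] * a[1]
cost = lambda a: a[0] + a[1]

def mutate(a, rng):
    b = list(a)
    i = rng.randrange(2)
    b[i] = min(10, max(1, b[i] + rng.choice([-1, 1])))
    return tuple(b)

best = evolution_search(score, cost, mutate, [(1, 1), (2, 3)], budget=10)
```

Because the parents survive each generation, the best score is monotone non-decreasing over generations.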


Searching the Search Space of Vision Transformer Minghao Chen

Neural Information Processing Systems

Vision Transformer has shown great visual representation power in substantial vision tasks such as recognition and detection, and has thus been attracting fast-growing efforts on manually designing more effective architectures. In this paper, we propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space. The central idea is to gradually evolve different search dimensions guided by their E-T Error computed using a weight-sharing supernet. Moreover, we provide design guidelines for general vision transformers with extensive analysis of the space searching process, which could promote the understanding of vision transformers. Remarkably, the searched models, named S3 (short for Searching the Search Space), from the searched space achieve superior performance to recently proposed models, such as Swin, DeiT and ViT, when evaluated on ImageNet. The effectiveness of S3 is also illustrated on object detection, semantic segmentation and visual question answering, demonstrating its generality to downstream vision and vision-language tasks. Code and models will be available here.


A Proof of Lemma 1: Objective Inconsistency in a Quadratic Model

Neural Information Processing Systems

When µ = 0, the algorithm reduces to FedAvg. For simplicity, we only consider the case in which all devices participate in each round. Here, we complete the proof of Lemma 1. B.1 SGD with Proximal Updates: in this case, we write out the update rule of the local models. C.1 Preliminaries: for ease of writing, we define a surrogate objective function F(x). C.5 Bounding the Difference Between the Server Gradient and the Normalized Gradient: recalling the definition of h, we are ready to derive the final result. C.6 Final Results: plugging (87) back into (64) yields the stated bound (97). C.7 Constraint on the Local Learning Rate: we summarize the constraints on the local learning rate, in particular ηL ≤ 1. Lemma 4: one can manually construct a strongly convex objective function such that FedAvg with heterogeneous local updates cannot converge to its global optimum; in particular, the gradient norm of the objective function does not vanish as the learning rate approaches zero.
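The objective-inconsistency phenomenon behind Lemma 1 is easy to reproduce on quadratics. The toy simulation below is our own construction, not the paper's exact example: two clients hold quadratic objectives, take different numbers of local SGD steps, and the FedAvg server average settles away from the true optimum even as the iterates converge.

```python
import numpy as np

# Two clients with quadratic objectives F_i(x) = ½ (x − e_i)²; the
# optimum of the uniform average ½ Σ F_i is mean(e) = 0.5.
e = np.array([0.0, 1.0])
tau = [1, 4]            # heterogeneous numbers of local SGD steps
lr = 0.05               # local learning rate

x = 0.0
for _ in range(2000):                       # FedAvg communication rounds
    local_models = []
    for i in range(2):
        xi = x
        for _ in range(tau[i]):             # τ_i local gradient steps
            xi -= lr * (xi - e[i])          # ∇F_i(x) = x − e_i
        local_models.append(xi)
    x = float(np.mean(local_models))        # server averages the models

# x settles near ≈0.79: the client taking more local steps is
# over-weighted, so FedAvg misses the true optimum 0.5.
```

The fixed point solves x = mean_i[e_i + (1−η)^{τ_i}(x − e_i)], which equals 0.5 only when the τ_i are equal; this mirrors the mismatched surrogate objective in the analysis.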