 Gradient Descent


Learning in Log-Domain: Subthreshold Analog AI Accelerator Based on Stochastic Gradient Descent

arXiv.org Artificial Intelligence

In recent years, artificial intelligence (AI) has become an integral part of daily life, serving as a transformative tool across various professional domains [1] and driving personal applications through advancements in transformer models that power large language models (LLMs) [2]. However, both training and inference of AI models demand substantial computational and energy resources, which are becoming increasingly challenging to access [3, 4]. While server-class GPUs are effective for training, their energy inefficiency [5] and high costs present significant barriers [6]. Additionally, the environmental impact of energy-intensive AI systems has raised critical concerns about their role in exacerbating climate change [4]. Amdahl's law predicts that performance and efficiency gains are best achieved through innovative application-specific accelerator architectures rather than scaling up multi-core general-purpose processors [7]. Consequently, application-specific integrated circuits (ASICs), both digital and analog, have emerged as critical solutions for enabling high-efficiency training and inference of artificial neural networks [7, 8, 9]. Digital accelerators are widely adopted for training workloads. Notable examples include the Brainwave Neural Processing Unit (NPU) [10], Google's Tensor Processing Unit (TPU) [11], and low-precision inference accelerators such as YodaNN [5], the Unified Neural Processing Unit (UNPU) [12], and BRein Memory [13].


Personalized Federated Learning for Cellular VR: Online Learning and Dynamic Caching

arXiv.org Artificial Intelligence

Delivering an immersive experience to virtual reality (VR) users through wireless connectivity offers the freedom to engage from anywhere at any time. Nevertheless, it is challenging to ensure seamless wireless connectivity that delivers real-time, high-quality video to VR users. This paper proposes field-of-view (FoV)-aware caching for mobile edge computing (MEC)-enabled wireless VR networks. In particular, the FoV of each VR user is cached/prefetched at the base stations (BSs) based on caching strategies tailored to each BS. Specifically, decentralized and personalized federated learning (DP-FL) based caching strategies with guarantees are presented. Considering VR systems composed of multiple VR devices and BSs, a DP-FL caching algorithm is implemented at each BS to personalize content delivery for VR users. The utilized DP-FL algorithm guarantees a probably approximately correct (PAC) bound on the conditional average cache hit. Further, to reduce the cost of communicating gradients, one-bit quantization of the stochastic gradient descent (OBSGD) is proposed, and a convergence guarantee of $\mathcal{O}(1/\sqrt{T})$ is obtained for the proposed algorithm, where $T$ is the number of iterations. Additionally, to better account for the wireless channel dynamics, the FoVs are grouped into multicast or unicast groups based on the number of requesting VR users. The performance of the proposed DP-FL algorithm is validated on a realistic VR head-tracking dataset, and the proposed algorithm is shown to outperform baseline algorithms in terms of average delay and cache hit.
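As a rough illustration of what one-bit gradient quantization means, the sketch below runs sign-based SGD on a toy quadratic objective. It is not the paper's OBSGD algorithm: the objective, the Gaussian noise model, the 1/sqrt(t) step size, and every function name here are assumptions made purely for illustration.

```python
import math
import random


def one_bit_quantize(grad):
    """Keep only the sign of each coordinate (+1 or -1): one bit per entry."""
    return [1.0 if g >= 0 else -1.0 for g in grad]


def loss(w, target):
    """Toy quadratic objective (an assumption, not the caching loss)."""
    return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def sign_sgd(target, steps=2000, base_lr=0.5, noise=0.1, seed=0):
    """SGD in which only the sign bits of the stochastic gradient are used."""
    rng = random.Random(seed)
    w = [0.0] * len(target)
    for t in range(1, steps + 1):
        # Stochastic gradient: exact gradient of the quadratic plus noise.
        grad = [(wi - ti) + rng.gauss(0.0, noise) for wi, ti in zip(w, target)]
        bits = one_bit_quantize(grad)   # what would be communicated
        lr = base_lr / math.sqrt(t)     # decaying step size (assumed schedule)
        w = [wi - lr * b for wi, b in zip(w, bits)]
    return w
```

Calling `sign_sgd([1.0, -2.0, 3.0])` returns weights close to the target even though each step sees only one bit per coordinate; the decaying step size is what lets the iterates settle.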


Review for NeurIPS paper: Projected Stein Variational Gradient Descent

Neural Information Processing Systems

Strengths: Preface: I understand that reviews that claim that a method is not sufficiently novel or significant are often subjective and are difficult for authors to rebut. To make my review easier to engage with, I'm offering the following criteria along which I assess "significance" of a paper: (*i*) Does the paper offer a novel, non-obvious theoretical insight in the form of a proof or derivation? I will touch on these three criteria in my comments below and mark my comment accordingly. Relevance: Bayesian inference applied to a variety of problems is an active area of research and the paper under review proposes a novel algorithm for fast convergence to a posterior distribution in Bayesian inference problems. While the proposed method is still limited in the parameter dimension, it improves on related methods and makes Stein variational gradient descent more practically relevant.


Review for NeurIPS paper: Projected Stein Variational Gradient Descent

Neural Information Processing Systems

Leveraging low-dimensional structure in approximate inference algorithms is an interesting area of study, and this adaptation of SVGD is a promising approach. There were concerns about lack of clarity and presentation of the algorithm, as well as theoretical justification and motivation of this procedure.


Reviews: Global Convergence of Gradient Descent for Deep Linear Residual Networks

Neural Information Processing Systems

Response to authors' feedback: I thank the authors for the rebuttal. My score remains the same. With this initialization, the networks are shown to converge linearly to zero loss, under conditions (for discrete-time GD) that are different from and perhaps conceptually simpler than previous works. For instance, compared to reference [2] (Arora et al., "A convergence analysis of gradient descent for deep linear neural networks", ICLR 2019), this work completely removes the delta-balanced condition in [2] by showing that this condition actually holds, for most layers, on the GD trajectory (Lemma 4.2 and Eq. While certain elements have already been seen in previous works (e.g. the property in Lemma 4.2 is similar to the delta-balanced condition in [2], or the requirement of zero initialization for the last layer's weight has been seen in "fixup initialization" of reference [21] in the context of residual networks), I think the proposed initialization as well as the convergence analysis here deserve credit for novelty.
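To make the setting concrete, here is a minimal scalar sketch of gradient descent on a deep linear residual network whose last-layer weight starts at zero. It is not the paper's construction or proof setting: the scalar weights, the zero-initialized residual blocks, the single training point, and the learning rate are all assumptions chosen so the example stays tiny.

```python
def predict(a, ws, x):
    """Scalar deep linear residual net: x -> a * prod_l (1 + w_l) * x."""
    h = x
    for w in ws:
        h = h + w * h  # residual block: identity plus a linear map
    return a * h


def train(target=2.0, depth=5, lr=0.02, steps=500):
    """Plain gradient descent on the squared loss for one training point."""
    a = 0.0                # last-layer weight starts at zero
    ws = [0.0] * depth     # residual weights start at zero (assumption)
    x = 1.0
    losses = []
    for _ in range(steps):
        err = predict(a, ws, x) - target
        losses.append(0.5 * err * err)
        prod_all = 1.0
        for w in ws:
            prod_all *= 1.0 + w
        # Gradients of 0.5 * err^2 with respect to a and each w_l.
        grad_a = err * prod_all * x
        grad_ws = [err * a * x * prod_all / (1.0 + w) for w in ws]
        a -= lr * grad_a
        ws = [w - lr * g for w, g in zip(ws, grad_ws)]
    return a, ws, losses
```

In this toy run only the last-layer weight moves on the first step, because every residual gradient is multiplied by `a = 0`; after that all layers train together and the loss decays toward zero.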


Reviews: Global Convergence of Gradient Descent for Deep Linear Residual Networks

Neural Information Processing Systems

The reviewers appreciated the work on the initialization even if they deemed it incremental. The experiments on the nonlinear network in the rebuttal were useful and I encourage the authors to expand the experimental section using more realistic setups to show how the theory matters in practice.


Reviews: A Simple Baseline for Bayesian Uncertainty in Deep Learning

Neural Information Processing Systems

This paper presents SWAG, a method that uses the iterates of a Polyak-averaging-like stochastic gradient descent to approximate the posterior distribution of a neural network. It is presented as a simple baseline for uncertainty in large deep neural networks and the authors demonstrate its effectiveness on a variety of large-scale tasks including residual networks on CIFAR and ImageNet. The strengths of this paper are:
- it is indeed a simple baseline for a promising area of research that is really lacking good baselines
- experiments are thorough and on benchmarks that are large and interesting to the wider deep learning community
- the authors empirically evaluate the quality of their approximation and provide some analysis
The main criticism of this paper is that it is not really Bayesian from a purist perspective. R3 is correct to point out that the presented approximation cannot actually capture the true posterior as shown by Mandt et al. (Stochastic Gradient Descent as Approximate Bayesian Inference). The language of the paper at times implies otherwise and R3 is right to point this out (e.g.
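The idea of turning SGD iterates into an approximate posterior can be sketched as follows. This is a simplified, diagonal-Gaussian illustration on a one-parameter regression problem, not the SWAG implementation from the paper: the data generator, burn-in length, learning rate, and all function names are assumptions.

```python
import math
import random


def make_data(n=50, slope=2.0, noise=0.1, seed=0):
    """Toy regression data y = slope * x + noise (hypothetical setup)."""
    rng = random.Random(seed)
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [(x, slope * x + rng.gauss(0.0, noise)) for x in xs]


def sgd_iterates(data, steps=1000, lr=0.1, seed=1):
    """Yield the weight after every single-sample SGD step."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x, y = rng.choice(data)
        w -= lr * (w * x - y) * x  # gradient of 0.5 * (w*x - y)^2
        yield w


def gaussian_from_iterates(iterates, burn_in=200):
    """Running first and second moments of the iterates after burn-in."""
    n, mean, mean_sq = 0, 0.0, 0.0
    for i, w in enumerate(iterates):
        if i < burn_in:
            continue
        n += 1
        mean += (w - mean) / n
        mean_sq += (w * w - mean_sq) / n
    var = max(mean_sq - mean * mean, 0.0)  # clamp tiny negative round-off
    return mean, var


def sample_weights(mean, var, k, seed=2):
    """Draw k weights from the fitted Gaussian, e.g. for model averaging."""
    rng = random.Random(seed)
    return [rng.gauss(mean, math.sqrt(var)) for _ in range(k)]
```

Predictions would then be averaged over the sampled weights. As the review notes, a Gaussian fitted this way is an approximation and should not be read as the true posterior.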


On the Convergence Rate of Training Recurrent Neural Networks

Neural Information Processing Systems

How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima in training multi-layer neural networks? Why can they fit random labels even given non-convex and non-smooth architectures? Most existing theory only covers networks with one hidden layer, so can we go deeper? In this paper, we focus on recurrent neural networks (RNNs) which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks, because the same recurrent unit is repeatedly applied across the entire time horizon of length $L$, which is analogous to feedforward networks of depth $L$.


Reviews: Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

Neural Information Processing Systems

The paper was proofread, well-structured, and very clear. The experiments were clearly described in detail, and provided relevant results. Below we outline some detailed comments on the results. In particular, Chizat and Bach prove that the training of an NTK parameterized network is closely modeled by "lazy training" (their terminology for a linearized model). This paper is not referenced in the related work section.


Reviews: Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

Neural Information Processing Systems

This paper studies deep neural networks in the regime where the layer widths grow to infinity. Its main contribution is to show that the dynamics of gradient descent for optimizing an infinite width neural network can be explained by the first-order Taylor expansion of the network around its initial parameters, given by the NTK of Jacot et al. Reviewers all agreed this is a valuable contribution which helps the current efforts on understanding the inner workings of gradient descent on large neural networks and its role with regards to generalisation. Despite some concerns about the applicability of this regime to explain the empirical performance of large deep nets and some concurrent work (Chizat and Bach), the authors successfully addressed these concerns in the rebuttal and therefore the AC recommends acceptance.
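The first-order Taylor expansion mentioned above can be illustrated on a toy model. The sketch below linearizes a two-parameter function around its initial parameters and measures how far the linear model drifts from the true one; the specific function `v * tanh(w * x)`, the initial parameters, and the input are assumptions, and nothing here reproduces the infinite-width analysis.

```python
import math


def f(theta, x):
    """Toy 'network' with parameters theta = (w, v): f = v * tanh(w * x)."""
    w, v = theta
    return v * math.tanh(w * x)


def grad_f(theta, x):
    """Gradient of f with respect to (w, v)."""
    w, v = theta
    t = math.tanh(w * x)
    return (v * x * (1.0 - t * t), t)


def f_lin(theta, theta0, x):
    """First-order Taylor expansion of f around the initial parameters theta0."""
    g = grad_f(theta0, x)
    shift = sum(gi * (ti - t0i) for gi, ti, t0i in zip(g, theta, theta0))
    return f(theta0, x) + shift


def linearization_error(step, theta0=(0.5, 1.0), x=0.7):
    """Absolute gap between f and f_lin after moving every parameter by step."""
    theta = tuple(t + step for t in theta0)
    return abs(f(theta, x) - f_lin(theta, theta0, x))
```

The gap shrinks roughly quadratically with the parameter displacement, which is why a model whose parameters stay close to initialization during training is well described by its linearization.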