Reviews: Neural Ordinary Differential Equations
Given the ever-increasing importance of AD in both communities, extending the range of scientific computing primitives through which frameworks such as autograd can efficiently compute derivatives will hopefully spur more widespread use of gradient-based learning and inference methods with ODE models, and encourage other frameworks with AD capability, such as Stan, TensorFlow, and PyTorch, to implement adjoint sensitivity methods.

The suggested applications of the 'ODE solver modelling primitive' in ODE-Nets, CNFs, and L-ODEs are all interesting demonstrations of the computational and modelling advantages of a continuous-time ODE formulation. In particular, the memory savings made possible by recomputing trajectories backwards through time, rather than storing all intermediate states, could be a major gain given that device memory is currently often a bottleneck.

While 'reversing' the integration to recompute the reverse trajectory is an appealing idea, more discussion of when this would be expected to break down would have been helpful: highly chaotic dynamical systems, for example, seem likely to be problematic, since even small errors in the initial backward steps could quickly lead to large divergences between the reversed and forward trajectories. A useful sanity check in an implementation would be to compare the final state of the reversed trajectory to the initial state of the forward trajectory and check how closely they agree (a minimal version of this check is sketched below).

The submission is generally very well written and presented in a clear expository style, with useful illustrative examples in the experiments to support the claims made, and well-thought-out figures that help give visual intuition about the methods and results.
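A minimal sketch of that reversibility check, using SciPy's solve_ivp on a toy damped-oscillator system (the test system, tolerances, and names here are illustrative choices, not from the paper or the review):

```python
# Sanity check suggested in the review: integrate an ODE forward, then
# integrate backward from the final state, and compare the recovered
# initial state with the true one.
import numpy as np
from scipy.integrate import solve_ivp

def f(t, z):
    # Damped harmonic oscillator: a well-behaved, non-chaotic test system.
    x, v = z
    return [v, -x - 0.1 * v]

z0 = np.array([1.0, 0.0])
t0, t1 = 0.0, 10.0

# Forward pass: z(t0) -> z(t1).
fwd = solve_ivp(f, (t0, t1), z0, rtol=1e-8, atol=1e-8)
z1 = fwd.y[:, -1]

# Backward pass: integrate from t1 back to t0, starting at z(t1).
bwd = solve_ivp(f, (t1, t0), z1, rtol=1e-8, atol=1e-8)
z0_recovered = bwd.y[:, -1]

# For well-behaved dynamics this error should be tiny; for chaotic
# systems it can blow up, which is the failure mode the review warns about.
print("reversibility error:", np.linalg.norm(z0 - z0_recovered))
```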
Stochastic Variance-Reduced Iterative Hard Thresholding in Graph Sparsity Optimization
Fox, Derek, Hernandez, Samuel, Tong, Qianqian
Stochastic optimization algorithms are widely used for large-scale data analysis due to their low per-iteration costs, but they often suffer from slow asymptotic convergence caused by inherent variance. Variance-reduction techniques have therefore been used to address this issue in structured sparse models based on sparsity-inducing norms or $\ell_0$-norms. However, these techniques are not directly applicable to complex (non-convex) graph sparsity models, which are essential in applications such as disease outbreak monitoring and social network analysis. In this paper, we introduce two stochastic variance-reduced gradient-based methods for graph sparsity optimization: GraphSVRG-IHT and GraphSCSG-IHT. We provide a general framework for theoretical analysis, demonstrating that our methods enjoy linear convergence speed. Extensive experiments validate the effectiveness of our methods.
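To make the algorithmic skeleton concrete, here is a minimal sketch of SVRG-style iterative hard thresholding for least squares. The paper's GraphSVRG-IHT projects onto graph-structured supports; plain top-s hard thresholding is substituted below purely for illustration, and the function names and step size are assumptions, not the authors' implementation:

```python
# SVRG-IHT sketch: variance-reduced stochastic gradients plus a hard-
# thresholding projection after each step. NOT the paper's graph projection.
import numpy as np

def hard_threshold(w, s):
    """Keep the s largest-magnitude entries of w, zero out the rest."""
    out = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-s:]
    out[idx] = w[idx]
    return out

def svrg_iht(A, y, s, eta=0.01, outer=20, inner=100, seed=None):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(outer):
        w_snap = w.copy()
        full_grad = A.T @ (A @ w_snap - y) / n        # full gradient at snapshot
        for _ in range(inner):
            i = rng.integers(n)
            gi = A[i] * (A[i] @ w - y[i])             # stochastic gradient at w
            gi_snap = A[i] * (A[i] @ w_snap - y[i])   # same sample at snapshot
            g = gi - gi_snap + full_grad              # variance-reduced gradient
            w = hard_threshold(w - eta * g, s)        # IHT projection step
    return w
```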
A Computation and Communication Efficient Method for Distributed Nonconvex Problems in the Partial Participation Setting
Tyurin, Alexander, Richtárik, Peter
We present a new method that combines three key components of distributed optimization and federated learning: variance reduction of stochastic gradients, partial participation, and compressed communication. We prove that the new method has optimal oracle complexity and state-of-the-art communication complexity in the partial participation setting. Independently of the communication compression feature, our method successfully combines variance reduction and partial participation: we obtain the optimal oracle complexity, never need the participation of all nodes, and do not require the bounded gradients (dissimilarity) assumption.
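As a toy illustration of two of the three ingredients the abstract names, the sketch below runs one server round with partial participation and top-k gradient compression; variance reduction is omitted for brevity. The function names, sampling fraction, and compressor choice are illustrative assumptions, not the paper's algorithm:

```python
# One federated round: sample a subset of nodes (partial participation),
# collect top-k-compressed gradients, average, and take a server step.
import numpy as np

def topk_compress(g, k):
    """Top-k sparsification: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def federated_round(w, grad_fns, participation=0.2, k=10, eta=0.1, seed=None):
    rng = np.random.default_rng(seed)
    n = len(grad_fns)                                 # grad_fns[i](w): node i's gradient
    m = max(1, int(participation * n))
    active = rng.choice(n, size=m, replace=False)     # partial participation
    msgs = [topk_compress(grad_fns[i](w), k) for i in active]  # compressed uplink
    return w - eta * np.mean(msgs, axis=0)            # server update
```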
A Theoretical Perspective of Machine Learning with Computational Resource Concerns
Conventional theoretical machine learning studies generally assume, explicitly or implicitly, that computational resources are sufficient or even infinitely supplied. In real practice, however, computational resources are usually limited, and the performance of machine learning depends not only on how much data has been received, but also on how much data can be handled with the available computational resources. Note that most current "intelligent supercomputing" facilities work like exclusive operating systems, where a fixed amount of resources is allocated to a machine learning task without adaptive scheduling strategies that consider important factors such as learning performance demands and learning process status. In this article, we introduce the notion of machine learning throughput, define Computational Resource Efficient Learning (CoRE-Learning), and present a theoretical framework that takes the influence of computational resources into account in learning theory. This framework can be naturally applied to stream learning, where the incoming data streams can be potentially endless and of overwhelming size, so it is impractical to assume that all received data can be handled in time. It may also provide a theoretical perspective on the design of intelligent supercomputing operating systems.
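A toy sketch of the resource-bounded stream setting the abstract describes: data arrive faster than they can be processed, so items beyond a per-step processing budget are dropped. The throughput ratio computed here is an assumed, informal reading of the article's notion, not its formal definition:

```python
# Illustrative only: "throughput" below is taken informally as the fraction
# of received items the learner actually manages to process.
def stream_learning(stream, budget_per_step, learner_step):
    received = processed = 0
    for batch in stream:                   # one batch of (x, y) pairs per time step
        received += len(batch)
        handled = batch[:budget_per_step]  # resource limit: process at most this many
        for x, y in handled:
            learner_step(x, y)             # one learning update per handled item
        processed += len(handled)
    return processed / received            # empirical "throughput"
```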