 Statistical Learning


Appendices

Neural Information Processing Systems

Appendix A provides derivations supporting Section 3 in the main paper. In Appendix B, we explain our experimental setup, including dataset preparation and model implementation, in more detail. Finally, Appendix C provides additional results supporting our claims regarding the scalability of our method, together with additional results from the experiments presented in Section 4.

In this section we provide detailed derivations of the ST-DGMRF joint distribution, for both first-order transition models (Section A.1) and higher-order transition models (Section A.2).

A.1 Joint distribution

The LDS (see Sections 2.2 and 3.1 in the main paper) defines a joint distribution over system states. First, note that Eq. (1) can be written as a set of linear equations. Moving all $x_k$-terms to the left-hand side, we can rewrite this as a matrix-vector multiplication

$$
\underbrace{\begin{bmatrix}
I \\
-F_1 & I \\
& -F_2 & I \\
& & \ddots & \ddots \\
& & & -F_K & I
\end{bmatrix}}_{F} \, x = c + \epsilon .
$$

Empty positions in $F$ represent zero-blocks. Now, we can express $x$ as an affine transformation of $\epsilon$,

$$
x = F^{-1} c + F^{-1} \epsilon, \qquad (3)
$$

where $F^{-1}$ exists because $\det(F) = 1$. Since $\epsilon$ is distributed as $\epsilon \sim \mathcal{N}(0, Q^{-1})$ with $Q = \mathrm{diag}(Q_0, Q_1, \dots, Q_K)$, and $c$ is deterministic, we can use the affine property of Gaussian distributions to obtain the joint distribution $x \sim \mathcal{N}\big(F^{-1} c, (F^\top Q F)^{-1}\big)$, with mean $\mu = F^{-1} c$ and precision matrix $\Omega = F^\top Q F$.

[...] This reduces both computation and memory requirements. In contrast, the information vector $\eta = \Omega \mu$ can be expressed compactly as

$$
\eta = F^\top Q F F^{-1} c = F^\top Q c, \qquad (8)
$$

which can be computed efficiently using sparse and parallel matrix-vector multiplications on a GPU.
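The identity in Eq. (8) can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: it uses small dense NumPy arrays rather than the sparse GPU operations mentioned above, the sizes `n` and `K`, the random transition blocks, and the diagonal choice of `Q` are all made up for the example, and `build_F` is a hypothetical helper name.

```python
import numpy as np

def build_F(transitions, n):
    """Block lower-bidiagonal F: identity blocks on the diagonal, -F_k just below."""
    K = len(transitions)
    F = np.eye((K + 1) * n)
    for k, Fk in enumerate(transitions, start=1):
        F[k * n:(k + 1) * n, (k - 1) * n:k * n] = -Fk
    return F

rng = np.random.default_rng(0)
n, K = 3, 4  # toy state dimension and number of transitions (assumed)
transitions = [0.5 * rng.standard_normal((n, n)) for _ in range(K)]
# Q = diag(Q_0, ..., Q_K); each Q_k is taken to be diagonal here for simplicity.
Q = np.diag(rng.uniform(0.5, 2.0, size=(K + 1) * n))
c = rng.standard_normal((K + 1) * n)

F = build_F(transitions, n)
Omega = F.T @ Q @ F            # joint precision, Omega = F^T Q F
mu = np.linalg.solve(F, c)     # joint mean, mu = F^{-1} c
eta_direct = Omega @ mu        # eta = Omega mu
eta_compact = F.T @ (Q @ c)    # eta = F^T Q c, no solve with F required

# The block structure of F x = c implies mu_0 = c_0 and mu_k = F_k mu_{k-1} + c_k.
c_blocks = c.reshape(K + 1, n)
mu_rec = np.empty_like(c_blocks)
mu_rec[0] = c_blocks[0]
for k in range(1, K + 1):
    mu_rec[k] = transitions[k - 1] @ mu_rec[k - 1] + c_blocks[k]
```

Both routes give the same information vector, but the compact form only needs matrix-vector products with $Q$ and $F^\top$, which is what makes it attractive when $F$ and $Q$ are sparse.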



Deep Insights into Noisy Pseudo Labeling on Graph Data

Neural Information Processing Systems

Pseudo labeling (PL) is a widely applied strategy to enlarge the labeled dataset by self-annotating potential samples during the training process. Several works have shown that it can improve graph learning model performance in general. However, we notice that incorrect labels can be fatal to the graph training process. Inappropriate PL may degrade performance, especially on graph data, where the noise can propagate. Surprisingly, the corresponding error is seldom analyzed theoretically in the literature.


Preconditioning Matters: Fast Global Convergence of Non-convex Matrix Factorization via Scaled Gradient Descent

Neural Information Processing Systems

Low-rank matrix factorization (LRMF) is a canonical problem in non-convex optimization: the objective function to be minimized is non-convex and even non-smooth, which makes global convergence guarantees for gradient-based algorithms quite challenging.



On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms

Neural Information Processing Systems

The stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD, which matches the mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings. Our analysis employs more relaxed non-convex assumptions than previous literature. Nevertheless, we maintain the desired computational complexity as shuffling SGD has achieved in the general convex setting.

1 Introduction

In the last decade, neural network-based models have shown great success in many machine learning applications such as natural language processing [Collobert and Weston, 2008, Goldberg et al., 2018], computer vision and pattern recognition [Goodfellow et al., 2014, He and Sun, 2015].
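To make the shuffling scheme concrete, here is a minimal sketch of SGD with per-epoch reshuffling, in which every sample is visited exactly once per epoch. It is an illustration only: the objective is an over-parameterized least-squares problem (convex, chosen as a simple stand-in for the non-convex class studied in the paper), and the function name `shuffling_sgd`, the step size, and the problem sizes are assumptions made for this example.

```python
import numpy as np

def shuffling_sgd(A, b, epochs=500, lr=0.01, seed=0):
    """SGD on 0.5 * (a_i^T w - b_i)^2 with a fresh random permutation each epoch."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):          # each sample used once per epoch
            grad = (A[i] @ w - b[i]) * A[i]   # gradient of the i-th squared residual
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50))  # 20 samples, 50 parameters: over-parameterized
b = rng.standard_normal(20)
w = shuffling_sgd(A, b)
final_loss = 0.5 * np.mean((A @ w - b) ** 2)
```

Because the system has more parameters than samples, an interpolating solution exists and the training loss can be driven essentially to zero, which is the sense of "global solution" this toy example illustrates.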


Langevin Quasi-Monte Carlo

Neural Information Processing Systems

Langevin Monte Carlo (LMC) and its stochastic gradient versions are powerful algorithms for sampling from complex high-dimensional distributions. To sample from a distribution with density $\pi(\theta) \propto \exp(-U(\theta))$, LMC iteratively generates the next sample by taking a step in the gradient direction $\nabla U$ with added Gaussian perturbations. Expectations w.r.t. the target distribution $\pi$ are estimated by averaging over LMC samples. In ordinary Monte Carlo, it is well known that the estimation error can be substantially reduced by replacing independent random samples with quasi-random samples such as low-discrepancy sequences. In this work, we show that the estimation error of LMC can also be reduced by using quasi-random samples. Specifically, we propose to use completely uniformly distributed (CUD) sequences with a certain low-discrepancy property to generate the Gaussian perturbations. Under smoothness and convexity conditions, we prove that LMC with a low-discrepancy CUD sequence achieves smaller error than standard LMC. The theoretical analysis is supported by compelling numerical experiments, which demonstrate the effectiveness of our approach.
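The LMC update described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the target is a standard Gaussian, the step size and chain length are arbitrary, and the driving uniforms come from an ordinary pseudo-random generator. The function takes the uniform sequence as an argument only to show where a CUD sequence would be substituted; no CUD construction is given here.

```python
import numpy as np
from statistics import NormalDist

def lmc(grad_U, theta0, uniforms, step):
    """Unadjusted Langevin: theta <- theta - step * grad_U(theta) + sqrt(2 * step) * xi."""
    # Map uniforms to standard normal perturbations via the inverse CDF;
    # clipping keeps the inputs strictly inside (0, 1).
    clipped = np.clip(uniforms, 1e-12, 1.0 - 1e-12)
    xis = np.vectorize(NormalDist().inv_cdf)(clipped)
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty_like(xis)
    for k, xi in enumerate(xis):
        theta = theta - step * grad_U(theta) + np.sqrt(2.0 * step) * xi
        samples[k] = theta
    return samples

def grad_U(theta):
    # U(theta) = ||theta||^2 / 2, so the target pi is a standard normal.
    return theta

rng = np.random.default_rng(0)
uniforms = rng.uniform(size=(20000, 2))   # i.i.d. uniforms, one row per LMC step
samples = lmc(grad_U, np.zeros(2), uniforms, step=0.05)
estimate = samples[2000:].mean(axis=0)    # estimate of E[theta] after burn-in; true value is 0
```

Generating the Gaussian perturbations by inverse-CDF transformation of a uniform sequence is what allows the source of uniforms to be swapped without touching the rest of the sampler.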



Knowledge Distillation Performs Partial Variance Reduction

Neural Information Processing Systems

Knowledge distillation is a popular approach for enhancing the performance of "student" models, with lower representational capacity, by taking advantage of more powerful "teacher" models. Despite its apparent simplicity and widespread use, the underlying mechanics behind knowledge distillation (KD) are still not fully understood. In this work, we shed new light on the inner workings of this method, by examining it from an optimization perspective. We show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism. We provide a detailed convergence analysis of the resulting dynamics, which hold under standard assumptions for both strongly-convex and non-convex losses, showing that KD acts as a form of partial variance reduction, which can reduce the stochastic gradient noise, but may not eliminate it completely, depending on the properties of the "teacher" model. Our analysis puts further emphasis on the need for careful parametrization of KD, in particular w.r.t. the weighting of the distillation loss, and is validated empirically on both linear models and deep neural networks.
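The effect of the distillation weight on gradient noise can be illustrated on a toy linear regression. The sketch below is an assumption-laden example, not the paper's analysis: it uses squared loss, a convex combination of label loss and teacher loss with a hypothetical weight `lam`, and an idealized teacher equal to the true parameters. Under these assumptions, the per-sample gradient at the optimum is scaled by `(1 - lam)`, so its variance shrinks by `(1 - lam) ** 2` without vanishing unless `lam = 1`.

```python
import numpy as np

def kd_grads(w, X, y, teacher_w, lam):
    """Per-sample gradients of (1 - lam) * 0.5 * (x@w - y)^2 + lam * 0.5 * (x@w - x@teacher_w)^2."""
    # For squared loss, mixing the two losses equals regressing on a mixed target.
    target = (1.0 - lam) * y + lam * (X @ teacher_w)
    return (X @ w - target)[:, None] * X

rng = np.random.default_rng(0)
n, d = 5000, 10
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + rng.standard_normal(n)   # labels with additive noise

def grad_noise(lam, teacher_w):
    """Total variance of the per-sample gradients, evaluated at w_true."""
    g = kd_grads(w_true, X, y, teacher_w, lam)
    return np.mean(np.sum((g - g.mean(axis=0)) ** 2, axis=1))

noise_plain = grad_noise(0.0, w_true)   # no distillation
noise_kd = grad_noise(0.5, w_true)      # half the weight on the (idealized) teacher
```

With an imperfect teacher the mixed target no longer matches the true regression function, which is one way to see why the choice of `lam` needs care.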


Streaming Algorithms and Lower Bounds for Estimating Correlation Clustering Cost

Neural Information Processing Systems

Correlation clustering is a fundamental optimization problem at the intersection of machine learning and theoretical computer science. Motivated by applications to big data processing, recent years have witnessed a flurry of results on this problem in the streaming model. In this model, the algorithm needs to process the input n-vertex graph by making one or a few passes over the stream of its edges while using limited memory, much smaller than the input size. All previous work on streaming correlation clustering has focused on semi-streaming algorithms with Ω(n) memory, whereas in this work we study streaming algorithms with much smaller memory requirements of only polylog(n) bits. This stringent memory requirement is in the same spirit as classical streaming algorithms that, instead of recovering a full solution to the problem (which can be prohibitively large with such small memory, as is the case in our problem), aimed to learn certain statistical properties of their inputs.