Understanding the Under-Coverage Bias in Uncertainty Estimation
Estimating the data uncertainty in regression tasks is often done by learning a quantile function or a prediction interval of the true label conditioned on the input. It is frequently observed that quantile regression--a vanilla algorithm for learning quantiles with asymptotic guarantees--tends to cover less than the desired coverage level in practice. While various fixes have been proposed, a more fundamental understanding of why this under-coverage bias happens in the first place remains elusive. In this paper, we present a rigorous theoretical study on the coverage of uncertainty estimation algorithms in learning quantiles. We prove that quantile regression suffers from an inherent under-coverage bias, in a vanilla setting where we learn a realizable linear quantile function and there is more data than parameters. More quantitatively, for α > 0.5 and small d/n, the α-quantile learned by quantile regression roughly achieves coverage α - (α - 1/2) · d/n regardless of the noise distribution, where d is the input dimension and n is the number of training samples. Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error that is not implied by existing theories on quantile regression. Experiments on simulated and real data verify our theory and further illustrate the effect of various factors such as sample size and model capacity on the under-coverage bias in more practical setups.
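The predicted coverage formula lends itself to a quick simulation check. Below is a minimal sketch (our illustration, not the paper's code) that fits a linear α-quantile by pinball-loss regression with statsmodels and compares the empirical coverage against α - (α - 1/2) · d/n; the data-generating process and all parameter choices are assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, d, alpha = 1000, 50, 0.9

# Realizable linear setting: y = <x, w*> + noise; the intercept absorbs
# the noise quantile, so the true alpha-quantile function is linear.
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + rng.normal(size=n)

w_hat = sm.QuantReg(y, sm.add_constant(X)).fit(q=alpha).params

# Empirical coverage on fresh data from the same distribution.
X_test = rng.normal(size=(200_000, d))
y_test = X_test @ w_star + rng.normal(size=200_000)
coverage = np.mean(y_test <= sm.add_constant(X_test) @ w_hat)

print(f"target coverage:    {alpha:.3f}")
print(f"theory prediction:  {alpha - (alpha - 0.5) * d / n:.3f}")
print(f"empirical coverage: {coverage:.3f}")
```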
Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization
Meisam Razaviyayn, Mingyi Hong, Zhi-Quan Luo, Jong-Shi Pang
Consider the problem of minimizing the sum of a smooth (possibly non-convex) function and a convex (possibly nonsmooth) function involving a large number of variables. A popular approach to solving this problem is the block coordinate descent (BCD) method, whereby at each iteration only one variable block is updated while the remaining variables are held fixed. With recent advances in multi-core parallel processing technology, it is desirable to parallelize the BCD method by allowing multiple blocks to be updated simultaneously at each iteration of the algorithm. In this work, we propose an inexact parallel BCD approach where, at each iteration, a subset of the variables is updated in parallel by minimizing convex approximations of the original objective function. We investigate the convergence of this parallel BCD method for both randomized and cyclic variable selection rules. We analyze the asymptotic and non-asymptotic convergence behavior of the algorithm for both convex and non-convex objective functions. The numerical experiments suggest that for the special case of the Lasso minimization problem, the cyclic block selection rule can outperform the randomized rule.
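To make the update concrete, here is a toy numpy sketch (our illustration under simplifying assumptions, not the authors' implementation) of the randomized rule for the Lasso special case, where minimizing a separable convex quadratic approximation over each selected coordinate reduces to soft-thresholding.

```python
# Toy sketch of the parallel BCD idea for the Lasso special case:
#   min_w 0.5 * ||X w - y||^2 + lam * ||w||_1
# A random subset of coordinates is updated in parallel at each iteration.
# All names and parameter choices here are our assumptions.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def parallel_bcd_lasso(X, y, lam, n_iters=500, block_frac=0.25, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    L = (X ** 2).sum(axis=0) + 1e-12   # per-coordinate curvature bounds
    r = X @ w - y                      # residual, maintained incrementally
    k = max(1, int(block_frac * d))
    for _ in range(n_iters):
        S = rng.choice(d, size=k, replace=False)   # randomized selection rule
        g = X[:, S].T @ r                          # partial gradient
        # Parallel update of the selected coordinates. For strongly
        # correlated columns a more conservative step (inflating L with the
        # block size) may be needed to guarantee descent.
        w_new = soft_threshold(w[S] - g / L[S], lam / L[S])
        r += X[:, S] @ (w_new - w[S])
        w[S] = w_new
    return w
```

The cyclic rule compared in the experiments would replace the random subset with an in-order sweep over a fixed partition of the coordinates.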
Supplementary material
In order to show the deterministic scaling of online SGD under a properly chosen time scale, we make use of a convergence result by [21, 31], adapted below as Theorem A.1 (Deterministic scaling limit of stochastic processes), where Ω(·) is the solution of Eq. (22). The reader interested in the proof is referred to the supplementary materials of [21, 31]. Although the theorem was not originally proven in the p setting, a glance at its proof shows that it still holds upon replacing C(·) by C(p, ·) in Assumptions A.1.1 and A.1.2, as well as in Equation (23).
Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks
Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the crossover between these two regimes in the high-dimensional setting, and in particular the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
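As a point of reference for the dynamics being described, the following numpy sketch (our illustration; architecture, scalings, and hyperparameters are assumptions, not the paper's setup) runs online SGD for a soft committee machine on Gaussian data, one fresh sample per step, with the conventional high-dimensional time rescaling t = step / d.

```python
# Minimal sketch of online (one-pass) SGD for a soft committee machine
# on Gaussian data. All choices below are illustrative assumptions.
import numpy as np

def online_sgd(d=1000, p=2, k=2, lr=0.5, steps=20_000, seed=0):
    rng = np.random.default_rng(seed)
    Wt = rng.normal(size=(k, d))            # teacher weights
    W = 0.01 * rng.normal(size=(p, d))      # student init
    for step in range(steps):               # time rescales as t = step / d
        x = rng.normal(size=d)              # one fresh Gaussian sample
        y = np.tanh(Wt @ x / np.sqrt(d)).sum()
        pre = W @ x / np.sqrt(d)
        err = np.tanh(pre).sum() - y        # squared-loss residual
        grad = err * (1.0 - np.tanh(pre) ** 2)
        W -= (lr / np.sqrt(d)) * np.outer(grad, x)
    # The order parameters W @ Wt.T / d concentrate as d grows; these are
    # the quantities a deterministic high-dimensional description tracks.
    return W @ Wt.T / d
```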
Training Neural Networks is ∃R-complete
Given a neural network, training data, and a threshold, finding weights for the neural network such that the total error is below the threshold is known to be NP-hard. We determine the algorithmic complexity of this fundamental problem precisely, by showing that it is ∃R-complete. This means that the problem is equivalent, up to polynomial time reductions, to deciding whether a system of polynomial equations and inequalities with integer coefficients and real unknowns has a solution. If, as widely expected, ∃R is strictly larger than NP, our work implies that the problem of training neural networks is not even in NP. Neural networks are usually trained using some variation of backpropagation. The result of this paper offers an explanation of why techniques commonly used to solve big instances of NP-complete problems seem not to be of use for this task. Examples of such techniques are SAT solvers, IP solvers, local search, and dynamic programming, to name a few general ones.
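As a concrete illustration (our example, not from the paper) of such a system over real unknowns, consider deciding whether

  ∃x, y ∈ ℝ :  x² = 2,  x ≥ 0,  x·y = 1

has a solution. It does (x = √2, y = 1/√2), but every solution is irrational; in general, satisfying assignments for such systems may require exponentially many bits to write down, which is one intuition for why the problem is not known to lie in NP.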
A Tightening the lower bound
We now introduce a modification to our lower bound that does make the bound tight. This new lower bound will be more complex than the one introduced above, and we have not yet successfully designed an algorithm for maximizing it. Nonetheless, we believe that presenting the bound may prove useful for the design of future model-based RL algorithms. Similar learned discount factors have been studied in previous work on model-free RL [39]. This new lower bound, which differs from our main lower bound by the learnable discount factor, does provide a tight bound on the expected return objective (Lemma A.1). The proof is presented in Appendix B.4. One important limitation of this result is that the learned dynamics that maximize this lower bound to make the bound tight may be non-Markovian. Intriguingly, this analysis suggests that using non-Markovian models, such as RNNs and transformers, may accelerate learning on Markovian tasks.
A Details of the proposed method
We can construct the pseudoinverse decoders for a wide range of neural network architectures. In addition, one may choose activation functions whose image Im(·) ≠ R, such as tanh. However, in that case, we must ensure that the input value to the pseudoinverse decoder is in Im(·) (in the case of tanh, this is (−1, 1)); otherwise, the computation would be invalid. Besides, similar to the Dirichlet encoder and pseudoinverse decoder, we could define a specific encoder and decoder for the Neumann boundary condition. However, this is not included in the contributions of our work because it does not improve the performance of our model. This may be because the Neumann boundary condition is a soft constraint, in contrast to the Dirichlet one, so expressive power seems more important than that inductive bias.
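For instance, with tanh the pseudoinverse is atanh, and the validity requirement above can be enforced by clipping the input into the open interval first; a minimal sketch (ours, with an assumed tolerance eps):

```python
# Minimal sketch (our illustration; eps is an assumed tolerance) of
# enforcing the validity requirement for a tanh pseudoinverse: the input
# must lie in Im(tanh) = (-1, 1), so we clip before inverting.
import numpy as np

def safe_atanh(z, eps=1e-6):
    z = np.clip(z, -1.0 + eps, 1.0 - eps)  # keep z inside the open interval
    return np.arctanh(z)
```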
Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics
Sergey Levine, Pieter Abbeel
We present a policy search method that uses iteratively refitted local linear models to optimize trajectory distributions for large, continuous problems. These trajectory distributions can be used within the framework of guided policy search to learn policies with an arbitrary parameterization. Our method fits time-varying linear dynamics models to speed up learning, but does not rely on learning a global model, which can be difficult when the dynamics are complex and discontinuous. We show that this hybrid approach requires many fewer samples than model-free methods, and can handle complex, nonsmooth dynamics that can pose a challenge for model-based techniques. We present experiments showing that our method can be used to learn complex neural network policies that successfully execute simulated robotic manipulation tasks in partially observed environments with numerous contact discontinuities and underactuation.
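The "time-varying linear dynamics models" component can be pictured as a per-timestep regression over sampled rollouts. Below is a minimal sketch (our illustration, not the authors' code; the shapes and the absence of regularization or priors are simplifying assumptions).

```python
# Minimal sketch of per-timestep least-squares fitting of time-varying
# linear dynamics x_{t+1} ~ A_t x_t + B_t u_t + c_t from N rollouts.
import numpy as np

def fit_tv_linear_dynamics(X, U):
    # X: (N, T+1, dx) states; U: (N, T, du) actions.
    N, _, dx = X.shape
    T, du = U.shape[1], U.shape[2]
    A, B, c = [], [], []
    for t in range(T):
        # Regressors at time t: state, action, and a constant offset.
        Z = np.concatenate([X[:, t], U[:, t], np.ones((N, 1))], axis=1)
        theta, *_ = np.linalg.lstsq(Z, X[:, t + 1], rcond=None)
        A.append(theta[:dx].T)
        B.append(theta[dx:dx + du].T)
        c.append(theta[-1])
    return np.stack(A), np.stack(B), np.stack(c)
```

Because each timestep is fit locally, no single global model has to capture complex or discontinuous dynamics, which matches the motivation stated in the abstract.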