Hessian matrix


Boosting Adversarial Transferability by Achieving Flat Local Maxima

Neural Information Processing Systems

Specifically, we randomly sample an example and adopt a first-order procedure to approximate the Hessian/vector product, which makes the computation more efficient by interpolating two neighboring gradients.
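
The Hessian/vector-product approximation described above is essentially a finite difference of two neighboring gradients. A minimal sketch, using a hypothetical quadratic loss rather than the paper's adversarial objective:

```python
import numpy as np

def loss_grad(x):
    # Hypothetical smooth loss: gradient of 0.5 * x^T A x
    # (a stand-in for the paper's adversarial loss).
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    return A @ x

def hvp_approx(grad_fn, x, v, eps=1e-4):
    """First-order Hessian/vector-product approximation:
    H v ~ (grad(x + eps*v) - grad(x - eps*v)) / (2*eps),
    i.e. interpolating two neighboring gradients."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2 * eps)

x = np.array([0.5, -1.0])
v = np.array([1.0, 0.0])
print(hvp_approx(loss_grad, x, v))  # ≈ A @ v = [3. 1.]
```

For a quadratic loss the central difference is exact for any step size; in general the error is O(eps²) while needing only two gradient evaluations instead of a full Hessian.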


Double Randomized Underdamped Langevin with Dimension-Independent Convergence Guarantee

Yuanshi Liu, Cong Fang, Tong Zhang (School of Intelligence Science and Technology, Peking University)

Neural Information Processing Systems

Sampling from a high-dimensional distribution serves as one of the key components in statistics, machine learning, and scientific computing, and constitutes the foundation of fields including Bayesian statistics and generative models [Liu and Liu, 2001, Brooks et al., 2011, Song et al., …]
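
As background for the underdamped Langevin dynamics the paper studies, here is a minimal sketch of a plain Euler-type discretization targeting a standard Gaussian; the paper's double-randomized scheme and its dimension-independent guarantee are not reproduced here:

```python
import numpy as np

def underdamped_langevin(grad_U, d, n_steps=20000, h=0.05, gamma=2.0, seed=0):
    """Semi-implicit Euler discretization of underdamped Langevin dynamics:
        d(theta) = v dt,   dv = -gamma v dt - grad U(theta) dt + sqrt(2 gamma) dW.
    A generic sketch only, not the paper's double-randomized scheme."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    v = np.zeros(d)
    samples = []
    for t in range(n_steps):
        theta = theta + h * v
        v = (v - h * gamma * v - h * grad_U(theta)
             + np.sqrt(2.0 * gamma * h) * rng.standard_normal(d))
        if t > n_steps // 2:  # discard burn-in
            samples.append(theta.copy())
    return np.array(samples)

# Target: standard Gaussian, U(theta) = ||theta||^2 / 2, so grad U(theta) = theta.
samples = underdamped_langevin(lambda th: th, d=2)
print(samples.mean(axis=0), samples.var(axis=0))
# each ≈ 0 and ≈ 1, up to Monte Carlo and discretization error
```

The velocity variable gives the dynamics momentum, which is what makes refined discretizations of this process attractive for high-dimensional sampling.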


A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point

Couto, Carlos, Mourão, José, Figueiredo, Mário A. T., Ribeiro, Pedro

arXiv.org Machine Learning

Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for some classes of teacher-student problems, when the teacher and student networks have matching weights, showing that the smaller eigenvalues of the Hessian determine long-time learning performance. For linear networks, we analytically establish that for large networks the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be seen as an effective number of parameters for networks using polynomial activation functions. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix is always full rank.
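
The Marchenko-Pastur component of the spectrum can be illustrated in a simpler, related setting: for a single linear neuron with squared loss, the Hessian is the input second-moment matrix, whose eigenvalues follow the Marchenko-Pastur law for Gaussian inputs. A hedged numerical sketch, not the paper's teacher-student construction:

```python
import numpy as np

n, d = 4000, 200  # samples, parameters; aspect ratio q = d/n
rng = np.random.default_rng(1)
X = rng.standard_normal((n, d))

# For a single linear neuron y = w^T x with squared loss, the Hessian of the
# empirical loss w.r.t. w is H = X^T X / n, independent of w.
H = X.T @ X / n
eigs = np.linalg.eigvalsh(H)

# Marchenko-Pastur support for ratio q = d/n: [(1 - sqrt(q))^2, (1 + sqrt(q))^2]
q = d / n
lo, hi = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
print(eigs.min(), eigs.max(), (lo, hi))  # empirical edges near the MP support
```

Since q < 1 here, the support is bounded away from zero and the Hessian is full rank, consistent with the full-rank observation for generic activations.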


Comparing BFGS and OGR for Second-Order Optimization

Przybysz, Adrian, Kołek, Mikołaj, Sobota, Franciszek, Duda, Jarek

arXiv.org Artificial Intelligence

Across standard test functions and ablations with/without line search, OGR variants match or outperform BFGS in final objective and step efficiency, with particular gains in nonconvex landscapes where saddle handling matters. Exact Hessians (via AD) are used only as an oracle baseline to evaluate estimation quality, not to form steps.

II. Online Gradient Regression (OGR)

Online Gradient Regression (OGR) is a second-order optimization framework that accelerates stochastic gradient descent (SGD) by online least-squares regression of noisy gradients to infer local curvature and the distance to a stationary point [3]. The central assumption is that, in a small neighborhood, the objective F(θ) is well approximated by a quadratic model, so the gradient varies approximately linearly with the parameters. OGR maintains exponentially weighted statistics of recent (θ_t, g_t) pairs and updates a local model each iteration at negligible extra cost compared to computing the gradient itself [2], [3].

A. Direct multivariate approach

At a given time T, based on recent gradients g_t ∈ R^d and positions θ_t ∈ R^d for t < T, we would like to locally approximate the behavior with a second-order polynomial using the parametrization

f(θ) = h + ½ (θ − p)ᵀ H (θ − p),   ∇f = H (θ − p),

for Hessian H ∈ R^{d×d} and p ∈ R^d the position of the saddle or extremum. For local behavior we work with averages whose weights w_t decrease exponentially with age.
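
The regression step described above can be sketched as follows: fit g ≈ H(θ − p) by exponentially weighted least squares over recent (θ_t, g_t) pairs, then read off H and the stationary point p. This is a simplified full-matrix sketch of the idea, not the authors' implementation:

```python
import numpy as np

def ogr_estimate(thetas, grads, beta=0.9):
    """Exponentially weighted least-squares fit of g ~ H (theta - p):
    regress centered gradients on centered positions to recover H, then
    solve H (theta_bar - p) = g_bar for the stationary point p."""
    T = len(thetas)
    w = beta ** np.arange(T - 1, -1, -1)   # older pairs weigh less
    w = w / w.sum()
    Th = np.array(thetas)                  # (T, d) positions
    G = np.array(grads)                    # (T, d) gradients
    th_bar = w @ Th                        # weighted mean position
    g_bar = w @ G                          # weighted mean gradient
    dTh = Th - th_bar
    dG = G - g_bar
    # Weighted regression: H ~ (dG^T W dTh) (dTh^T W dTh)^{-1}
    C_gt = (dG * w[:, None]).T @ dTh
    C_tt = (dTh * w[:, None]).T @ dTh
    H = C_gt @ np.linalg.pinv(C_tt)
    # g_bar ~ H (th_bar - p)  =>  p = th_bar - H^{-1} g_bar
    p = th_bar - np.linalg.solve(H, g_bar)
    return H, p

# Sanity check on an exact quadratic f(theta) = 0.5 (theta-p*)^T H* (theta-p*):
H_true = np.array([[2.0, 0.5], [0.5, 1.0]])
p_true = np.array([1.0, -2.0])
rng = np.random.default_rng(0)
thetas = [rng.standard_normal(2) for _ in range(20)]
grads = [H_true @ (th - p_true) for th in thetas]
H_est, p_est = ogr_estimate(thetas, grads)
print(np.round(p_est, 3))  # recovers p_true = [1., -2.] on exact gradients
```

On an exact quadratic the fit is exact; with noisy gradients the exponential weighting trades estimator variance against adaptivity to changing local curvature.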


A hybrid stochastic Newton algorithm for logistic regression

Bercu, Bernard, Fredes, Luis, Gbaguidi, Eméric

arXiv.org Machine Learning

In this paper, we investigate a second-order stochastic algorithm for solving large-scale binary classification problems. We propose to make use of a new hybrid stochastic Newton algorithm that includes two weighted components in the Hessian matrix estimation: the first coming from the natural Hessian estimate and the second associated with the stochastic gradient information. Our motivation comes from the fact that both parts, evaluated at the true parameter of the logistic regression model, are equal to the Hessian matrix. This new formulation has several advantages: it enables us to prove the almost sure convergence of our stochastic algorithm to the true parameter. Moreover, we significantly improve the almost sure rate of convergence to the Hessian matrix. Furthermore, we establish the central limit theorem for our hybrid stochastic Newton algorithm. Finally, we show a surprising result on the almost sure convergence of the cumulative excess risk.
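
For comparison, here is a plain one-pass stochastic Newton recursion for logistic regression using only the natural Hessian estimate π(1−π)xxᵀ; the paper's hybrid adds a weighted gradient-based component, which is not reproduced in this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_newton_logistic(X, y, lam=1e-2):
    """One-pass stochastic Newton for logistic regression.
    Uses only the natural per-sample Hessian pi*(1-pi)*x*x^T,
    not the paper's weighted hybrid estimator."""
    n, d = X.shape
    theta = np.zeros(d)
    S = lam * np.eye(d)  # running (regularized) sum of per-sample Hessians
    for t in range(n):
        x, yt = X[t], y[t]
        pi = sigmoid(x @ theta)
        S += pi * (1.0 - pi) * np.outer(x, x)
        g = (pi - yt) * x                      # per-sample gradient
        theta = theta - np.linalg.solve(S, g)  # Newton-type step
    return theta

# Synthetic check with a hypothetical true parameter:
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -1.0, 0.5])
X = rng.standard_normal((20000, 3))
y = (rng.random(20000) < sigmoid(X @ theta_true)).astype(float)
theta_hat = stochastic_newton_logistic(X, y)
print(np.round(theta_hat, 2))  # close to theta_true
```

Because S accumulates curvature over time, the effective step size shrinks automatically, which is what drives the almost sure convergence results for algorithms of this type.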


How Sparse Can We Prune A Deep Network: A Fundamental Limit Perspective

Neural Information Processing Systems

Network pruning is a commonly used measure to alleviate the storage and computational burden of deep neural networks. However, a characterization of the fundamental limit of network pruning is still lacking. To close this gap, in this work we take a first-principles approach: we directly impose the sparsity constraint on the loss function and leverage the framework of statistical dimension in convex geometry, which enables us to characterize the sharp phase transition point that can be regarded as the fundamental limit of the pruning ratio. Through this limit, we identify two key factors that determine the pruning ratio limit, namely weight magnitude and network sharpness.
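
As a point of reference for the pruning ratio, here is a minimal magnitude-pruning baseline; this is an illustrative heuristic only, while the paper characterizes the fundamental limit of pruning rather than any particular procedure:

```python
import numpy as np

def magnitude_prune(w, ratio):
    """Zero out the smallest-magnitude fraction `ratio` of weights.
    Illustrates the role of weight magnitude in the pruning ratio;
    not the paper's statistical-dimension analysis."""
    k = int(len(w) * ratio)
    if k == 0:
        return w.copy()
    idx = np.argsort(np.abs(w))[:k]  # indices of the k smallest magnitudes
    pruned = w.copy()
    pruned[idx] = 0.0
    return pruned

w = np.array([0.01, -2.0, 0.5, 0.003, 1.2, -0.02])
print(magnitude_prune(w, 0.5))  # the three smallest magnitudes are zeroed
```

Whether such a pruned network still attains low loss at a given ratio is exactly the question the phase-transition analysis makes precise.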