Goto

Collaborating Authors

 forster transform


Forster Decomposition and Learning Halfspaces with Noise

Neural Information Processing Systems

A Forster transform is an operation that turns a multivariate distribution into one with good anti-concentration properties. While a Forster transform does not always exist, we show that any distribution can be efficiently decomposed as a disjoint mixture of few distributions for which a Forster transform exists and can be computed efficiently. As the main application of this result, we obtain the first polynomial-time algorithm for distribution-independent PAC learning of halfspaces in the Massart noise model with strongly polynomial sample complexity, i.e., independent of the bit complexity of the examples. Previous algorithms for this learning problem incurred sample complexity scaling polynomially with the bit complexity, even though such a dependence is not information-theoretically necessary.


Review for NeurIPS paper: Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks

Neural Information Processing Systems

Additional Feedback: Line 1: Change to "Natural Gradient Descent..." Line 10, 11: "the function space" should just be "function space" Line 15: it might be worth pointing out here and/or in the intro that a special kind of data preprocessing (the "Forster transform") is required to get this result for K-FAC in general Line 16, 46: "under some assumptions"/"under specific conditions" should perhaps be replaced with "under some approximating assumptions". AFAIK the "gradient independence assumption" doesn't have any rigorous justification and might not even be true in practice. Line 69: "New insights and perspectives on the natural gradient method" also argues that the empirical Fisher is a poor substitute for the "true" one. Line 71: first quotation make is backwards Line 79: delete "firing" here Line 88: "We normalize each sample by" should be "We normalize each sample so that" Line 90: "we overview" should be "we give an overview of" Line 116: Although the use of damping in the context of NTK theory can be explained this way, damping has a larger role in second order optimization in general (where NTK theory doesn't necessarily apply). The way you are describing it though, it sounds like you are saying its use is fully explained by this theory, and I would suggest you change this.


Forster Decomposition and Learning Halfspaces with Noise

Neural Information Processing Systems

A Forster transform is an operation that turns a multivariate distribution into one with good anti-concentration properties. While a Forster transform does not always exist, we show that any distribution can be efficiently decomposed as a disjoint mixture of few distributions for which a Forster transform exists and can be computed efficiently. As the main application of this result, we obtain the first polynomial-time algorithm for distribution-independent PAC learning of halfspaces in the Massart noise model with strongly polynomial sample complexity, i.e., independent of the bit complexity of the examples. Previous algorithms for this learning problem incurred sample complexity scaling polynomially with the bit complexity, even though such a dependence is not information-theoretically necessary.


A Strongly Polynomial Algorithm for Approximate Forster Transforms and its Application to Halfspace Learning

arXiv.org Artificial Intelligence

The Forster transform is a method of regularizing a dataset X (in particular, by placing it in radial isotropic position) while maintaining some of its essential properties. Forster transforms have been an essential tool in a diverse range of settings, including functional analysis [Bar98, GGdOW17], communication complexity [For02], coding theory [DSW17], mixed determinant/volume approximation [GS02], learning theory [HM13, HKLM20, DKT21, DPT21] and the Paulsen problem in frame theory [KLLR18, HM19]. The reader is referred to [AKS20] for a more detailed discussion. Known algorithms for computing (approximate) Forster transforms [HM13, AKS20, DKT21] rely on black-box convex optimization (e.g., the ellipsoid algorithm) and consequently have weakly polynomial runtimes. Here we study the question of whether Forster transforms can be computed in strongly polynomial time. We then leverage Forster transforms for the problem of PAC learning halfspaces (both in the realizable setting and in the presence of semi-random label noise). Intuitively speaking, a Forster transform is a mapping that turns a dataset into one with good anti-concentration properties.