agnostically
Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals
Surbhi Goel, Sushrut Karmalkar, Adam Klivans
Here we consider the more realistic scenario of empirical risk minimization or learning a ReLU with noise (often referred to as agnostically learning a ReLU). We assume that a learner has access to a training set from a joint distribution $\mathcal{D}$ on $\mathbb{R}^d \times \mathbb{R}$ where the marginal distribution on $\mathbb{R}^d$ is Gaussian but the distribution on the labels can be arbitrary within $[0,1]$.
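The setting described above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' algorithm: it draws examples with standard Gaussian inputs and labels clipped to $[0,1]$, then evaluates the empirical risk of a candidate ReLU. The squared loss, the label rule `noisy_relu_labels`, and all function names are assumptions introduced here for illustration.

```python
import random


def relu(z):
    return max(0.0, z)


def sample_training_set(m, d, label_fn, rng):
    """Draw m examples with x ~ N(0, I_d) and labels clipped to [0, 1]."""
    data = []
    for _ in range(m):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = min(1.0, max(0.0, label_fn(x, rng)))
        data.append((x, y))
    return data


def empirical_risk(w, data):
    """Mean squared error of x -> relu(<w, x>) on the sample (assumed loss)."""
    total = 0.0
    for x, y in data:
        pred = relu(sum(wi * xi for wi, xi in zip(w, x)))
        total += (pred - y) ** 2
    return total / len(data)


def noisy_relu_labels(x, rng):
    # Hypothetical label rule: a ReLU of the first coordinate plus noise.
    # In the agnostic setting the labels need not come from any ReLU at all.
    return relu(x[0]) + rng.gauss(0.0, 0.1)


rng = random.Random(0)
train = sample_training_set(m=2000, d=5, label_fn=noisy_relu_labels, rng=rng)

w_good = [1.0, 0.0, 0.0, 0.0, 0.0]  # aligned with the hypothetical label rule
w_zero = [0.0] * 5                  # the constant-zero predictor

risk_good = empirical_risk(w_good, train)
risk_zero = empirical_risk(w_zero, train)
print(risk_good, risk_zero)
```

An agnostic learner is judged against the best achievable risk over all ReLUs, so comparing `risk_good` with `risk_zero` only illustrates how the objective is evaluated, not how it is minimized.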
Statistical-Query Lower Bounds via Functional Gradients
We give the first statistical-query lower bounds for agnostically learning any non-polynomial activation with respect to Gaussian marginals (e.g., ReLU, sigmoid, sign). For the specific problem of ReLU regression (equivalently, agnostically learning a ReLU), we show that any statistical-query algorithm with tolerance $n^{-(1/\epsilon)^b}$ must use at least $2^{n^c} \epsilon$ queries for some constants $b, c > 0$, where $n$ is the dimension and $\epsilon$ is the accuracy parameter. Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for real-valued learning problems. Our techniques involve a gradient boosting procedure for "amplifying" recent lower bounds due to Diakonikolas et al. (COLT 2020) and Goel et al. (ICML 2020) on the SQ dimension of functions computed by two-layer neural networks. The crucial new ingredient is the use of a nonstandard convex functional during the boosting procedure. This also yields a best-possible reduction between two commonly studied models of learning: agnostic learning and probabilistic concepts.
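To illustrate the distinction between general and correlational statistical queries, here is a minimal sketch of a simulated SQ oracle. It assumes the standard formulation in which a query is a bounded function $q(x, y)$ and the oracle returns $\mathbb{E}[q(x, y)]$ up to an additive tolerance. The names `make_sq_oracle` and `correlational`, the empirical average standing in for the true expectation, and the uniform perturbation are all assumptions introduced here, not constructions from the paper.

```python
import random


def make_sq_oracle(data, tolerance, rng):
    """Return an oracle answering E[q(x, y)] up to +/- tolerance.

    The expectation is approximated by an average over `data`. A real SQ
    oracle may answer adversarially within the tolerance; a uniform
    perturbation is used here purely for illustration.
    """
    def oracle(query):
        clipped = [max(-1.0, min(1.0, query(x, y))) for x, y in data]
        mean = sum(clipped) / len(clipped)
        return mean + rng.uniform(-tolerance, tolerance)
    return oracle


def correlational(g):
    """Wrap g(x) into a correlational query (x, y) -> y * g(x)."""
    return lambda x, y: y * g(x)


rng = random.Random(1)
data = []
for _ in range(1000):
    x = [rng.gauss(0.0, 1.0) for _ in range(3)]
    y = min(1.0, max(0.0, x[0]))  # hypothetical labels in [0, 1]
    data.append((x, y))

tolerance = 0.01
oracle = make_sq_oracle(data, tolerance, rng)


def clip_first_coordinate(x):
    return max(-1.0, min(1.0, x[0]))


# Correlational query: the label enters only through the product y * g(x).
corr_answer = oracle(correlational(clip_first_coordinate))

# General query: an arbitrary bounded function of the pair (x, y).
general_answer = oracle(lambda x, y: 1.0 if (y > 0.5 and x[0] > 0) else 0.0)

# Unperturbed empirical value of the correlational query, for comparison.
exact_corr = sum(y * clip_first_coordinate(x) for x, y in data) / len(data)
print(corr_answer, exact_corr, general_answer)
```

A lower bound against general queries is the stronger statement, since every correlational query is also a general one.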