Goto

Collaborating Authors

 optimal learning


On Optimal Learning Under Targeted Data Poisoning

Neural Information Processing Systems

Consider the task of learning a hypothesis class $\mathcal{H}$ in the presence of an adversary that can replace up to an $\eta$ fraction of the examples in the training set with arbitrary adversarial examples. The adversary aims to fail the learner on a particular target test point $x$ which is \emph{known} to the adversary but not to the learner. In this work we aim to characterize the smallest achievable error $\epsilon=\epsilon(\eta)$ by the learner in the presence of such an adversary in both realizable and agnostic settings.


Thinking Outside the Ball: Optimal Learning with Gradient Descent for Generalized Linear Stochastic Convex Optimization

Neural Information Processing Systems

We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e.~where each instantaneous loss is a scalar convex function of a linear function. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most $\varepsilon$ (compared to the best possible with unit Euclidean norm) with an optimal, up to logarithmic factors, sample complexity of $\tilde{O}(1/\varepsilon^2)$ and only $\tilde{O}(1/\varepsilon^2)$ iterations. This contrasts with general stochastic convex optimization, where $\Omega(1/\varepsilon^4)$ iterations are needed Amir et al. 2021. The lower iteration complexity is ensured by leveraging uniform convergence rather than stability. But instead of uniform convergence in a norm ball, which we show can guarantee suboptimal learning using $\Theta(1/\varepsilon^4)$ samples, we rely on uniform convergence in a distribution-dependent ball.


Optimal Learning for Multi-pass Stochastic Gradient Methods

Neural Information Processing Systems

We analyze the learning properties of the stochastic gradient method when multiple passes over the data and mini-batches are allowed. In particular, we consider the square loss and show that for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite sample bounds can be achieved by early-stopping. Moreover, we show that larger step-sizes are allowed when considering mini-batches. Our analysis is based on a unifying approach, encompassing both batch and stochastic gradient methods as special cases.


Rethinking Memorization Measures and their Implications in Large Language Models

Ghosh, Bishwamittra, Das, Soumi, Wu, Qinyuan, Khan, Mohammad Aflah, Gummadi, Krishna P., Terzi, Evimaria, Garg, Deepak

arXiv.org Artificial Intelligence

Concerned with privacy threats, memorization in LLMs is often seen as undesirable, specifically for learning. In this paper, we study whether memorization can be avoided when optimally learning a language, and whether the privacy threat posed by memorization is exaggerated or not. To this end, we re-examine existing privacy-focused measures of memorization, namely recollection-based and counterfactual memorization, along with a newly proposed contextual memorization. Relating memorization to local over-fitting during learning, contextual memorization aims to disentangle memorization from the contextual learning ability of LLMs. Informally, a string is contextually memorized if its recollection due to training exceeds the optimal contextual recollection, a learned threshold denoting the best contextual learning without training. Conceptually, contextual recollection avoids the fallacy of recollection-based memorization, where any form of high recollection is a sign of memorization. Theoretically, contextual memorization relates to counterfactual memorization, but imposes stronger conditions. Memorization measures differ in outcomes and information requirements. Experimenting on 18 LLMs from 6 families and multiple formal languages of different entropy, we show that (a) memorization measures disagree on memorization order of varying frequent strings, (b) optimal learning of a language cannot avoid partial memorization of training strings, and (c) improved learning decreases contextual and counterfactual memorization but increases recollection-based memorization. Finally, (d) we revisit existing reports of memorized strings by recollection that neither pose a privacy threat nor are contextually or counterfactually memorized.


Review for NeurIPS paper: Optimal Learning from Verified Training Data

Neural Information Processing Systems

Summary and Contributions: This paper considers a problem of linear regression where some of your inputs are strategically corrupted. In particular, the algorithm is given some vector data x, which it uses to approximate y by w.x for some learned value w. However, an adversary can corrupt the value of x to x in an attempt to make the predicted value closer to z. Given a training set of triples (x,y,z) the objective is to learn a w that does as well as possible despite these corruptions. Formally, the adversary sends the algorithm x that minimizes w.x -z 2 gamma x-x 2. In particular, they try to minimize the error between the prediction and x without making x too far from z.


Review for NeurIPS paper: Optimal Learning from Verified Training Data

Neural Information Processing Systems

This paper focuses on solving least squares regression problem in the setting where some of the inputs are strategically corrupted. The authors show that certain special setting of the general problem, there exists an algorithm that achieves a provably optimal solution. Overall, the results are interesting and target an improved problems. The main concerns are regarding the significance of the results (as the setting targeted here seems to be fairly narrow) and providing a proper context wrt previous work (this hopefully can be remedied in the revision).


Reviews: Optimal Learning for Multi-pass Stochastic Gradient Methods

Neural Information Processing Systems

This work provides a strong contribution in that it apparently is the first work to net optimal rates (up to log factors) for SGM, and moreover, it also handles a mini-batch analysis which includes the (full) batch method as a special case. Such rates previously had been established only for the (batch) ridge regression method. My interpretation of what all the results actually show is given in the Summary. I find the current solution of relying on cross-validation for adaptation to be a bit of an inelegant cop-out (even if there is a theoretically-supported method for using it); given that several of your corollaries provide a guarantee where \zeta and \gamma enter the picture only through T *, can you provide a self-monitoring method that decides when to stop? In particular, I find the most exciting results to be Corollaries 3.3 and 3.9, as only the stopping time depends on the (unknown) capacity parameters, and so such an online stopping mechanism might be possible.


On Optimal Learning Under Targeted Data Poisoning

Neural Information Processing Systems

Consider the task of learning a hypothesis class \mathcal{H} in the presence of an adversary that can replace up to an \eta fraction of the examples in the training set with arbitrary adversarial examples. The adversary aims to fail the learner on a particular target test point x which is \emph{known} to the adversary but not to the learner. In this work we aim to characterize the smallest achievable error \epsilon \epsilon(\eta) by the learner in the presence of such an adversary in both realizable and agnostic settings. Remarkably, we show that the upper bound can be attained by a deterministic learner. In the agnostic setting we reveal a more elaborate landscape: we devise a deterministic learner with a multiplicative regret guarantee of \epsilon \leq C\cdot\mathtt{OPT} O(\mathtt{VC}(\mathcal{H})\cdot \eta), where C 1 is a universal numerical constant.


Thinking Outside the Ball: Optimal Learning with Gradient Descent for Generalized Linear Stochastic Convex Optimization

Neural Information Processing Systems

We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most \varepsilon (compared to the best possible with unit Euclidean norm) with an optimal, up to logarithmic factors, sample complexity of \tilde{O}(1/\varepsilon 2) and only \tilde{O}(1/\varepsilon 2) iterations. This contrasts with general stochastic convex optimization, where \Omega(1/\varepsilon 4) iterations are needed Amir et al. 2021. The lower iteration complexity is ensured by leveraging uniform convergence rather than stability. But instead of uniform convergence in a norm ball, which we show can guarantee suboptimal learning using \Theta(1/\varepsilon 4) samples, we rely on uniform convergence in a distribution-dependent ball.


Optimal Learning from Verified Training Data

Neural Information Processing Systems

Standard machine learning algorithms typically assume that data is sampled independently from the distribution of interest. In attempts to relax this assumption, fields such as adversarial learning typically assume that data is provided by an adversary, whose sole objective is to fool a learning algorithm. However, in reality, it is often the case that data comes from self-interested agents, with less malicious goals and intentions which lie somewhere between the two settings described above. To tackle this problem, we present a Stackelberg competition model for least squares regression, in which data is provided by agents who wish to achieve specific predictions for their data. Although the resulting optimisation problem is nonconvex, we derive an algorithm which converges globally, outperforming current approaches which only guarantee convergence to local optima.