


On the Effectiveness of Lipschitz-Driven Rehearsal in Continual Learning

Neural Information Processing Systems

Rehearsal approaches enjoy immense popularity with Continual Learning (CL) practitioners. These methods collect samples from previously encountered data distributions in a small memory buffer; subsequently, they repeatedly optimize on the latter to prevent catastrophic forgetting. This work draws attention to a hidden pitfall of this widespread practice: repeated optimization on a small pool of data inevitably leads to tight and unstable decision boundaries, which are a major hindrance to generalization. To address this issue, we propose Lipschitz-DrivEn Rehearsal (LiDER), a surrogate objective that induces smoothness in the backbone network by constraining its layer-wise Lipschitz constants w.r.t.
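To make the notion of a layer-wise Lipschitz constant concrete, the following is a minimal sketch and not the LiDER objective itself, whose details are not given in this excerpt. It assumes a bias-free ReLU MLP with hypothetical random weights; for such a network, the spectral norm of each weight matrix is the Lipschitz constant of that linear layer (in the Euclidean norm), and the product of these norms upper-bounds the Lipschitz constant of the whole network.

```python
import numpy as np


def layer_lipschitz(weight):
    """Lipschitz constant of x -> weight @ x in the L2 norm (largest singular value)."""
    return float(np.linalg.norm(weight, ord=2))


def network_lipschitz_upper_bound(weights):
    """Product of per-layer spectral norms; an upper bound for a ReLU MLP,
    because ReLU is itself 1-Lipschitz."""
    bound = 1.0
    for w in weights:
        bound *= layer_lipschitz(w)
    return bound


def relu_mlp(x, weights):
    """Bias-free MLP with ReLU after every layer except the last (illustrative only)."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(w @ h, 0.0)
    return weights[-1] @ h


# Hypothetical weights, shapes (out_features, in_features).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
bound = network_lipschitz_upper_bound(weights)

# Empirical check on one random pair of inputs: the output gap never exceeds
# the bound times the input gap.
a, b = rng.normal(size=4), rng.normal(size=4)
ratio = np.linalg.norm(relu_mlp(a, weights) - relu_mlp(b, weights)) / np.linalg.norm(a - b)
print(f"empirical ratio {ratio:.3f} <= upper bound {bound:.3f}")
```

A regularizer that penalizes the per-layer values would push the network toward smoother decision boundaries; how LiDER actually does this is specified in the paper, not here.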


Appendix

Neural Information Processing Systems

We provide concrete rules below for the two competition tracks that comprise DATACOMP: filtering and BYOD. Additionally, we provide a checklist that encourages participants to specify their design decisions, allowing for more granular comparison between submissions.

A.1 Filtering track rules

Participants can enter submissions for one or more scales: small, medium, large, or xlarge, which represent the raw number of image-text pairs in COMMONPOOL that should be filtered. After choosing a scale, participants generate a list of uids, where each uid refers to a COMMONPOOL sample. The list of uids is used to recover image-text pairs from the pool, which are then used for downstream CLIP training.
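The uid-based recovery step can be sketched as a simple lookup. This is an illustration under stated assumptions, not the DATACOMP tooling: the in-memory `pool` dictionary, the uid strings, and the `recover_pairs` helper are all hypothetical, and a real pool would be stored in shards rather than in a Python dict.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical pool: uid -> (image reference, caption).
Pair = Tuple[str, str]


def recover_pairs(pool: Dict[str, Pair], uids: Iterable[str]) -> List[Pair]:
    """Return the image-text pairs named by `uids`, in order, ignoring repeated uids."""
    seen = set()
    selected: List[Pair] = []
    for uid in uids:
        if uid in seen:
            continue  # a submission lists each sample at most once in this sketch
        if uid not in pool:
            raise KeyError(f"uid {uid!r} does not refer to a sample in the pool")
        seen.add(uid)
        selected.append(pool[uid])
    return selected


pool = {
    "uid-001": ("img_001.jpg", "a dog on a beach"),
    "uid-002": ("img_002.jpg", "click here to buy"),
    "uid-003": ("img_003.jpg", "a red bicycle"),
}

# A filtering submission is just the list of uids to keep.
submission = ["uid-003", "uid-001", "uid-003"]
training_pairs = recover_pairs(pool, submission)
print(training_pairs)
```

Keeping the submission format down to a list of identifiers means the filtering logic itself can be arbitrary, while the downstream training input stays fully determined by the list.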



Can we globally optimize cross validation loss in ridge regression

Neural Information Processing Systems

Models like LASSO and ridge regression are extensively used in practice due to their interpretability, ease of use, and strong theoretical guarantees. Cross-validation (CV) is widely used for hyperparameter tuning in these models, but do practical optimization methods minimize the true out-of-sample loss? A recent line of research promises to show that the optimum of the CV loss matches the optimum of the out-of-sample loss (possibly after simple corrections). It remains to show how tractable it is to minimize the CV loss. In the present paper, we show that, in the case of ridge regression, the CV loss may fail to be quasiconvex and thus may have multiple local optima. We can guarantee that the CV loss is quasiconvex in at least one case: when the spectrum of the covariate matrix is nearly flat and the noise in the observed responses is not too high. More generally, we show that quasiconvexity status is independent of many properties of the observed data (response norm, covariate-matrix right singular vectors, and singular-value scaling) and has a complex dependence on the few that remain. We empirically confirm our theory using simulated experiments.
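One way to probe this question numerically is to evaluate the leave-one-out CV loss of ridge regression on a grid of regularization strengths and count strict local minima. The sketch below is an illustration, not the authors' experimental code: it assumes ridge without an intercept, uses the standard hat-matrix shortcut for leave-one-out residuals, and runs on hypothetical simulated data. A quasiconvex function of one variable can show at most one strict interior minimum on any grid, so two or more indicate a failure of quasiconvexity; a count of one proves nothing, since a grid can miss structure.

```python
import numpy as np


def loo_cv_loss(X, y, lam):
    """Leave-one-out CV mean squared error for ridge (no intercept), lam > 0.

    Uses the identity e_loo_i = e_i / (1 - H_ii), where
    H = X (X^T X + lam I)^{-1} X^T is the ridge hat matrix.
    """
    n, p = X.shape
    gram = X.T @ X + lam * np.eye(p)
    hat = X @ np.linalg.solve(gram, X.T)
    residuals = y - hat @ y
    loo_residuals = residuals / (1.0 - np.diag(hat))
    return float(np.mean(loo_residuals ** 2))


def count_grid_local_minima(values):
    """Number of interior grid points strictly below both neighbours."""
    v = np.asarray(values, dtype=float)
    is_min = (v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])
    return int(np.sum(is_min))


# Hypothetical simulated data.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + rng.normal(size=30)

lambdas = np.logspace(-3, 3, 200)
losses = [loo_cv_loss(X, y, lam) for lam in lambdas]
print("strict grid-local minima:", count_grid_local_minima(losses))
```

Using a log-spaced grid is a design choice: the CV loss typically changes on a multiplicative scale in the regularization strength, so a linear grid would waste most of its points at large values.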



Cycle Self-Training for Domain Adaptation

Neural Information Processing Systems

Mainstream approaches for unsupervised domain adaptation (UDA) learn domain-invariant representations to narrow the domain shift, which are empirically effective but theoretically challenged by hardness or impossibility theorems. Recently, self-training has been gaining momentum in UDA; it exploits unlabeled target data by training with target pseudo-labels. However, as corroborated in this work, under distributional shift the pseudo-labels can be unreliable, showing a large discrepancy from the target ground truth. In this paper, we propose Cycle Self-Training (CST), a principled self-training algorithm that explicitly enforces pseudo-labels to generalize across domains.


6739d8df16b5bce3587ca5f18662a6aa-Supplemental-Conference.pdf

Neural Information Processing Systems

Here we provide proofs of the statements made in the main text, as well as further figures of numerical experiments and a more detailed discussion of heteroskedasticity effects regarding causal discovery. Let $(X_i, Y_i)_{i=1,\dots,n}$ be an independent sample with Pearson correlation coefficient $\rho$, and assume the linear model $Y_i = X_i\beta + h(Z_i)\epsilon_i$, where $Z_i$ and $\epsilon_i$ are independent and standard normal, and $h$ is the noise scaling function. Z. Testing whether the Pearson correlation between X and Y is zero is equivalent to testing whether the slope parameter $\beta$ is equal to zero. Therefore, this is a homoskedastic problem.

A.1.2 Discussion of Effect 2: We start by discussing the homoskedastic case to see where non-constant variance of the noise leads to problems within the t-test.
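The equivalence between the correlation test and the slope test can be checked numerically. The sketch below uses the textbook form of both statistics with $n-2$ degrees of freedom and a simple regression that includes an intercept, which is an assumption, since the excerpt does not state it. The noise scaling function `h` and the simulated data are hypothetical choices made only for illustration.

```python
import numpy as np


def t_from_correlation(x, y):
    """t statistic for H0: rho = 0, computed from the sample Pearson correlation."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    return float(r * np.sqrt((n - 2) / (1.0 - r ** 2)))


def t_from_slope(x, y):
    """t statistic for H0: beta = 0 in OLS of y on x with an intercept."""
    n = len(x)
    xc = x - x.mean()
    yc = y - y.mean()
    sxx = xc @ xc
    slope = (xc @ yc) / sxx
    rss = np.sum((yc - slope * xc) ** 2)       # residual sum of squares
    std_err = np.sqrt(rss / (n - 2) / sxx)     # classical (homoskedastic) standard error
    return float(slope / std_err)


def h(z):
    """Hypothetical noise scaling function; any non-constant h gives heteroskedastic noise."""
    return np.abs(z)


rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=n)
eps = rng.normal(size=n)
y = 0.5 * x + h(z) * eps

print(t_from_correlation(x, y), t_from_slope(x, y))
```

The two statistics agree up to floating-point error for any data set, which is why problems that non-constant noise variance causes for the classical slope t-test carry over to the correlation test.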



Supplementary material to Generalization Error Rates in Kernel Ridge Regression: The Crossover from the Noiseless to Noisy Regime

Neural Information Processing Systems

A.1 Equations for Gaussian design

In this Appendix we discuss the derivation of eqs. Exact asymptotic formulas for the excess prediction error of least-squares and ridge regression are a classic result in high-dimensional statistics, and have been derived in many different works [23, 32, 52, 53]. In this manuscript, we follow the presentation given in [25], which is particularly well suited to our derivation and has the advantage of holding rigorously at a large but finite number of samples n and features p. We start by reviewing the formulas in [25]. Note that the risk considered in eq.