Learning and Transferring Sparse Contextual Bigrams with Linear Transformers
–Neural Information Processing Systems
We show that when trained from scratch, the training process can be split into an initial sample-intensive stage where the correlation is boosted from zero to a nontrivial value, followed by a more sample-efficient stage of further improvement. Additionally, we prove that, provided a nontrivial correlation between the downstream and pretraining tasks, finetuning from a pretrained model allows us to bypass the initial sample-intensive stage.
Feb-9-2026, 13:26:40 GMT