Goto

Collaborating Authors

 condition 1





Personalizing black-box models for nonparametric regression with minimax optimality

Li, Sai, Zhang, Linjun

arXiv.org Machine Learning

Recent advances in large-scale models, including deep neural networks and large language models, have substantially improved performance across a wide range of learning tasks. The widespread availability of such pre-trained models creates new opportunities for data-efficient statistical learning, provided they can be effectively integrated into downstream tasks. Motivated by this setting, we study few-shot personalization, where a pre-trained black-box model is adapted to a target domain using a limited number of samples. We develop a theoretical framework for few-shot personalization in nonparametric regression and propose algorithms that can incorporate a black-box pre-trained model into the regression procedure. We establish the minimax optimal rate for the personalization problem and show that the proposed method attains this rate. Our results clarify the statistical benefits of leveraging pre-trained models under sample scarcity and provide robustness guarantees when the pre-trained model is not informative. We illustrate the finite-sample performance of the methods through simulations and an application to the California housing dataset with several pre-trained models.


Why Most Optimism Bandit Algorithms Have the Same Regret Analysis: A Simple Unifying Theorem

Krishnamurthy, Vikram

arXiv.org Machine Learning

Several optimism-based stochastic bandit algorithms -- including UCB, UCB-V, linear UCB, and finite-arm GP-UCB -- achieve logarithmic regret using proofs that, despite superficial differences, follow essentially the same structure. This note isolates the minimal ingredients behind these analyses: a single high-probability concentration condition on the estimators, after which logarithmic regret follows from two short deterministic lemmas describing radius collapse and optimism-forced deviations. The framework yields unified, near-minimal proofs for these classical algorithms and extends naturally to many contemporary bandit variants.


Towards Understanding Generalization in DP-GD: A Case Study in Training Two-Layer CNNs

Shi, Zhongjie, Wang, Puyu, Zhang, Chenyang, Cao, Yuan

arXiv.org Machine Learning

Modern deep learning techniques focus on extracting intricate information from data to achieve accurate predictions. However, the training datasets may be crowdsourced and include sensitive information, such as personal contact details, financial data, and medical records. As a result, there is a growing emphasis on developing privacy-preserving training algorithms for neural networks that maintain good performance while preserving privacy. In this paper, we investigate the generalization and privacy performances of the differentially private gradient descent (DP-GD) algorithm, which is a private variant of the gradient descent (GD) by incorporating additional noise into the gradients during each iteration. Moreover, we identify a concrete learning task where DP-GD can achieve superior generalization performance compared to GD in training two-layer Huberized ReLU convolutional neural networks (CNNs). Specifically, we demonstrate that, under mild conditions, a small signal-to-noise ratio can result in GD producing training models with poor test accuracy, whereas DP-GD can yield training models with good test accuracy and privacy guarantees if the signal-to-noise ratio is not too small. This indicates that DP-GD has the potential to enhance model performance while ensuring privacy protection in certain learning tasks. Numerical simulations are further conducted to support our theoretical results.


Filtering with Self-Attention and Storing with MLP: One-Layer Transformers Can Provably Acquire and Extract Knowledge

Xu, Ruichen, Chen, Kexin

arXiv.org Artificial Intelligence

Modern large language models (LLMs) demonstrate exceptional performance on knowledge-intensive tasks, yet the theoretical mechanisms underlying knowledge acquisition (storage and memorization) during pre-training and extraction (retrieval and recall) during inference after fine-tuning remain poorly understood. Although prior theoretical studies have explored these processes through analyses of training dynamics, they overlook critical components essential for a comprehensive theory: (1) the multi-layer perceptron (MLP), empirically identified as the primary module for knowledge storage; (2) out-of-distribution (OOD) adaptivity, which enables LLMs to generalize to unseen scenarios post-pre-training; and (3) next-token prediction, the standard autoregressive objective that encodes knowledge as conditional probabilities. In this work, we introduce, to the best of our knowledge, the first theoretical framework that addresses these limitations by examining the training dynamics of one-layer transformers. Under regularity assumptions, we establish that: (i) transformers attain near-optimal training loss during pre-training, demonstrating effective knowledge acquisition; (ii) given a sufficiently large fine-tuning dataset and appropriate data multiplicity conditions, transformers achieve low generalization error on factual knowledge acquired during pre-training but not revisited in fine-tuning, indicating robust knowledge extraction; and (iii) violation of these conditions leads to elevated generalization error, manifesting as hallucinations. Our analysis encompasses both full fine-tuning and low-rank fine-tuning, yielding insights into the efficacy of practical low-rank adaptation methods. We validate our theoretical findings through experiments on synthetic datasets and the real-world PopQA benchmark, employing GPT-2 and Llama-3.2-1B models.


Hierarchical Linkage Clustering Beyond Binary Trees and Ultrametrics

Dreveton, Maximilien, Grossglauser, Matthias, Kuroda, Daichi, Thiran, Patrick

arXiv.org Machine Learning

Hierarchical clustering seeks to uncover nested structures in data by constructing a tree of clusters, where deeper levels reveal finer-grained relationships. Traditional methods, including linkage approaches, face three major limitations: (i) they always return a hierarchy, even if none exists, (ii) they are restricted to binary trees, even if the true hierarchy is non-binary, and (iii) they are highly sensitive to the choice of linkage function. In this paper, we address these issues by introducing the notion of a valid hierarchy and defining a partial order over the set of valid hierarchies. We prove the existence of a finest valid hierarchy, that is, the hierarchy that encodes the maximum information consistent with the similarity structure of the data set. In particular, the finest valid hierarchy is not constrained to binary structures and, when no hierarchical relationships exist, collapses to a star tree. We propose a simple two-step algorithm that first constructs a binary tree via a linkage method and then prunes it to enforce validity. We establish necessary and sufficient conditions on the linkage function under which this procedure exactly recovers the finest valid hierarchy, and we show that all linkage functions satisfying these conditions yield the same hierarchy after pruning. Notably, classical linkage rules such as single, complete, and average satisfy these conditions, whereas Ward's linkage fails to do so.


Active Learning from Imperfect Labelers

Songbai Yan, Kamalika Chaudhuri, Tara Javidi

Neural Information Processing Systems

We propose an algorithm which utilizes abstention responses, and analyze its statistical consistency and query complexity under fairly natural assumptions on the noise and abstention rate of the labeler.