 Regression


A Faster Training Algorithm for Regression Trees with Linear Leaves, and an Analysis of its Complexity

Neural Information Processing Systems

We consider the Tree Alternating Optimization (TAO) algorithm to train regression trees with linear predictors in the leaves. Unlike traditional greedy recursive partitioning algorithms such as CART, TAO guarantees a monotonic decrease of the objective function and results in smaller trees of much better accuracy. We modify the TAO algorithm so that it produces exactly the same result but is much faster, particularly for high input dimensionality or deep trees. The idea is based on the fact that, at each iteration of TAO, each leaf receives only a subset of the training instances. Thus, the optimization of the leaf model can be done exactly but faster by using the Sherman-Morrison-Woodbury formula. This has the unexpected advantage that, once a tree exceeds a critical depth, making it deeper makes it faster to train, even though the tree is larger and has more parameters. Indeed, this can make learning a nonlinear model (the tree) asymptotically faster than a regular linear regression model. We analyze the corresponding computational complexity and verify the speedups experimentally on various datasets. The argument can be applied to other types of trees whenever the optimization of a node takes time superlinear in the number of instances.
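The speedup described in the abstract above rests on a standard identity: when a leaf holds fewer instances than input dimensions, its linear model can be fit by solving a system whose size is the number of instances rather than the dimension. The following is a minimal sketch of that identity only, not the paper's algorithm. It assumes each leaf solves a ridge-regularized least-squares problem; the function names and the regularization weight `lam` are hypothetical.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Solve (X^T X + lam*I) w = X^T y, a d-by-d linear system."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_woodbury(X, y, lam):
    """Same solution via w = X^T (X X^T + lam*I)^{-1} y, an n-by-n system.

    This form follows from the Sherman-Morrison-Woodbury identity and is
    cheaper whenever the leaf holds fewer instances (n) than features (d).
    """
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# Toy leaf: 20 instances, 500 features (assumed sizes, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
y = rng.normal(size=20)

w_primal = ridge_primal(X, y, lam=1.0)
w_small = ridge_woodbury(X, y, lam=1.0)
max_abs_diff = float(np.max(np.abs(w_primal - w_small)))
```

Both functions return the same weights up to floating-point error; only the size of the linear system differs (500 by 500 versus 20 by 20 in this toy case).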


Variational Task Vector Composition

Neural Information Processing Systems

Task vectors capture how a model changes during fine-tuning by recording the difference between pre-trained and task-specific weights. The composition of task vectors, a key operator in task arithmetic, enables models to integrate knowledge from multiple tasks without incurring significant additional inference costs. In this paper, we propose variational task vector composition (VTVC), where composition coefficients are taken as latent variables and estimated in a Bayesian inference framework. Unlike previous methods that operate at the task level, our framework focuses on sample-specific composition. Motivated by the observation of structural redundancy in task vectors, we introduce a Spike-and-Slab prior that promotes sparsity and aims to preserve the most informative components. To further address the high variance and sampling inefficiency in sparse, high-dimensional spaces, we develop a gated sampling mechanism that constructs a controllable posterior by filtering the composition coefficients based on both uncertainty and importance. This yields a more stable and interpretable variational framework by deterministically selecting reliable task components, reducing sampling variance while improving transparency and generalization. Experimental results demonstrate that our method achieves state-of-the-art average performance across a diverse range of benchmarks, including image classification and natural language understanding.
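For readers unfamiliar with task arithmetic, the basic composition step that the abstract above builds on can be sketched as follows. This shows only plain composition with fixed scalar coefficients; the variational, sample-specific coefficients, the Spike-and-Slab prior, and the gated sampling are not reproduced. The helper names and the toy weight dictionaries are hypothetical.

```python
import numpy as np

def task_vector(pretrained, finetuned):
    """Task vector: fine-tuned weights minus pre-trained weights, per parameter."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}

def compose(pretrained, task_vectors, coeffs):
    """Return pretrained + sum_i coeffs[i] * task_vectors[i]."""
    merged = {name: value.copy() for name, value in pretrained.items()}
    for tau, alpha in zip(task_vectors, coeffs):
        for name in merged:
            merged[name] += alpha * tau[name]
    return merged

# Toy "models" with a single weight tensor each (illustrative values only).
theta_pre = {"layer.weight": np.array([1.0, 2.0, 3.0])}
theta_a = {"layer.weight": np.array([1.5, 2.0, 2.0])}
theta_b = {"layer.weight": np.array([1.0, 3.0, 3.0])}

tau_a = task_vector(theta_pre, theta_a)
tau_b = task_vector(theta_pre, theta_b)
merged = compose(theta_pre, [tau_a, tau_b], coeffs=[1.0, 0.5])
```

Setting the coefficients to `[1.0, 0.0]` recovers the first fine-tuned model exactly, which is a quick sanity check on any composition routine.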


On the Mechanisms of Weak-to-Strong Generalization: A Theoretical Perspective

Neural Information Processing Systems

Weak-to-strong generalization, where a student model trained on imperfect labels generated by a weaker teacher nonetheless surpasses that teacher, has been widely observed, but the mechanisms that enable it have remained poorly understood. In this paper, through a theoretical analysis of simple models, we uncover three core mechanisms that can drive this phenomenon. First, by analyzing ridge linear regression, we study the interplay between the teacher and student regularization parameters and prove that a student can compensate for a teacher's under-regularization and achieve lower test error. We also analyze the role of the parameterization regime of the models and show that qualitatively different phenomena can happen in different regimes. Second, by analyzing weighted ridge linear regression, we show that a student model with a regularization structure better aligned to the target function can outperform its teacher. Third, in a nonlinear multi-index learning setting, we demonstrate that a student can learn easy, task-specific features from the teacher while leveraging its own broader pre-training to learn hard-to-learn features that the teacher cannot capture.


Quasi-Self-Concordant Optimization with Lewis Weights

Neural Information Processing Systems

In this paper, we study the problem $\min_{x \in \mathbb{R}^d,\, Nx = v} \sum_{i=1}^{n} f((Ax - b)_i)$ for a quasi-self-concordant function $f: \mathbb{R} \to \mathbb{R}$, where $A, N$ are $n \times d$ and $m \times d$ matrices, and $b, v$ are vectors of length $n$ and $m$ with $n \ge d$. We show an algorithm based on a trust-region method with an oracle that can be implemented using $\tilde{O}(d^{1/3})$ linear system solves, improving the $\tilde{O}(n^{1/3})$ oracle by [Adil-Bullins-Sachdeva, NeurIPS 2021]. Our implementation of the oracle relies on solving the overdetermined $\ell_\infty$ regression problem $\min_{x \in \mathbb{R}^d,\, Nx = v} \|Ax - b\|_\infty$. We provide an algorithm that finds a $(1+\varepsilon)$-approximate solution to this problem using $O((d^{1/3}/\varepsilon + 1/\varepsilon^2) \log(n/\varepsilon))$ linear system solves. This algorithm leverages $\ell_\infty$ Lewis weight overestimates and achieves this iteration complexity via a simple lightweight IRLS approach, inspired by the work of [Ene-Vladu, ICML 2019]. Experimentally, we demonstrate that our algorithm significantly improves the runtime of the standard CVX solver.


How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs

Neural Information Processing Systems

Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), plain data models (e.g., linear regression with isotropic inputs), and single-source training--limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the theory of Gaussian universality and orthogonal polynomials. This equivalence reveals that nonlinear MLPs meaningfully enhance ICL performance--particularly on nonlinear tasks--compared to linear baselines.


Sample-efficient Learning of Concepts with Theoretical Guarantees: from Data to Concepts without Interventions

Neural Information Processing Systems

Machine learning is a vital part of many real-world systems, but concerns remain about the lack of interpretability, explainability and robustness of black-box AI systems. Concept Bottleneck Models (CBM) address some of these challenges by learning interpretable concepts from high-dimensional data, e.g.



Pessimistic Data Integration for Policy Evaluation

Neural Information Processing Systems

This paper studies how to integrate historical control data with experimental data to enhance A/B testing, while addressing the distributional shift between historical and experimental datasets. We propose a pessimistic data integration method that combines two causal effect estimators constructed based on experimental and historical datasets. Our main idea is to conceptualize the weight function for this combination as a policy so that existing pessimistic policy learning algorithms are applicable to learn the optimal weight that minimizes the resulting weighted estimator's mean squared error. Additionally, we conduct comprehensive theoretical and empirical analyses to compare our method against various baseline estimators across five scenarios. Both our theoretical and numerical findings demonstrate that the proposed estimator achieves near-optimal performance across all scenarios.
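To make the weighting idea in the abstract above concrete, here is a minimal sketch of the textbook oracle case, not the paper's pessimistic learning procedure. It assumes the experimental estimator is unbiased, the historical estimator has a known bias from distributional shift, and the two are independent; the function names and numeric values are hypothetical.

```python
def combined_mse(w, var_exp, var_hist, bias_hist):
    """MSE of w * experimental + (1 - w) * historical.

    Assumes the experimental estimator is unbiased and the two estimators
    are independent, so the cross term in the expansion vanishes.
    """
    return w ** 2 * var_exp + (1.0 - w) ** 2 * (var_hist + bias_hist ** 2)

def oracle_weight(var_exp, var_hist, bias_hist):
    """Weight on the experimental estimator that minimizes combined_mse."""
    mse_hist = var_hist + bias_hist ** 2
    return mse_hist / (var_exp + mse_hist)

# Illustrative values: noisy experiment, precise but biased history.
w_star = oracle_weight(var_exp=1.0, var_hist=0.25, bias_hist=0.5)
mse_star = combined_mse(w_star, 1.0, 0.25, 0.5)
mse_exp_only = combined_mse(1.0, 1.0, 0.25, 0.5)
mse_hist_only = combined_mse(0.0, 1.0, 0.25, 0.5)
```

In practice the bias of the historical estimator is unknown, which is the gap a learned, pessimistic weight is meant to address.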


Robust Estimation Under Heterogeneous Corruption Rates

Syomantak Chaudhuri (University of California, Berkeley), Jerry Li (University of Washington), Thomas A. Courtade (University of California, Berkeley)

Neural Information Processing Systems

We study the problem of robust estimation under heterogeneous corruption rates, where each sample may be independently corrupted with a known but non-identical probability. This setting arises naturally in distributed and federated learning, crowdsourcing, and sensor networks, yet existing robust estimators typically assume uniform or worst-case corruption, ignoring structural heterogeneity. For mean estimation for multivariate bounded distributions and univariate Gaussian distributions, we give tight minimax rates for all heterogeneous corruption patterns. For multivariate Gaussian mean estimation and linear regression, we establish the minimax rate for squared error up to a factor of d, where d is the dimension. Roughly, our findings suggest that samples beyond a certain corruption threshold may be discarded by the optimal estimators; this threshold is determined by the empirical distribution of the given corruption rates.


Large Stepsizes Accelerate Gradient Descent for Regularized Logistic Regression

Neural Information Processing Systems

We study gradient descent (GD) with a constant stepsize for $\ell_2$-regularized logistic regression with linearly separable data. Classical theory suggests small stepsizes to ensure monotonic reduction of the optimization objective, achieving exponential convergence in $\tilde{O}(\kappa)$ steps with $\kappa$ being the condition number. Surprisingly, we show that this can be accelerated to $\tilde{O}(\sqrt{\kappa})$ by simply using a large stepsize, for which the objective evolves nonmonotonically. The acceleration brought by large stepsizes extends to minimizing the population risk for separable distributions, improving on the best-known upper bounds on the number of steps to reach a near-optimum. Finally, we characterize the largest stepsize for the local convergence of GD, which also determines the global convergence in special scenarios. Our results extend the analysis of Wu et al. (2024) from convex settings with minimizers at infinity to strongly convex cases with finite minimizers.
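The setting in the abstract above can be reproduced in a few lines for experimentation. The following is a minimal sketch of constant-stepsize GD on an $\ell_2$-regularized logistic loss with separable toy data. The data, the regularization weight `lam`, and both stepsizes are assumptions chosen for illustration, and the sketch makes no claim about reproducing the paper's rates.

```python
import numpy as np

def objective(w, X, y, lam):
    """Mean logistic loss plus (lam / 2) * ||w||^2, with labels y in {-1, +1}."""
    margins = y * (X @ w)
    return float(np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * (w @ w))

def gradient(w, X, y, lam):
    margins = y * (X @ w)
    # sigmoid(-z) = 1 / (1 + exp(z)), written with tanh for numerical stability.
    sig_neg = 0.5 * (1.0 - np.tanh(0.5 * margins))
    return -(X.T @ (y * sig_neg)) / len(y) + lam * w

def gradient_descent(X, y, lam, stepsize, steps):
    """Constant-stepsize GD from w = 0; returns final w and objective history."""
    w = np.zeros(X.shape[1])
    history = [objective(w, X, y, lam)]
    for _ in range(steps):
        w = w - stepsize * gradient(w, X, y, lam)
        history.append(objective(w, X, y, lam))
    return w, history

# Linearly separable toy data: labels come from a ground-truth direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.where(X @ np.array([1.0, -1.0]) >= 0.0, 1.0, -1.0)

_, hist_small = gradient_descent(X, y, lam=1e-3, stepsize=0.1, steps=200)
_, hist_large = gradient_descent(X, y, lam=1e-3, stepsize=50.0, steps=200)
```

Comparing `hist_small` and `hist_large` step by step shows whether the objective decreases monotonically under each stepsize on this toy problem.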