Coupled Training with Privileged Information and Unlabeled Data

arXiv.org Machine Learning

In many prediction problems, we have extra information during training (for example, measurements that are expensive or slow to collect) that will not be available when the model is deployed. A common strategy is to first train a model that uses all training information, then use its predictions on unlabeled examples to train a second model that only uses the inputs available at test time. However, when the extra training-only information is weak or noisy, this Two-Stage approach can mislead the deployment model and even hurt accuracy. We propose a joint training method that learns the two models together, so the deployment model can benefit from the extra information only when it actually helps, instead of inheriting its mistakes. We provide guarantees that describe when joint training improves prediction accuracy and analyze a simple alternating training algorithm for large, high-dimensional models. Experiments on synthetic data and real-world prediction tasks show that our approach avoids these failures and robustly outperforms standard Two-Stage baselines.
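
The two-stage baseline described in this abstract can be sketched as follows. This is a minimal illustration on synthetic linear data, not the paper's proposed joint method; the names `X` (deployment-time inputs), `Z` (training-only privileged inputs), `teacher`, and `student`, along with the data-generating model, are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 3, 2                                # dims of regular and privileged inputs (assumed)
w_true = np.array([1.0, -2.0, 0.5])        # hypothetical ground-truth weights on X
v_true = np.array([0.7, -0.3])             # hypothetical ground-truth weights on Z

def make_inputs(n):
    """Draw n synthetic (X, Z) pairs with independent standard normal entries."""
    return rng.standard_normal((n, d)), rng.standard_normal((n, p))

# Small labeled set with privileged information, larger unlabeled set.
X_lab, Z_lab = make_inputs(50)
y_lab = X_lab @ w_true + Z_lab @ v_true    # noiseless labels, for simplicity
X_unl, Z_unl = make_inputs(2000)

# Stage 1: fit a model that uses all training information (X and Z).
teacher, *_ = np.linalg.lstsq(np.hstack([X_lab, Z_lab]), y_lab, rcond=None)

# Stage 2: pseudo-label the unlabeled examples with the stage-1 model, then fit
# a deployment model that only sees X.
pseudo = np.hstack([X_unl, Z_unl]) @ teacher
student, *_ = np.linalg.lstsq(X_unl, pseudo, rcond=None)
```

In this idealized setting the privileged inputs are informative, so the pseudo-labels help. The failure mode the abstract describes arises when the stage-1 model is unreliable, because stage 2 has no mechanism to discount its mistakes.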


Axiomatizing Neural Networks via Pursuit of Subspaces

arXiv.org Machine Learning

While deep neural networks have achieved remarkable success across a wide range of domains, their underlying mechanisms remain poorly understood, and they are often regarded as black boxes. This gap between empirical performance and theoretical understanding poses a challenge analogous to the pre-axiomatic stage of classical geometry. In this work, we introduce the Pursuit of Subspaces (PoS) hypothesis, an axiomatic framework that formulates neural network behavior through a set of geometric postulates. These axioms, together with their derived consequences, provide a unified perspective on representation, computation, and generalization in both shallow and deep architectures. We show that this framework yields geometric explanations for fundamental questions in deep learning, including representation structure, architectural mechanisms, and generalization behavior, offering a principled step toward a coherent theoretical foundation.


On the Limits of Latent Reuse in Diffusion Models

arXiv.org Machine Learning

Diffusion models are often trained in low-dimensional latent spaces, which are then reused for related but shifted datasets. In this work, we study when such latent reuse remains reliable under distribution shift. We consider a source-target setting in which both datasets are approximately low-dimensional but may lie near different subspaces. We show that freezing and reusing a source latent space induces a target-domain score error governed by two quantities: the principal-angle misalignment between the source and target subspaces, and the target ambient noise amplified by the diffusion time scale. Motivated by these limits, we further study mixed source-target training and characterize how the required shared latent dimension depends on the relative geometry of the two distributions. Our results provide theoretical guidance on when latent reuse is reliable and when learning a shared representation may be necessary.
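
The principal-angle misalignment mentioned above can be computed numerically. The sketch below uses the standard QR-plus-SVD recipe for principal angles between two subspaces; the function name `principal_angles` and the toy source and target bases are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)                # orthonormal basis for col(A)
    Qb, _ = np.linalg.qr(B)                # orthonormal basis for col(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# Toy example in R^3: the target subspace is the source subspace tilted by 0.3 rad.
t = 0.3
source = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])
target = np.array([[1.0, 0.0],
                   [0.0, np.cos(t)],
                   [0.0, np.sin(t)]])
angles = principal_angles(source, target)  # one shared direction, one tilted by t
```

An angle of zero corresponds to a direction shared by both subspaces. Larger angles indicate directions in the target that a frozen source latent space can represent only partially.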


Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions

arXiv.org Machine Learning

This paper presents a parametric solution to piecewise linear regression through the Adaptive Block Gradient Descent (ABGD) algorithm. The heart of the method is the parametrization of piecewise linear functions as the difference of max-affine (DoMA) functions. A non-asymptotic local convergence analysis for ABGD is provided under sub-Gaussian covariate and noise distributions. To initialize ABGD, we adapt a prior algorithm originally developed for the simpler setting of max-affine functions. When suitably initialized, ABGD converges linearly to an $\epsilon$-accurate estimate given $\tilde{\mathcal{O}}(d\max(\sigma_z/\epsilon,1)^2)$ observations, where $\sigma_z^2$ denotes the noise variance. This implies exact recovery given $\tilde{\mathcal{O}}(d)$ samples in the noiseless case. This rate is also shown to be minimax optimal up to logarithmic factors. Synthetic numerical results corroborate the theoretical guarantees for ABGD. We also observe competitive performance compared to state-of-the-art methods on real-world datasets.
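
To make the DoMA parametrization concrete, the sketch below evaluates a difference of two max-affine functions. It illustrates only the function class, not the ABGD algorithm or its initialization; the function names and the one-dimensional example are assumptions chosen for illustration.

```python
import numpy as np

def max_affine(X, W, b):
    """Evaluate max_j (w_j^T x + b_j) for each row x of X; W is (k, d), b is (k,)."""
    return (X @ W.T + b).max(axis=1)

def doma(X, W_pos, b_pos, W_neg, b_neg):
    """Difference of two max-affine functions, evaluated row-wise on X."""
    return max_affine(X, W_pos, b_pos) - max_affine(X, W_neg, b_neg)

# 1-D example: max(x, 0) - max(x - 1, 0) equals x clipped to [0, 1], which is
# piecewise linear but neither convex nor concave.
W_pos, b_pos = np.array([[1.0], [0.0]]), np.array([0.0, 0.0])
W_neg, b_neg = np.array([[1.0], [0.0]]), np.array([-1.0, 0.0])
X = np.array([[-1.0], [0.5], [2.0]])
values = doma(X, W_pos, b_pos, W_neg, b_neg)
```

Each max-affine term is convex, so a single one can only represent convex piecewise linear functions. Subtracting a second term is what allows non-convex targets such as the clipped ramp in the example.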


Fair Graph Distillation

Neural Information Processing Systems

As graph neural networks (GNNs) struggle with large-scale graphs due to high computational demands, graph data distillation promises to alleviate this issue by distilling a large real graph into a smaller distilled graph while maintaining comparable prediction performance for GNNs trained on both graphs. However, we observe that GNNs trained on distilled graphs may exhibit more severe group fairness issues than GNNs trained on real graphs, under both vanilla and fair GNN training. Motivated by these observations, we propose fair graph distillation (FGD), a graph distillation approach that generates fair distilled graphs. The challenge lies in the lack of sensitive attributes for nodes in the distilled graph, which makes most debiasing methods (e.g., regularization and adversarial debiasing) intractable for distilled graphs. We develop a simple yet effective bias metric, named coherence, for distilled graphs. Based on the proposed coherence metric, we introduce a framework for fair graph distillation using a bi-level optimization algorithm. Extensive experiments demonstrate that the proposed algorithm can achieve better trade-offs between prediction performance and fairness across various datasets and GNN architectures.





One for All: Simultaneous Metric and Preference Learning over Multiple Users

Neural Information Processing Systems

This paper investigates simultaneous preference and metric learning from a crowd of respondents. We are given a set of items represented by d-dimensional feature vectors, together with paired comparisons of the form "item i is preferable to item j" made by each user. Our model jointly learns a distance metric that characterizes the crowd's general measure of item similarities along with a latent ideal point for each user reflecting their individual preferences. This model has the flexibility to capture individual preferences, while enjoying a metric learning sample cost that is amortized over the crowd. We first study this problem in a noiseless, continuous response setting (i.e., responses equal to differences of item distances) to understand the fundamental limits of learning. Next, we establish prediction error guarantees for noisy, binary measurements such as may be collected from human respondents, and show how the sample complexity improves when the underlying metric is low-rank. Finally, we establish recovery guarantees under assumptions on the response distribution. We demonstrate the performance of our model on both simulated data and on a dataset of color preference judgments across a large number of users.
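
The noiseless response model described above can be illustrated with a short sketch. It assumes squared Mahalanobis distances and the sign convention that a negative response means item i is preferred; both are assumptions for this example, as are the function names and the toy items.

```python
import numpy as np

def mahalanobis_sq(x, u, M):
    """Squared distance between item x and ideal point u under the metric M."""
    diff = x - u
    return float(diff @ M @ diff)

def noiseless_response(x_i, x_j, u, M):
    """Difference of item distances to the user's ideal point.

    Negative means item i is closer to u, i.e. preferred (assumed convention).
    """
    return mahalanobis_sq(x_i, u, M) - mahalanobis_sq(x_j, u, M)

u = np.array([0.0, 0.0])                   # hypothetical ideal point of one user
x_i = np.array([0.0, 1.0])
x_j = np.array([1.5, 0.0])

# Under the Euclidean metric, item i is closer to u.
r_euclid = noiseless_response(x_i, x_j, u, np.eye(2))
# A metric that weights the second feature more heavily flips the preference.
r_weighted = noiseless_response(x_i, x_j, u, np.diag([1.0, 4.0]))
```

The metric `M` is shared across users while `u` is specific to each user, which is why the cost of learning the metric can be amortized over the crowd.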


Invariance . the Initialized

Neural Information Processing Systems

In this paper, we analyze neural networks trained on high-dimensional data that lies on a low-dimensional linear subspace denoted by $P$. We assume that the dimension of $P$ is $d - \ell$. Throughout the paper it will be more convenient to analyze data which lies on the subspace $M = \mathrm{span}(\{e_1, \dots, e_{d-\ell}\})$, because then the "off manifold" directions correspond exactly to certain coordinates of the input. In this section we show that we can essentially analyze the data as if it is rotated to lie on $M$, and it would imply the same consequences as the original data from $P$.

Theorem A.1. Let $P \subseteq \mathbb{R}^d$ be a subspace of dimension $d - \ell$, and let $M = \mathrm{span}\{e_1, \dots, e_{d-\ell}\}$. Let $R$ be an orthogonal matrix such that $R \cdot P = M$, let $X \subseteq P$ be a training dataset and let $X_R = \{R \cdot x : x \in X\}$.
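
One way to construct an orthogonal matrix of the kind used in this setup is through a complete QR factorization of a basis for the subspace. The sketch below is an illustration only: the helper name `rotation_to_coordinate_subspace`, the symbol `k` for the subspace dimension, and the random data are assumptions, and the code does not reproduce any result from the paper.

```python
import numpy as np

def rotation_to_coordinate_subspace(B):
    """Return an orthogonal R that maps the column space of B onto span{e_1..e_k}.

    B is a (d, k) basis matrix, assumed to have full column rank.
    """
    # Complete QR: the first k columns of Q span col(B), and the remaining
    # d - k columns span its orthogonal complement.
    Q, _ = np.linalg.qr(B, mode="complete")
    return Q.T

rng = np.random.default_rng(0)
d, k = 5, 3                                # ambient and subspace dimensions (assumed)
B = rng.standard_normal((d, k))            # basis of a random subspace P
X = rng.standard_normal((10, k)) @ B.T     # 10 data points lying in P (one per row)

R = rotation_to_coordinate_subspace(B)
X_R = X @ R.T                              # rows are R x for each data point x
```

After the rotation, the last `d - k` coordinates of every point are zero up to floating-point error, so the off-manifold directions coincide with coordinate axes.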