 distinguishability


On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts

Neural Information Processing Systems

The softmax-contaminated mixture of experts (MoE) model is deployed when a large-scale pre-trained model, which plays the role of a fixed expert, is fine-tuned for learning downstream tasks by including a new contamination part, or prompt, functioning as a new, trainable expert. Despite its popularity and relevance, the theoretical properties of the softmax-contaminated MoE have remained unexplored in the literature. In this paper, we study the convergence rates of the maximum likelihood estimator of gating and prompt parameters in order to gain insights into the statistical properties and potential challenges of fine-tuning with a new prompt. We find that the estimability of these parameters is compromised when the prompt acquires overlapping knowledge with the pre-trained model, in a sense that we make precise by formulating a novel analytic notion of distinguishability. Under distinguishability of the pre-trained and prompt models, we derive minimax optimal estimation rates for all the gating and prompt parameters. By contrast, when the distinguishability condition is violated, these estimation rates become significantly slower due to their dependence on the rate at which the prompt converges to the pre-trained model. Finally, we empirically corroborate our theoretical findings through several numerical experiments.
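To make the role of distinguishability concrete, here is a minimal sketch of a two-expert contaminated mixture, not the paper's exact formulation. It assumes Gaussian experts and a sigmoid gate (a two-way softmax with one logit fixed at zero). All names (`contaminated_moe_density`, `beta`, `tau`, `pretrained`, `prompt`) are hypothetical and chosen for illustration only.

```python
import math

def gaussian_pdf(y, mean, var):
    """Density of a univariate Gaussian N(mean, var) at y."""
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def contaminated_moe_density(y, x, beta, tau, pretrained, prompt):
    """Conditional density p(y | x) of a two-expert mixture (illustrative).

    pretrained, prompt: functions mapping x to (mean, var) of a Gaussian expert.
    beta, tau: gating parameters (assumed form: logit = beta . x + tau).
    """
    logit = sum(b * xi for b, xi in zip(beta, x)) + tau
    w_prompt = 1.0 / (1.0 + math.exp(-logit))  # gate weight on the prompt expert
    m0, v0 = pretrained(x)
    m1, v1 = prompt(x)
    return (1.0 - w_prompt) * gaussian_pdf(y, m0, v0) + w_prompt * gaussian_pdf(y, m1, v1)

# Hypothetical fixed expert: mean 2*x, unit variance.
pre = lambda x: (2.0 * x[0], 1.0)

# If the prompt coincides with the pre-trained expert, the gate has no effect
# on the density, so very different gating parameters give the same p(y | x).
d1 = contaminated_moe_density(0.3, [1.0], [0.5], 0.1, pre, pre)
d2 = contaminated_moe_density(0.3, [1.0], [-3.0], 2.0, pre, pre)
```

In this toy setting `d1` and `d2` agree, which illustrates why the gating parameters become hard to estimate as the prompt approaches the pre-trained model.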


Observable Geometry of Singular Statistical Models

arXiv.org Machine Learning

Singular statistical models arise whenever different parameter values induce the same distribution, leading to non-identifiability and a breakdown of classical asymptotic theory. While existing approaches analyze these phenomena in parameter space, the resulting descriptions depend heavily on parameterization and obscure the intrinsic statistical structure of the model. In this paper, we introduce an invariant framework based on \emph{observable charts}: collections of functionals of the data distribution that distinguish probability measures. These charts define local coordinate systems directly on the model space, independent of parameterization. We formalize \emph{observable completeness} as the ability of such charts to detect identifiable directions, and introduce \emph{observable order} to quantify higher-order distinguishability along analytic perturbations. Our main result establishes that, under mild regularity conditions, observable order provides a lower bound on the rate at which Kullback-Leibler divergence vanishes along analytic paths. This connects intrinsic geometric structure in model space to statistical distinguishability and recovers classical behavior in regular models while extending naturally to singular settings. We illustrate the framework in reduced-rank regression and Gaussian mixture models, where observable coordinates reveal both identifiable structure and singular degeneracies. These results suggest that observable charts provide a unified and parameterization-invariant language for studying singular models and offer a pathway toward intrinsic formulations of invariants such as learning coefficients.



3b54ff26ae928fb2f111198c75f6a7e3-Paper-Conference.pdf

Neural Information Processing Systems

An alternative approach, Generative Adversarial Networks (GANs), has become popular across several domains, particularly Computer Vision, owing to breakthrough realism in the images they output [e.g., 19, 65]. This is the case in NLP where, unlike computer vision, a measure of likelihood called perplexity has been the prevailing metric for training and evaluating language models for decades.


Are GANs overkill for NLP?

Neural Information Processing Systems

This work offers a novel theoretical perspective on why, despite numerous attempts, adversarial approaches to generative modeling (e.g., GANs) have not been as successful for certain generation tasks, particularly sequential tasks such as Natural Language Generation, as they have in others, such as Computer Vision. In particular, on sequential data such as text, maximum-likelihood approaches are significantly more utilized than GANs. We show that, while it may seem that maximizing likelihood is inherently different than minimizing distinguishability, this distinction is largely an artifact of the limited representational capacity of the model family, for a wide class of adversarial objectives. We give a theoretical model in which minimizing KL-divergence (i.e., maximizing likelihood) is a more efficient approach to effectively minimizing the same distinguishability criteria that adversarial models seek to optimize. Reductions show that minimizing distinguishability can be seen as simply boosting likelihood for certain families of models including n-gram models and neural networks with a softmax output layer. To achieve a full polynomial-time reduction, a novel next-token distinguishability model is considered. Some preliminary empirical evidence is also provided to substantiate our theoretical analyses.
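The parenthetical "minimizing KL-divergence (i.e., maximizing likelihood)" rests on a standard identity: for a fixed data distribution p, the expected negative log-likelihood of a model q equals the entropy of p plus KL(p || q), so the two objectives differ only by a constant. The sketch below checks this numerically on a made-up three-token distribution; the distributions `p`, `q_close`, and `q_far` are illustrative assumptions, not data from the paper.

```python
import math

def cross_entropy(p, q):
    """Expected negative log-likelihood (in nats) of model q under data distribution p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy of p in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distribution and two candidate models.
p = [0.5, 0.3, 0.2]
q_close = [0.45, 0.35, 0.20]
q_far = [0.1, 0.1, 0.8]

# cross_entropy(p, q) - entropy(p) equals KL(p || q) for any model q,
# so ranking models by likelihood and by KL gives the same order.
gaps = [cross_entropy(p, q) - entropy(p) for q in (q_close, q_far)]
kls = [kl_divergence(p, q) for q in (q_close, q_far)]
```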



RDD: Pareto Analysis of the Rate-Distortion-Distinguishability Trade-off

arXiv.org Artificial Intelligence

Extensive monitoring systems generate data that is usually compressed for network transmission. This compressed data might then be processed in the cloud for tasks such as anomaly detection. However, compression can potentially impair the detector's ability to distinguish between regular and irregular patterns due to information loss. Here we extend the information-theoretic framework introduced in [1] to simultaneously address the trade-off between the three features on which the effectiveness of the system depends: the effectiveness of compression, the amount of distortion it introduces, and the distinguishability between compressed normal signals and compressed anomalous signals. We leverage a Gaussian assumption to draw curves showing how moving on a Pareto surface helps administer such a trade-off better than simply relying on optimal rate-distortion compression and hoping that compressed signals can be distinguished from each other.
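As background for the "optimal rate-distortion compression" baseline mentioned above, the classical rate-distortion function of a memoryless Gaussian source with variance sigma^2 under squared-error distortion D is R(D) = max(0, 0.5 * log2(sigma^2 / D)) bits per sample. The sketch below computes this textbook quantity only; it does not reproduce the paper's distinguishability measure or its Pareto surface, and the function name `gaussian_rate` is a hypothetical choice.

```python
import math

def gaussian_rate(variance, distortion):
    """Rate-distortion function R(D) of a Gaussian source, in bits per sample.

    Uses the textbook formula R(D) = max(0, 0.5 * log2(variance / distortion))
    for mean-squared-error distortion.
    """
    if variance <= 0 or distortion <= 0:
        raise ValueError("variance and distortion must be positive")
    return max(0.0, 0.5 * math.log2(variance / distortion))

# Tolerating more distortion lowers the required rate, down to zero
# once the allowed distortion reaches the source variance.
rates = [gaussian_rate(1.0, d) for d in (0.0625, 0.25, 1.0, 2.0)]
```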


Natural Fingerprints of Large Language Models

arXiv.org Artificial Intelligence

Recent studies have shown that the outputs from large language models (LLMs) can often reveal the identity of their source model. While this is a natural consequence of LLMs modeling the distribution of their training data, such identifiable traces may also reflect unintended characteristics with potential implications for fairness and misuse. In this work, we go one step further and show that even when LLMs are trained on exactly the same dataset, their outputs remain distinguishable, suggesting that training dynamics alone can leave recognizable patterns. We refer to these unintended, distinctive characteristics as natural fingerprints. By systematically controlling training conditions, we show that the natural fingerprints can emerge from subtle differences in the training process, such as parameter sizes, optimization settings, and even random seeds. These results suggest that training dynamics can systematically shape model behavior, independent of data or architecture, and should be explicitly considered in future research on transparency, reliability, and interpretability.



When GNNs meet symmetry in ILPs: an orbit-based feature augmentation approach

arXiv.org Artificial Intelligence

A common characteristic in integer linear programs (ILPs) is symmetry, allowing variables to be permuted without altering the underlying problem structure. Recently, GNNs have emerged as a promising approach for solving ILPs. However, a significant challenge arises when applying GNNs to ILPs with symmetry: classic GNN architectures struggle to differentiate between symmetric variables, which limits their predictive accuracy. In this work, we investigate the properties of permutation equivariance and invariance in GNNs, particularly in relation to the inherent symmetry of ILP formulations. We reveal that the interaction between these two factors contributes to the difficulty of distinguishing between symmetric variables. To address this challenge, we explore the potential of feature augmentation and propose several guiding principles for constructing augmented features. Building on these principles, we develop an orbit-based augmentation scheme that first groups symmetric variables and then samples augmented features for each group from a discrete uniform distribution. Empirical results demonstrate that our proposed approach significantly enhances both training efficiency and predictive performance.

Integer Linear Programs (ILPs) are fundamental optimization problems characterized by a linear objective function and linear constraints, where the decision variables are restricted to integer values. These problems play a critical role in various fields, including operations research, computer science, and engineering (Pochet & Wolsey, 2006; Liu & Fan, 2018; Watson & Woodruff, 2011; Luathep et al., 2011; Schöbel, 2001).
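The orbit-based augmentation described in the abstract above (group symmetric variables, then sample a feature per variable within each group) can be sketched as follows. This is an illustrative reading, not the authors' implementation. It assumes the orbits are already given as lists of variable indices, and that values are drawn uniformly without replacement from {1, ..., |orbit|} so that variables in the same orbit receive distinct features. The function name `orbit_augment` and the default feature 0 for variables outside any orbit are hypothetical choices.

```python
import random

def orbit_augment(num_vars, orbits, seed=None):
    """Return one augmented integer feature per variable (illustrative sketch).

    orbits: list of lists of variable indices; each inner list is one group of
            mutually symmetric variables (assumed to be computed beforehand).
    Variables that belong to no orbit keep the default feature 0.
    """
    rng = random.Random(seed)
    features = [0] * num_vars
    for orbit in orbits:
        # Assumption: draw distinct values uniformly from {1, ..., |orbit|},
        # i.e. a uniformly random permutation, so symmetric variables differ.
        values = rng.sample(range(1, len(orbit) + 1), len(orbit))
        for var, val in zip(orbit, values):
            features[var] = val
    return features

# Hypothetical ILP with six variables: {0, 1, 2} and {3, 4} are symmetric
# groups, and variable 5 is not symmetric with any other variable.
features = orbit_augment(6, [[0, 1, 2], [3, 4]], seed=0)
```

The resulting features can be appended to each variable node's input features before the GNN is applied, which breaks the tie between otherwise indistinguishable variables.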