mle
A Proofs of Linear Case
Throughout the appendix, for ease of notation, we overload the definition of the function d ...
The proof of this lemma requires Lemma A.1, which characterizes the distribution of the residual ...
By Pinsker's inequality, this implies d ...
By Lemma A.1, we have E[ X ( null w w ...
The proof is inspired by Theorem 11.2 in [20], with modifications to our setting.
First, we construct a "ghost" dataset ...
The most challenging aspect of the ReLU setting is that we do not have an expression for the TV suffered by the MLE, such as Lemma 4.2 in the linear case.
The proof of this lemma, as well as the other lemmas in this section, can be found in Appendix B.1.
Using Lemma B.2 and Lemma B.3, we can form a uniform bound, such that all ...
A straightforward combination of Lemma 4.3 and Lemma B.4 gives the following theorem.
Now we can apply Bernstein's inequality (Theorem 2.10 of [8]).
Multinomial Logistic Regression: Asymptotic Normality on Null Covariates in High-Dimensions
This paper investigates the asymptotic distribution of the maximum-likelihood estimate (MLE) in multinomial logistic models in the high-dimensional regime where dimension and sample size are of the same order. While classical large-sample theory provides asymptotic normality of the MLE under certain conditions, such classical results are expected to fail in high dimensions, as documented for the binary logistic case in the seminal work of Sur and Candès [2019]. We address this issue in classification problems with 3 or more classes by developing asymptotic normality and asymptotic chi-square results for the multinomial logistic MLE (also known as the cross-entropy minimizer) on null covariates. Our theory leads to a new methodology to test the significance of a given feature. Extensive simulation studies on synthetic data corroborate these asymptotic results and confirm the validity of the proposed p-values for testing the significance of a given feature.
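To make the phrase "multinomial logistic MLE (also known as the cross-entropy minimizer)" concrete, here is a minimal sketch that minimizes the average cross-entropy by plain gradient descent. Everything in it is an illustrative assumption rather than the paper's setup: the sizes `n`, `d`, `K`, the step size, the absence of an intercept, and the helper name `nll_and_grad`. The labels are drawn independently of the features, so every covariate is null, but the low-dimensional sizes used here do not reproduce the high-dimensional regime the paper studies.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax of a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def nll_and_grad(W, X, y):
    """Average cross-entropy (negative log-likelihood) and its gradient.

    W has one row of d weights per class; no intercept (an assumption).
    """
    K, d, n = len(W), len(W[0]), len(X)
    loss = 0.0
    grad = [[0.0] * d for _ in range(K)]
    for xi, yi in zip(X, y):
        p = softmax([sum(w * v for w, v in zip(W[k], xi)) for k in range(K)])
        loss -= math.log(p[yi]) / n
        for k in range(K):
            coef = (p[k] - (1.0 if k == yi else 0.0)) / n
            for j in range(d):
                grad[k][j] += coef * xi[j]
    return loss, grad

rng = random.Random(0)
n, d, K = 200, 2, 3  # hypothetical sizes, chosen only for a quick run
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [rng.randrange(K) for _ in range(n)]  # labels independent of X: all covariates are null

W = [[0.0] * d for _ in range(K)]
loss_start, _ = nll_and_grad(W, X, y)  # equals log(K) at W = 0
for _ in range(100):
    _, g = nll_and_grad(W, X, y)
    W = [[W[k][j] - 0.5 * g[k][j] for j in range(d)] for k in range(K)]
loss_end, _ = nll_and_grad(W, X, y)
```

With one weight row per class the minimizer is only determined up to a common shift of the rows; fixing a reference class would be one way to remove that ambiguity.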
ColdGANs: Taming Language GANs with Cautious Sampling Strategies
Training regimes based on Maximum Likelihood Estimation (MLE) suffer from known limitations, often leading to poorly generated text sequences that lack coherence and factualness and are prone to repetition. At the root of these limitations is the mismatch between training and inference, i.e., the so-called exposure bias. Another problem lies in considering only the reference text as correct, while in practice several alternative formulations could be just as good. Generative Adversarial Networks (GANs) could mitigate those limitations. Nonetheless, the discrete nature of text has hindered their application to language generation: the approaches proposed so far, based on Reinforcement Learning, have been shown to under-perform MLE.
Beyond MLE: Convex Learning for Text Generation
Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution that best explain the observed data. In the context of text generation, MLE is often used to train generative language models, which can then be used to generate new text. However, we argue that MLE is not always necessary or optimal, especially for closed-ended text generation tasks like machine translation. In these tasks, the goal of the model is to generate the most appropriate response, which does not necessarily require it to estimate the entire data distribution with MLE. To this end, we propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution. We investigate the theoretical properties of the optimal predicted distribution when applying convex functions to the loss, demonstrating that convex functions can sharpen the optimal distribution, thereby enabling the model to better capture outputs with high probabilities. Experiments on various text generation tasks and models show the effectiveness of our approach. It enables autoregressive models to bridge the gap between greedy and beam search, and facilitates the learning of non-autoregressive models with a maximum improvement of 9+ BLEU points. Moreover, our approach also has a significant impact on large language models (LLMs), substantially enhancing their generative capability on various tasks.
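The sharpening effect described above can be illustrated on a toy three-outcome problem. The sketch below assumes an objective of the form "maximize the expected value of f(q(x)) under the data distribution p"; with f = log this is MLE, and f(q) = q^2 is used as a stand-in convex function. Both the objective form and the choice of f are assumptions made for illustration, not necessarily the paper's formulation, and `best_on_simplex_grid` is a hypothetical helper that searches a grid on the probability simplex.

```python
import math

def best_on_simplex_grid(p, f, n=100):
    """Return grid counts (i, j, k) with i + j + k = n that maximize
    sum_c p[c] * f(q[c]), where q = (i/n, j/n, k/n)."""
    best, best_val = None, -math.inf
    for i in range(n + 1):
        for j in range(n + 1 - i):
            q = (i / n, j / n, (n - i - j) / n)
            val = sum(pc * f(qc) for pc, qc in zip(p, q))
            if val > best_val:
                best, best_val = (i, j, n - i - j), val
    return best

def safe_log(q):
    """log(q), with log(0) treated as minus infinity."""
    return math.log(q) if q > 0 else -math.inf

p = (0.5, 0.3, 0.2)  # toy data distribution over three outputs

mle_opt = best_on_simplex_grid(p, safe_log)            # log objective (MLE)
convex_opt = best_on_simplex_grid(p, lambda q: q * q)  # illustrative convex f
```

Under the log objective the best grid point reproduces the data distribution, `(50, 30, 20)` out of 100. Under the convex choice, all mass moves to the most probable output, `(100, 0, 0)`, which is the sense in which the optimum is sharper.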
On Learning-Curve Monotonicity for Maximum Likelihood Estimators
The property of learning-curve monotonicity, highlighted in a recent series of works by Loog, Mey and Viering, describes algorithms which only improve in average performance given more data, for any underlying data distribution within a given family. We establish the first nontrivial monotonicity guarantees for the maximum likelihood estimator in a variety of well-specified parametric settings. For sequential prediction with log loss, we show monotonicity (in fact complete monotonicity) of the forward KL divergence for Gaussian vectors with unknown covariance and either known or unknown mean, as well as for Gamma variables with unknown scale parameter. The Gaussian setting was explicitly highlighted as open in the aforementioned works, even in dimension 1. Finally, we observe that for reverse KL divergence, a folklore trick yields monotonicity for very general exponential families. All results in this paper were derived by variants of GPT-5.2 Pro. Humans did not provide any proof strategies or intermediate arguments, but only prompted the model to continue developing additional results, and verified and transcribed its proofs.
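As a point of reference for what monotonicity of the expected forward KL divergence means, consider the simplest textbook case, which is not one of the paper's results: a one-dimensional Gaussian with unknown mean and known variance. The notation below ($\mu$, $\sigma^2$, $\hat\mu_n$) is introduced here for illustration only.

```latex
\begin{align*}
X_1,\dots,X_n &\overset{\text{iid}}{\sim} \mathcal{N}(\mu,\sigma^2),
\qquad \hat\mu_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \\
\mathrm{KL}\big(\mathcal{N}(\mu,\sigma^2)\,\big\|\,\mathcal{N}(\hat\mu_n,\sigma^2)\big)
&= \frac{(\mu-\hat\mu_n)^2}{2\sigma^2}, \\
\mathbb{E}\Big[\mathrm{KL}\big(\mathcal{N}(\mu,\sigma^2)\,\big\|\,\mathcal{N}(\hat\mu_n,\sigma^2)\big)\Big]
&= \frac{\operatorname{Var}(\hat\mu_n)}{2\sigma^2}
= \frac{\sigma^2/n}{2\sigma^2}
= \frac{1}{2n}.
\end{align*}
```

The expected divergence $1/(2n)$ is strictly decreasing in $n$, so the learning curve is monotone in this easy case. The settings treated in the abstract, with unknown covariance, are harder because the estimated variance also enters the divergence.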
Matching correlated VAR time series
We study the problem of matching correlated VAR time series databases, where a multivariate time series is observed along with a perturbed and permuted version, and the goal is to recover the unknown matching between them. To model this, we introduce a probabilistic framework in which two time series $(x_t)_{t\in[T]},(x^\#_t)_{t\in[T]}$ are jointly generated, such that $x^\#_t=x_{\pi^*(t)}+\sigma\tilde{x}_{\pi^*(t)}$, where $(x_t)_{t\in[T]},(\tilde{x}_t)_{t\in[T]}$ are independent and identically distributed vector autoregressive (VAR) time series of order $1$ with Gaussian increments, for a hidden $\pi^*$. The objective is to recover $\pi^*$ from the observation of $(x_t)_{t\in[T]},(x^\#_t)_{t\in[T]}$. This generalizes the classical problem of matching independent point clouds to the time series setting. We derive the maximum likelihood estimator (MLE), leading to a quadratic optimization over permutations, and theoretically analyze an estimator based on linear assignment. For the latter approach, we establish recovery guarantees, identifying thresholds for $\sigma$ that allow for perfect or partial recovery. Additionally, we propose solving the MLE by considering convex relaxations of the set of permutation matrices (e.g., over the Birkhoff polytope). This allows for efficient estimation of $\pi^*$ and the VAR parameters via alternating minimization. Empirically, we find that linear assignment often matches or outperforms MLE-relaxation-based approaches.
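The generative model and a linear-assignment estimator can be sketched on a tiny instance as follows. Several pieces are assumptions made for illustration: the VAR(1) coefficient is a scalar `a`, the assignment cost is the squared Euclidean distance between $x^\#_t$ and $x_s$, and the assignment problem is solved by brute force over all permutations, which is only feasible for very small $T$. The helper names are hypothetical. For realistic sizes, a dedicated solver such as SciPy's `linear_sum_assignment` would be the usual choice.

```python
import itertools
import random

def simulate_var1(T, d, a, rng):
    """VAR(1) path x_t = a * x_{t-1} + Gaussian increment (scalar coefficient a)."""
    x = [[rng.gauss(0.0, 1.0) for _ in range(d)]]
    for _ in range(T - 1):
        x.append([a * v + rng.gauss(0.0, 1.0) for v in x[-1]])
    return x

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(u, v))

def match_by_linear_assignment(x, x_sharp):
    """Brute-force linear assignment: the permutation pi minimizing
    sum_t ||x_sharp[t] - x[pi[t]]||^2 (small T only)."""
    T = len(x)
    cost = [[sq_dist(x_sharp[t], x[s]) for s in range(T)] for t in range(T)]
    best = min(itertools.permutations(range(T)),
               key=lambda p: sum(cost[t][p[t]] for t in range(T)))
    return list(best)

rng = random.Random(0)
T, d, sigma = 6, 3, 1e-3  # hypothetical sizes and noise level
x = simulate_var1(T, d, 0.5, rng)
x_tilde = simulate_var1(T, d, 0.5, rng)  # independent copy used as noise

pi_star = list(range(T))
rng.shuffle(pi_star)  # hidden matching

# x_sharp[t] = x[pi_star[t]] + sigma * x_tilde[pi_star[t]]
x_sharp = [[x[pi_star[t]][k] + sigma * x_tilde[pi_star[t]][k] for k in range(d)]
           for t in range(T)]

pi_hat = match_by_linear_assignment(x, x_sharp)
```

At this low noise level the estimate `pi_hat` coincides with `pi_star`; raising `sigma` makes mismatches increasingly likely, which is the regime the recovery thresholds in the abstract are about.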