AITopics | logistic loss

We consider the parameter estimation problem in logistic regression with Gaussian design: the estimation of a fixed unknown parameter $θ^*\in \mathbb{R}^d$ ($\|θ^*\|_2\ge 1$) from $n$ i.i.d. samples $\{(x_i,y_i)\}_{i=1}^n$, where $x_i\sim N(0,I_d)$ and $y_i|x_i \sim {\rm Bernoulli}(1/(1+\exp(-x_i^\top θ^*)))$. Our main aim is to characterize the finite-sample estimation performance and convergence behavior of gradient descent (GD) on the maximum likelihood objective (i.e., the logistic loss). Under small $O(1)$ stepsize and $0$ initialization, we show that GD linearly converges to a small neighborhood of $θ^*$ achieving an $\ell_2$ error of order $O(\sqrt{\|θ^*\|_2^5d/n})$. This substantially goes beyond existing theoretical results that lack non-asymptotic estimation error rate and exhibit much slower parameter convergence. We also establish a faster local linear convergence to the same statistical error under a large $Θ(\|θ^*\|_2)$ stepsize. The main technical component is to show that the gradient of the logistic loss satisfies a certain approximate invertibility condition (AIC). To that end, we uniformly control the deviation of the gradient from its population counterpart by covering and peeling arguments, and then show that the population GD is a contraction by a delicate analysis based on the eigenvalues of population Hessian matrices. Finally, we build upon the recent work Matsumoto and Mazumdar (2025) and devise a novel efficient estimator that attains a sharper rate in high dimensions. This indicates that the existing non-asymptotic guarantees exhibit sub-optimal dependence on $\|θ^*\|_2$, and that in many regimes $Θ(\sqrt{\|θ^*\|_2d/n})$ is the tight estimation error rate. Numerical examples are provided to corroborate our theoretical results.

artificial intelligence, error rate, machine learning, (17 more...)

arXiv.org Machine Learning

2606.21683

Genre:

Research Report > New Finding (0.71)
Research Report > Experimental Study (0.61)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

Add feedback

Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

Boudart, Pierre, Gaillard, Pierre, Rudi, Alessandro

arXiv.org Machine LearningMay-20-2026

We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\barσ\_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\barσ\_T\sqrt{T})}$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\barσ\_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\smash{Ω(dH^2\barσ\_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2605.19768

Country: Europe > France (0.46)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

On Regularizing Rademacher Observation Losses

Richard Nock

Neural Information Processing SystemsMay-1-2026, 05:57:00 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, rado loss, (18 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback

eb189151ced0ff808abafd16a51fec92-Paper-Conference.pdf

Neural Information Processing SystemsApr-30-2026, 04:41:06 GMT

artificial intelligence, machine learning, stepsize, (16 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

c74214a3877c4d8297ac96217d5189b7-Supplemental.pdf

Neural Information Processing SystemsApr-27-2026, 00:28:43 GMT

algorithm, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Country: Europe > France (0.28)

Genre: Research Report > New Finding (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Mixability made efficient: Fast online multiclass logistic regression

Neural Information Processing SystemsApr-27-2026, 00:28:39 GMT

Mixability has been shown to be a powerful tool to obtain algorithms with optimal regret. However, the resulting methods often suffer from high computational complexity which has reduced their practical applicability. For example, in the case of multiclass logistic regression, the aggregating forecaster (Foster et al. (2018)) achieves a regret of O(log(Bn)) whereas Online Newton Step achieves O(eBlog(n)) obtaining a double exponential gain in B (a bound on the norm of comparative functions). However, this high statistical performance is at the price of a prohibitive computational complexity O(n37). In this paper, we use quadratic surrogates to make aggregating forecasters more efficient. We show that the resulting algorithm has still high statistical performance for a large class of losses. In particular, we derive an algorithm for multi-class logistic regression with a regret bounded by O(Blog(n)) and a computational complexity of only O(n4).

algorithm, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Country: Europe > France (0.29)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.93)

Add feedback

47a5feca4ce02883a5643e295c7ce6cd-Supplemental.pdf

Neural Information Processing SystemsApr-25-2026, 17:12:31 GMT

artificial intelligence, isotropic, machine learning, (18 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

47a5feca4ce02883a5643e295c7ce6cd-Paper.pdf

Neural Information Processing SystemsApr-25-2026, 17:12:27 GMT

artificial intelligence, arxiv preprint arxiv, machine learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States (0.28)
Europe (0.28)

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.97)

Add feedback

Invariance . the Initialized

Neural Information Processing SystemsApr-24-2026, 22:46:25 GMT

In this paper, we analyze neural networks trained on high-dimensional data that lies on a low dimen-441 sional linear subspace denoted by P. We assume that the dimension of P is d ℓ. Throughout the pa-442 per it will be more convenient to analyze data which lies on the subspace M = span({e1,...,ed ℓ}),443 because then the "off manifold" directions correspond exactly to certain coordinates of the input. In444 this section we show that we can essentially analyze the data as if it is rotated to lie on M, and it445 would imply the same consequences as the original data from P.446 Theorem A.1. Let P Rd be a subspace of dimension d ℓ, and let M = span{e1,...,ed ℓ}.447 Let R be an orthogonal matrix such that R P = M, let X P be a training dataset and let448 XR = {R x: x X}.

artificial intelligence, experiment, machine learning, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.90)

Add feedback