equivalently
Gaussian mixture models in Hilbert spaces via kernel methods
López-Montero, Daniel, Álvarez-López, Antonio, Matabuena, Marcos
Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including $L^2$-functional data and random graphs in Laplacian spaces arising in modern medical applications.
Contents of Appendix
Bayes-consistency only holds for the full family of measurable functions, which of course is distinct from the more restricted hypothesis set used by a learning algorithm. Therefore, a hypothesis setdependent notion of H-consistency has been proposed by Long and Servedio (2013) in the realizable setting, used by Zhang and Agarwal (2020) for linear models, and generalized by Kuznetsov et al. (2014) to the structured prediction case. Long and Servedio (2013) showed that there exists a case where a Bayes-consistent loss is not H-consistent while inconsistent losses can be H-consistent. Zhang and Agarwal (2020) further investigated the phenomenon in (Long and Servedio, 2013) and showed that the situation of losses that are not H-consistent with linear models can be remedied by carefully choosing a larger piecewise linear hypothesis set. Kuznetsov et al. (2014) proved positive results for the H-consistency of several multi-class ensemble algorithms, as an extension of H-consistency results in (Long and Servedio, 2013). Recently, the notions of H-calibration and H-consistency have been used by Bao et al. (2020); Awasthi et al. (2021a) in the study of adversarial binary classification losses, as defined in (Goodfellow et al., 2014; Madry et al., 2017; Tsipras et al., 2018; Carlini and Wagner, 2017; Awasthi et al., 2023).
Few Batches or Little Memory, But Not Both: Simultaneous Space and Adaptivity Constraints in Stochastic Bandits
Huang, Ruiyuan, Lyu, Zicheng, Zhu, Xiaoyi, Huang, Zengfeng
We study stochastic multi-armed bandits under simultaneous constraints on space and adaptivity: the learner interacts with the environment in $B$ batches and has only $W$ bits of persistent memory. Prior work shows that each constraint alone is surprisingly mild: near-minimax regret $\widetilde{O}(\sqrt{KT})$ is achievable with $O(\log T)$ bits of memory under fully adaptive interaction, and with a $K$-independent $O(\log\log T)$-type number of batches when memory is unrestricted. We show that this picture breaks down in the simultaneously constrained regime. We prove that any algorithm with a $W$-bit memory constraint must use at least $Ω(K/W)$ batches to achieve near-minimax regret $\widetilde{O}(\sqrt{KT})$, even under adaptive grids. In particular, logarithmic memory rules out $O(K^{1-\varepsilon})$ batch complexity. Our proof is based on an information bottleneck. We show that near-minimax regret forces the learner to acquire $Ω(K)$ bits of information about the hidden set of good arms under a suitable hard prior, whereas an algorithm with $B$ batches and $W$ bits of memory allows only $O(BW)$ bits of information. A key ingredient is a localized change-of-measure lemma that yields probability-level arm exploration guarantees, which is of independent interest. We also give an algorithm that, for any bit budget $W$ with $Ω(\log T) \le W \le O(K\log T)$, uses at most $W$ bits of memory and $\widetilde{O}(K/W)$ batches while achieving regret $\widetilde{O}(\sqrt{KT})$, nearly matching our lower bound up to polylogarithmic factors.
Fair regression under localized demographic parity constraints
Charpentier, Arthur, Denis, Christophe, Elie, Romuald, Hebiri, Mohamed, HU, François
Demographic parity (DP) is a widely used group fairness criterion requiring predictive distributions to be invariant across sensitive groups. While natural in classification, full distributional DP is often overly restrictive in regression and can lead to substantial accuracy loss. We propose a relaxation of DP tailored to regression, enforcing parity only at a finite set of quantile levels and/or score thresholds. Concretely, we introduce a novel (${\ell}$, Z)-fair predictor, which imposes groupwise CDF constraints of the form F f |S=s (z m ) = ${\ell}$ m for prescribed pairs (${\ell}$ m , z m ). For this setting, we derive closed-form characterizations of the optimal fair discretized predictor via a Lagrangian dual formulation and quantify the discretization cost, showing that the risk gap to the continuous optimum vanishes as the grid is refined. We further develop a model-agnostic post-processing algorithm based on two samples (labeled for learning a base regressor and unlabeled for calibration), and establish finite-sample guarantees on constraint violation and excess penalized risk. In addition, we introduce two alternative frameworks where we match group and marginal CDF values at selected score thresholds. In both settings, we provide closed-form solutions for the optimal fair discretized predictor. Experiments on synthetic and real datasets illustrate an interpretable fairness-accuracy trade-off, enabling targeted corrections at decision-relevant quantiles or thresholds while preserving predictive performance.
Discrete Causal Representation Learning
Zhang, Wenjin, Wang, Yixin, Gu, Yuqi
Causal representation learning seeks to uncover causal relationships among high-level latent variables from low-level, entangled, and noisy observations. Existing approaches often either rely on deep neural networks, which lack interpretability and formal guarantees, or impose restrictive assumptions like linearity, continuous-only observations, and strong structural priors. These limitations particularly challenge applications with a large number of discrete latent variables and mixed-type observations. To address these challenges, we propose discrete causal representation learning (DCRL), a generative framework that models a directed acyclic graph among discrete latent variables, along with a sparse bipartite graph linking latent and observed layers. This design accommodates continuous, count, and binary responses through flexible measurement models while maintaining interpretability. Under mild conditions, we prove that both the bipartite measurement graph and the latent causal graph are identifiable from the observed data distribution alone. We further propose a three-stage estimate-resample-discovery pipeline: penalized estimation of the generative model parameters, resampling of latent configurations from the fitted model, and score-based causal discovery on the resampled latents. We establish the consistency of this procedure, ensuring reliable recovery of the latent causal structure. Empirical studies on educational assessment and synthetic image data demonstrate that DCRL recovers sparse and interpretable latent causal structures.
LassoFlexNet: Flexible Neural Architecture for Tabular Data
Lui, Kry Yik Chau, Chi, Cheng, Basu, Kishore, Cao, Yanshuai
Despite their dominance in vision and language, deep neural networks often underperform relative to tree-based models on tabular data. To bridge this gap, we incorporate five key inductive biases into deep learning: robustness to irrelevant features, axis alignment, localized irregularities, feature heterogeneity, and training stability. We propose \emph{LassoFlexNet}, an architecture that evaluates the linear and nonlinear marginal contribution of each input via Per-Feature Embeddings, and sparsely selects relevant variables using a Tied Group Lasso mechanism. Because these components introduce optimization challenges that destabilize standard proximal methods, we develop a \emph{Sequential Hierarchical Proximal Adaptive Gradient optimizer with exponential moving averages (EMA)} to ensure stable convergence. Across $52$ datasets from three benchmarks, LassoFlexNet matches or outperforms leading tree-based models, achieving up to a $10$\% relative gain, while maintaining Lasso-like interpretability. We substantiate these empirical results with ablation studies and theoretical proofs confirming the architecture's enhanced expressivity and structural breaking of undesired rotational invariance.
Hard labels sampled from sparse targets mislead rotation invariant algorithms
Ghosh, Avrajit, Yu, Bin, Warmuth, Manfred, Bartlett, Peter
One of the most common machine learning setups is logistic regression. In many classification models, including neural networks, the final prediction is obtained by applying a logistic link function to a linear score. In binary logistic regression, the feedback can be either soft labels, corresponding to the true conditional probability of the data (as in distillation), or sampled hard labels (taking values $\pm 1$). We point out a fundamental problem that arises even in a particularly favorable setting, where the goal is to learn a noise-free soft target of the form $σ(\mathbf{x}^{\top}\mathbf{w}^{\star})$. In the over-constrained case (i.e. the number of samples $n$ exceeds the input dimension $d$) with examples $(\mathbf{x}_i,σ(\mathbf{x}_i^{\top}\mathbf{w}^{\star}))$, it is sufficient to recover $\mathbf{w}^{\star}$ and hence achieve the Bayes risk. However, we prove that when the examples are labeled by hard labels $y_i$ sampled from the same conditional distribution $σ(\mathbf{x}_i^{\top}\mathbf{w}^{\star})$ and $\mathbf{w}^{\star}$ is $s$-sparse, then rotation-invariant algorithms are provably suboptimal: they incur an excess risk $Ω\!\left(\frac{d-1}{n}\right)$, while there are simple non-rotation invariant algorithms with excess risk $O(\frac{s\log d}{n})$. The simplest rotation invariant algorithm is gradient descent on the logistic loss (with early stopping). A simple non-rotation-invariant algorithm for sparse targets that achieves the above upper bounds uses gradient descent on the weights $u_i,v_i$, where now the linear weight $w_i$ is reparameterized as $u_iv_i$.
Locally-AdaptiveNonparametricOnlineLearning: SupplementaryMaterial
In case of generic convex losses, we use the more complex parameterless algorithm AdaNormalHedge. The following theorem states a slightly more general bound that holds for anyη-exp-concave loss function (for completeness,theproofisgiveninAppendixD). Nownotethatalthough the algorithm is actually initialized withw1,i = 1, Lemma 1 shows that the regret remains the same if we assume the algorithm is initialized withwE1. Suppose that Algorithm 5 is run using predictions and updates provided by AdaNormalHedge. Asinourlocally-adaptive setting node experts are local learners,byi,t should be viewed as the prediction of the local online learning algorithm sitting at nodeiof the tree.