permutation
Identifiable Bayesian Deep Generative Copulas with Unknown Layer Widths for Data with Arbitrary Marginal Distributions
Deep generative models offer powerful tools for multivariate data analysis, but their black-box architectures are often unidentified and difficult to interpret. We introduce the Deep Discrete Encoder (DDE) Copula, an identifiable and interpretable generative model for multivariate data with arbitrary marginal distributions. The model places a hierarchical directed network of binary latent variables inside a copula framework, enabling flexible dependence modeling for mixed discrete and continuous data. Estimation is based on rank likelihoods, which decouple marginal modeling from posterior inference on the DDE parameters and avoid specifying the marginal distributions. We establish conditions for identification of the DDE copula parameters, ensuring that layer-specific parameters provide meaningful summaries of multivariate dependence. We also prove quotient-space posterior consistency for continuous margins under the exact rank likelihood and treat the extended rank likelihood for tied or mixed margins as a generalized likelihood, with concentration under an additional contrast condition. For computation, we propose a stochastic expectation-maximization algorithm for \emph{maximum a posteriori} estimation, together with initialization strategies that improve convergence. To learn network dimension adaptively, we extend Bayesian rank-selection priors to infer layer-specific widths. Simulations show strong finite-sample performance, and a personality-survey analysis reveals interpretable hierarchical latent structure in complex multivariate data.
StrTransformer: Source-Wise Structured Transformers for Unsupervised Blind Source Recovery
This paper proposes StrTransformer, a source-wise structured Transformer framework for blind source recovery and branch-wise latent modeling. Instead of using an encoder to infer latent variables, StrTransformer directly optimizes the latent source matrix together with an observation-space mixer and source-wise structural Transformer branches. The mixer enforces reconstruction consistency, while each Transformer branch imposes a differentiable structural constraint on one latent source trajectory. Specifically, each source is converted into multi-scale patch tokens, randomly masked, processed by a locality-biased Transformer, and evaluated through a masked patch reconstruction energy. This energy acts as an implicit source-wise structural prior. To encourage different latent branches to specialize into different temporal regimes, StrTransformer further introduces an ordered multi-scale controller that learns branch-specific patch-scale weights, ordered scale centers, and locality attention slopes. The resulting objective combines observation reconstruction, source-wise structural regularization, and modular auxiliary penalties for separation and scale specialization. We analyze the decoupling and coupling structure of the objective, the regularized exact-reconstruction fiber, and the reduction of permutation symmetry induced by ordered branch descriptors. A controlled case study shows that the learned branches converge to distinct temporal-scale structures and recover source-aligned latent trajectories under post-hoc evaluation.
A Martingale Kernel Independence Test
Laumann, Felix, Liu, Zhaolu, Barahona, Mauricio
The Hilbert-Schmidt Independence Criterion (HSIC) and its joint-independence extension $d\mathrm{HSIC}$ are degenerate $V$-statistics whose data-dependent weighted-$ฯ^2$ null limits force a permutation calibration that multiplies the per-test cost by the number of permutations, in practice two orders of magnitude. Adapting the recent martingale MMD construction for two-sample testing to the (joint) independence problem, we introduce two studentised statistics whose null distributions are standard normal regardless of the data law, so that a single normal-quantile lookup replaces the permutation step entirely. The first, $m\mathrm{HSIC}$, is a self-normalised lower-triangular sum of the Hadamard product of two empirically centred Gram matrices. Under independence and bounded-fourth-moment kernels it converges to a standard normal. It is consistent against every fixed alternative, and runs at quadratic cost in the sample size without any sample split, matching the biased HSIC $V$-statistic. Our second statistic, $md\mathrm{HSIC}$, achieves finite-sample consistency with a single half-sample split: the centring is estimated on one half and the lower-triangular self-normalised martingale is run on the other, shrinking the conditional-mean residual to a quantity that is exponentially small in $d$, so the statistic is asymptotically standard normal at every fixed number of jointly tested variables, with a per-test cost that grows only linearly in $d$. On synthetic data with per-variable input dimension from $1$ to $500$ and between $2$ and $10$ jointly tested variables, both statistics match the empirical type-I error rate and test power of permutation-calibrated baselines while running $25$ to $60\times$ faster.
Causal Inference with Categorical Unobserved Confounder via Mixture Learning
Saha, Aytijhya, Bates, Stephen, Shah, Devavrat
Unobserved confounding is a fundamental challenge for estimating causal effects. To address unobserved confounding, recent literature has turned to two different approaches -- proxy variables and the use of multiple treatments. The first approach, commonly referred to as proximal causal inference, requires proxies to be assigned to specific asymmetric roles: treatment-inducing proxies (negative control exposures), variables that act as common causes of the treatment and outcome, and outcome-inducing proxies (negative control outcomes). In practice, however, identifying variables that satisfy these asymmetric roles can be difficult depending on the application domain. The second approach, commonly referred to as the ``Deconfounder," deals with multiple conditionally independent treatments. There has been limited progress towards developing a consistent estimation method for this setting. As the primary contribution of this work, we establish that causal effects are identifiable in both settings when the unobserved confounder is categorical under suitable conditions. Our approach builds on a mixture learning perspective: we show that the underlying confounding structure can be recovered by identifying the corresponding mixture distribution. We propose an estimation procedure based on tensor decomposition, which allows consistent recovery of the latent structure and comes with non-asymptotic guarantees. Simulation studies and real data experiments demonstrate that the proposed method performs well even with limited data.
Embedding Dimension Lower Bounds for Universality of Deep Sets and Janossy Pooling
Syed, Ali, Nambiar, Aditya, Siegel, Jonathan W.
In many practical applications it is important to build symmetries into neural network architectures. Consider the important case of permutation symmetry on point clouds consisting of $n$ points in $d$ dimensions. In this case the network learns a function on a set of $n$ points in $\mathbb{R}^d$, and a natural paradigm for constructing invariant networks is Janossy pooling, which generalizes the popular Deep Sets architecture. We study the universality of this approach, in particular the important question of how large the embedding dimension must be to guarantee universality of this architecture. Specifically, using a novel technique, we prove new lower bounds on the required size of this embedding dimension. For Deep Sets, this gives the correct minimal dimension up to a constant factor for all $d > 1$. For $k$-ary Janossy pooling, we prove the first non-trivial lower bound on the required embedding dimension when $k > 1$.
Kernel Selection is Model Selection: A Unified Complexity-Penalized Approach for MMD Two-Sample Tests
The Maximum Mean Discrepancy (MMD) is a cornerstone statistic for nonparametric two-sample testing, but its test power is dictated entirely by the chosen kernel. Because any fixed kernel inherently fails to distinguish certain distributions, the kernel must be dynamically optimized. However, data-driven optimization violates the foundational i.i.d. assumption, forcing a strict trade-off in existing frameworks. Ratio criteria ignore this dependence, inducing overfitting and variance collapse on rich kernel classes. Conversely, aggregation methods bypass the dependence using finite grids, but this strategy cannot scale to continuous search spaces like deep kernels. To break this dichotomy, we establish data-driven kernel selection as a model selection problem. We propose Complexity-Penalized MMD (CP-MMD), a criterion derived by applying the two-sample uniform concentration inequality of preceding works to the post-optimization MMD problem. The resulting penalty bounds the empirical MMD by the complexity of the kernel search space, mathematically absorbing the cost of optimization, so that CP-MMD enables direct, grid-free maximization over continuous parametric classes, including scalar bandwidths, polynomial feature bandwidths, and deep network parameters. By formally accounting for optimization complexity, we prove that CP-MMD maximizes true test power while ensuring unconditional Type-I validity. Consequently, CP-MMD enables grid-free kernel selection across linear, polynomial-feature, and deep regimes, matching or exceeding state-of-the-art test power.
End-to-End Identifiable and Consistent Recurrent Switching Dynamical Systems
Balsells-Rodas, Carles, Xiang, Zhengrui, Sumba, Xavier, Li, Yingzhen
Learning identifiable representations in deep generative models remains a fundamental challenge, particularly for sequential data with regime-switching dynamics. Existing approaches establish identifiability under restrictive assumptions, such as stationarity or limited emission models, and typically rely on variational autoencoder (VAE) estimators, which introduce approximation gaps that limit the recovery of the latent structure. In this work, we address both the theoretical and practical limitations of this setting. First, we establish identifiability of a broad class of recurrent nonlinear switching dynamical systems under flexible assumptions, significantly extending prior results. Second, we introduce $ฮฉ$SDS, a flow-based estimator that enables exact likelihood optimization using expectation-maximisation. Through empirical validation on both synthetic and real-world data, our results demonstrate that $ฮฉ$SDS achieves improved disentanglement compared to VAE-based estimators and more accurate forecasting of underlying dynamics.
Topological Signatures of Grokking
Tang, Yifan, Wang, Qiquan, Garcรญa-Redondo, Inรฉs, Monod, Anthea
We study the grokking phenomenon through the lens of topology. Using persistent homology on point clouds derived from the embedding matrices of a range of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology ($H_1$). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured secondary features, reflecting the underlying cyclic structure of the task. Compared to existing spectral and geometric diagnostics -- specifically, Fourier analysis and local intrinsic dimension -- persistent homology provides a unified geometric and topological characterization of representation learning, capturing both local and global multi-scale structure. Ablations across data regimes and control settings show that these topological transitions are tied to generalization rather than memorization. Our results suggest that persistent homology offers a principled and interpretable framework for analyzing how neural networks internalize latent structure during training.
Proximal Projection for Doubly Sparse Regularized Models
He, Jia Wei, Ali, R. Ayesha, Darlington, Gerarda
Regularization is often used in high-dimensional regression settings to generate a sparse model, which can save tremendous computing resources and identify predictors that are most strongly associated with the response. When the predictors can be represented by a Gaussian graphical model, the structure of the predictor graph can be exploited during regularization. Our proposed model exploits this underlying predictor graph structure by decomposing the estimated coefficient vector into a sum of latent variables that correspond to the sum of each node contribution to the coefficient vector. Regularization is then performed on the latent variables rather than on the coefficient vector directly. We use a penalty function that permits a clear user-defined trade-off between the L1 and L2 penalties and propose a novel proximal projection during optimization. Further, our implementation computes the projection operator for the intersection of selected groups, which conserves more computing resources compared to predictor duplication methods, especially for high-dimensional data. Through simulation, we evaluate the performance of our approach under different graph structures and node counts, and present results on real-world data. Results suggest that our method exhibits stable performance relative to other singly or doubly sparse graphical regression models.