Goto

Collaborating Authors

 Prince George's County


Appendix AVariational Paragraph Embedder A.1 Selection of substitution rate p

Neural Information Processing Systems

Figure 4: Impact of the proportion of injected noise for learning Paragraph Embeddings on XSum dataset. PPLint and the PPL of the generation obtained from training PLANNER on the corresponding z at different noise level. We observed when the value of p is within (0, 0.7), there Performing a grid search on each task using diffusion models is an expensive process. However, it has been observed that an increase in the value of p leads to a deviation between the two. This could be attributed to a higher conversion error that occurs when p is excessively large. A.2 Selection of number of latent code k The parameter k determines the number of latent codes used to represent a paragraph and therefore controls the compression level. Latent codes with smaller values of k are easier to model using the diffusion model, but may struggle to accurately preserve all the information in the original text. Additionally, smaller values of k offer computational efficiency as the sequence length for the diffusion model is k. To determine the best set of latent codes, we conducted experiments using three different methods: 1) selecting the first k hidden vectors, 2) selecting the last k hidden vectors, and 3) selecting interleaving hidden vectors, one for every L k hidden vectors. The results of the ablation study are presented in Table 5. Based on our findings, we observed no significant difference among the different choices, so we opted for option 1). Furthermore, we discovered that increasing the value of k does not lead to a dramatic improvement in performance. To balance between efficiency and performance, in most of our study we only use k =16 Setup BLEU_clean BLEU_robust First k (k=16) 79.59 43.17 A.3 Reconstruction, denoising and interpolation examples In Table 6, we present examples that demonstrate the adeptness of the trained Variational Paragraph Embedder in providing clean and denoised reconstructions. Additionally, we showcase interpolation results (Table 7, 8) derived from two random sentences in the hotel review dataset. The interpolated paragraph is usually coherent and incorporates inputs from both sentences, characterizing the distributional smoothness of the latent space. Reconstructed text complaints: after two nights stay, i asked the maid to clean our room (empty the wastebasket & make the bed). Denoising reconstruction (hotel review), noise level 0.3 Original text * * * check out the bathroom picture * * * i was in nyc by myself to watch some friends participate in the us olympic marathon trials. Corrupted text * * [unused697] check exams the bathroom picture * * slams i was in nyc mead myself yankee 2016 some scotch ruin in the outfielder olympicnca trials.



EasyToHard

Neural Information Processing Systems

Deep neural networks are powerful machines for visual pattern recognition, but reasoning tasks that are easy for humans may still be difficult for neural models. Humans possess the ability to extrapolate reasoning strategies learned on simple problems to solve harder examples, often by thinking for longer. For example, a person who has learned to solve small mazes can easily extend the very same search techniques to solve much larger mazes by spending more time. In computers, this behavior is often achieved through the use of algorithms, which scale to arbitrarily hard problem instances at the cost of more computation. In contrast, the sequential computing budget of feed-forward neural networks is limited by their depth, and networks trained on simple problems have no way of extending their reasoning to accommodate harder problems. In this work, we show that recurrent networks trained to solve simple problems with few recurrent steps can indeed solve much more complex problems simply by performing additional recurrences during inference. We demonstrate this algorithmic behavior of recurrent networks on prefix sum computation, mazes, and chess. In all three domains, networks trained on simple problem instances are able to extend their reasoning abilities at test time simply by "thinking for longer."


Beyond Expected Information Gain: Stable Bayesian Optimal Experimental Design with Integral Probability Metrics and Plug-and-Play Extensions

arXiv.org Machine Learning

Bayesian Optimal Experimental Design (BOED) provides a rigorous framework for decision-making tasks in which data acquisition is often the critical bottleneck, especially in resource-constrained settings. Traditionally, BOED typically selects designs by maximizing expected information gain (EIG), commonly defined through the Kullback-Leibler (KL) divergence. However, classical evaluation of EIG often involves challenging nested expectations, and even advanced variational methods leave the underlying log-density-ratio objective unchanged. As a result, support mismatch, tail underestimation, and rare-event sensitivity remain intrinsic concerns for KL-based BOED. To address these fundamental bottlenecks, we introduce an IPM-based BOED framework that replaces density-based divergences with integral probability metrics (IPMs), including the Wasserstein distance, Maximum Mean Discrepancy, and Energy Distance, resulting in a highly flexible plug-and-play BOED framework. We establish theoretical guarantees showing that IPM-based utilities provide stronger geometry-aware stability under surrogate-model error and prior misspecification than classical EIG-based utilities. We also validate the proposed framework empirically, demonstrating that IPM-based designs yield highly concentrated credible sets. Furthermore, by extending the same sample-based BOED template in a plug-and-play manner to geometry-aware discrepancies beyond the IPM class, illustrated by a neural optimal transport estimator, we achieve accurate optimal designs in high-dimensional settings where conventional nested Monte Carlo estimators and advanced variational methods fail.


PAC-Bayes Bounds for Gibbs Posteriors via Singular Learning Theory

arXiv.org Machine Learning

We derive explicit non-asymptotic PAC-Bayes generalization bounds for Gibbs posteriors, that is, data-dependent distributions over model parameters obtained by exponentially tilting a prior with the empirical risk. Unlike classical worst-case complexity bounds based on uniform laws of large numbers, which require explicit control of the model space in terms of metric entropy (integrals), our analysis yields posterior-averaged risk bounds that can be applied to overparameterized models and adapt to the data structure and the intrinsic model complexity. The bound involves a marginal-type integral over the parameter space, which we analyze using tools from singular learning theory to obtain explicit and practically meaningful characterizations of the posterior risk. Applications to low-rank matrix completion and ReLU neural network regression and classification show that the resulting bounds are analytically tractable and substantially tighter than classical complexity-based bounds. Our results highlight the potential of PAC-Bayes analysis for precise finite-sample generalization guarantees in modern overparameterized and singular models.


A Deep Generative Approach to Stratified Learning

arXiv.org Machine Learning

While the manifold hypothesis is widely adopted in modern machine learning, complex data is often better modeled as stratified spaces -- unions of manifolds (strata) of varying dimensions. Stratified learning is challenging due to varying dimensionality, intersection singularities, and lack of efficient models in learning the underlying distributions. We provide a deep generative approach to stratified learning by developing two generative frameworks for learning distributions on stratified spaces. The first is a sieve maximum likelihood approach realized via a dimension-aware mixture of variational autoencoders. The second is a diffusion-based framework that explores the score field structure of a mixture. We establish the convergence rates for learning both the ambient and intrinsic distributions, which are shown to be dependent on the intrinsic dimensions and smoothness of the underlying strata. Utilizing the geometry of the score field, we also establish consistency for estimating the intrinsic dimension of each stratum and propose an algorithm that consistently estimates both the number of strata and their dimensions. Theoretical results for both frameworks provide fundamental insights into the interplay of the underlying geometry, the ambient noise level, and deep generative models. Extensive simulations and real dataset applications, such as molecular dynamics, demonstrate the effectiveness of our methods.