Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows

Neural Information Processing Systems

Autoregressive models have driven remarkable progress in language modeling. Their foundational reliance on discrete tokens, unidirectional context, and single-pass decoding, while central to their success, also inspires the exploration of a design space that could offer new axes of modeling flexibility. In this work, we explore an alternative paradigm, shifting language modeling from a discrete token space to a continuous latent space. We propose a novel framework, TarFlowLM, that employs transformer-based autoregressive normalizing flows [73] to model these continuous representations. This approach unlocks substantial flexibility, enabling the construction of models that can capture global bi-directional context through stacked, alternating-direction autoregressive transformations, support block-wise generation with flexible token patch sizes, and facilitate a hierarchical multi-pass generation process. We further propose new mixture-based coupling transformations designed to capture complex dependencies within the latent space shaped by discrete data, and demonstrate theoretical connections to conventional discrete autoregressive models. Extensive experiments on language modeling benchmarks demonstrate strong likelihood performance and highlight the flexible modeling capabilities inherent in our framework.


Diffusion Transformers for Imputation: Statistical Efficiency and Uncertainty Quantification

Neural Information Processing Systems

Imputation methods play a critical role in enhancing the quality of practical time-series data, which often suffer from pervasive missing values. Recently, diffusion-based generative imputation methods have demonstrated remarkable success compared to autoregressive and conventional statistical approaches. Despite their empirical success, the theoretical understanding of how well diffusion-based models capture complex spatial and temporal dependencies between the missing values and observed ones remains limited.





Robust low-rank training via approximate orthonormal constraints

Neural Information Processing Systems

By modeling robustness in terms of the condition number of the neural network, we argue that this loss of robustness is due to the exploding singular values of the low-rank weight matrices.
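As a rough illustration of the quantity this abstract refers to, the sketch below computes the condition number (largest singular value divided by smallest) of a single 2x2 weight matrix in closed form. It is not the paper's method: the function names are hypothetical, and using one small matrix as a stand-in for a network layer is an assumption made only to show why orthonormal factors are well conditioned while widely spread singular values are not.

```python
import math

def singular_values_2x2(a, b, c, d):
    """Singular values of [[a, b], [c, d]], largest first.

    Uses the eigenvalues of M M^T, which are the squared singular values.
    """
    s1 = a * a + b * b + c * c + d * d
    s2 = math.sqrt((a * a + b * b - c * c - d * d) ** 2 + 4.0 * (a * c + b * d) ** 2)
    sigma_max = math.sqrt((s1 + s2) / 2.0)
    # Clamp at zero to guard against tiny negative values from rounding.
    sigma_min = math.sqrt(max(s1 - s2, 0.0) / 2.0)
    return sigma_max, sigma_min

def condition_number_2x2(a, b, c, d):
    """Condition number sigma_max / sigma_min (infinite for a singular matrix)."""
    sigma_max, sigma_min = singular_values_2x2(a, b, c, d)
    return math.inf if sigma_min == 0.0 else sigma_max / sigma_min

# An orthonormal (rotation) matrix: both singular values equal 1.
t = 0.7
kappa_orthonormal = condition_number_2x2(math.cos(t), -math.sin(t), math.sin(t), math.cos(t))

# A matrix whose singular values have drifted far apart (10 and 0.1).
kappa_spread = condition_number_2x2(10.0, 0.0, 0.0, 0.1)
```

Under these assumptions, the rotation matrix has a condition number of about 1, while the matrix with spread singular values has a condition number of about 100, which is the kind of growth the abstract attributes to exploding singular values.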



Plug-in Estimation in High-Dimensional Linear Inverse Problems: A Rigorous Analysis

Neural Information Processing Systems

Estimating a vector x from noisy linear measurements Ax + w often requires use of prior knowledge or structural constraints on x for accurate reconstruction. Several recent works have considered combining linear least-squares estimation with a generic or "plug-in" denoiser function that can be designed in a modular manner based on the prior knowledge about x.
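The modular idea described above, a least-squares step combined with an interchangeable denoiser, can be sketched as follows. This is a generic proximal-gradient-style iteration chosen for brevity, not the algorithm analyzed in the paper. The function names, the soft-threshold denoiser (which assumes a sparse x), the step size, and the iteration count are all illustrative assumptions.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]

def soft_threshold(v, t):
    """Example denoiser for a sparsity prior: shrink each entry toward zero by t."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]

def plug_in_estimate(A, y, denoiser, step=0.5, iters=200):
    """Alternate a gradient step on ||y - A x||^2 / 2 with a plug-in denoiser.

    Assumption: step is small enough for the gradient step to be stable
    (below 2 divided by the squared largest singular value of A).
    """
    At = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        residual = [yi - ri for yi, ri in zip(y, matvec(A, x))]
        moved = [xi + step * gi for xi, gi in zip(x, matvec(At, residual))]
        x = denoiser(moved)  # any function mapping a vector to a vector
    return x

# Toy problem: identity measurements of a sparse vector, no noise.
A = [[1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.0]
x_plain = plug_in_estimate(A, y, lambda v: v)                        # no prior
x_sparse = plug_in_estimate(A, y, lambda v: soft_threshold(v, 0.1))  # sparse prior
```

Because the denoiser is passed in as an ordinary function, swapping the prior only means passing a different callable; the least-squares part of the loop is unchanged.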