Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior
Fuqun Han, Stanley Osher, Wuchen Li
Modern generative models, such as neural ordinary differential equations (neural ODEs) [4], transformers [25], and diffusion models [22], have demonstrated a remarkable ability to learn and generate samples from complex, high-dimensional probability distributions. These architectures have achieved broad success in scientific computing, image processing, and data science, offering scalable frameworks for data-driven modeling. However, training and sampling in such high-dimensional settings remain expensive and highly sensitive to architectural and optimization choices, and the curse of dimensionality continues to present a fundamental challenge in many real-world applications. Fortunately, numerous problems in scientific computing exhibit intrinsic structure, such as sparsity, low-rank representations, or approximate invariances, that can be interpreted as prior information about the underlying data or operators. Leveraging such priors within generative models offers a promising avenue to improve both computational efficiency and generalization. A classical way to incorporate prior information, such as sparsity or piecewise regularity, is through Bayesian modeling, where the posterior combines a prior distribution encoding structural knowledge with a likelihood function derived from observations.
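To make the last point concrete, one standard sparsity-promoting instance of this Bayesian construction (a generic illustration; the notation $A$, $y$, $\sigma$, $\lambda$ is chosen here and not taken from the paper) combines a Gaussian likelihood with a Laplace, i.e. $L_1$, prior:
$$
\pi(x \mid y) \;\propto\; \exp\!\Big(-\tfrac{1}{2\sigma^2}\|Ax - y\|_2^2\Big)\,\exp\!\big(-\lambda \|x\|_1\big),
$$
so that the log-posterior consists of a smooth data-fidelity term plus the nonsmooth penalty $\lambda\|x\|_1$. Maximizing it recovers classical $L_1$-regularized (lasso-type) estimation, while sampling from $\pi$ must handle the nonsmooth $L_1$ term, which is the kind of structure addressed by proximal-type operators such as the regularized Wasserstein proximal operator with $L_1$ prior named in the title.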
Oct-21-2025