Coordinate In and Value Out: Training Flow Transformers in Ambient Space
Yuyang Wang, Anurag Ranjan, Josh Susskind, Miguel Angel Bautista
arXiv.org Artificial Intelligence
Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on unstructured data like 3D point clouds. These models are commonly trained in two stages: first, a data compressor (e.g., a variational auto-encoder) is trained, and in a subsequent training stage a flow matching generative model is trained in the low-dimensional latent space of the data compressor. This two-stage paradigm adds complexity to the overall training recipe and sets obstacles for unifying models across data domains, as different data modalities require different data compressors. To this end, we introduce Ambient Space Flow Transformers (ASFT), a domain-agnostic approach to learning flow matching transformers directly in ambient space, sidestepping the requirement of training compressors and simplifying the training process. We introduce a conditionally independent point-wise training objective that enables ASFT to make predictions continuously in coordinate space. Our empirical results demonstrate that, using general-purpose transformer blocks, ASFT effectively handles different data modalities such as images and 3D point clouds, achieving strong performance in both domains and outperforming comparable approaches. ASFT is a promising step towards domain-agnostic flow matching generative models that can be trivially adopted in different data domains.

Recent advances in generative modeling have enabled learning complex data distributions by combining powerful architectures with effective training objectives. In particular, state-of-the-art approaches for image (Esser et al., 2024), video (Dai et al., 2023) and 3D point cloud (Vahdat et al., 2022) generation are based on the concept of iteratively transforming data into Gaussian noise.
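The point-wise, coordinate-conditioned objective described above can be illustrated with a minimal sketch. This is an assumed interface, not the paper's actual implementation: a hypothetical `field_model` is queried at a subset of coordinates and supervised only at those points, treating the per-point predictions as conditionally independent.

```python
import torch

def pointwise_loss(field_model, x_t, t, coords, target_values):
    """Hypothetical point-wise training step (a sketch, not ASFT's exact code).

    field_model : callable (x_t, t, coords) -> values at the queried coords
    x_t         : the noisy sample at time t (the model's conditioning input)
    coords      : a random subset of query coordinates, shape (P, coord_dim)
    target_values : regression targets at those coordinates, shape (P, value_dim)

    Because each coordinate is supervised independently, the loss
    decomposes into a simple mean over queried points.
    """
    pred = field_model(x_t, t, coords)          # predict values at the queried coords
    return ((pred - target_values) ** 2).mean() # per-point MSE, averaged
```

Sampling only a subset of coordinates per step is what lets such a model operate in ambient space without a compressor: the memory cost scales with the number of queried points rather than the full resolution of the data.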
Diffusion models were originally proposed following this idea, pushing the quality of generated samples in many different domains, including images (Dai et al., 2023; Rombach et al., 2022), 3D point clouds (Luo & Hu, 2021), graphs (Hoogeboom et al., 2022) and video (Ho et al., 2022a). More recently, flow matching (Lipman et al., 2023) and stochastic interpolants (Ma et al., 2024) have been proposed as generalized formulations of the noising process, moving from stochastic Gaussian diffusion processes to general paths connecting a base (e.g.
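The flow matching objective referenced here can be sketched in a few lines. This is a generic conditional flow matching loss with a linear interpolation path, following the general formulation of Lipman et al. (2023), not the specific recipe of this paper; the `model` signature is an assumption for illustration.

```python
import torch

def flow_matching_loss(model, x1):
    """Conditional flow matching with a linear path (a minimal sketch).

    x1 : a batch of data samples. The model predicts the velocity
    field v(x_t, t). For the linear path x_t = (1 - t) * x0 + t * x1,
    connecting a Gaussian base x0 to the data x1, the target velocity
    is simply x1 - x0.
    """
    x0 = torch.randn_like(x1)                             # base (Gaussian) sample
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))  # per-sample time in [0, 1)
    xt = (1 - t) * x0 + t * x1                            # point on the interpolation path
    target = x1 - x0                                      # velocity of the linear path
    pred = model(xt, t)
    return ((pred - target) ** 2).mean()
```

Swapping the linear path for other interpolants recovers the broader family of noising processes the passage alludes to, which is why flow matching is described as a generalization of diffusion.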
Dec-4-2024