Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

May-27-2025, 08:12:44 GMT–Neural Information Processing Systems

We address the problem of multi-object 3D pose control in image diffusion models. Instead of conditioning on a sequence of text tokens, we propose to use a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene. Neural Assets are obtained by pooling visual representations of objects from a reference image, such as a frame in a video, and are trained to reconstruct the respective objects in a different image, e.g., a later frame in the video. Importantly, we encode object visuals from the reference image while conditioning on object poses from the target frame, which enables learning disentangled appearance and position features. Combining visual and 3D pose representations in a sequence-of-tokens format allows us to keep the text-to-image interface of existing models, with Neural Assets in place of text tokens.

3d-aware multi-object scene synthesis, image diffusion model, neural asset, (6 more...)

Neural Information Processing Systems

May-27-2025, 08:12:44 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.99)