Collaborating Authors

 Kahatapitiya, Kumara


MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

arXiv.org Artificial Intelligence

We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while the DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion denoising. MarDini's MAR enables video generation conditioned on any number of masked frames at any frame position: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within a few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.
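
To make the masking scheme concrete, here is a minimal sketch (in PyTorch) of how such task-defining frame masks and the asymmetric low-/high-resolution split could look. The function names, shapes, and the stand-in linear "planner" are hypothetical illustrations under the abstract's description, not MarDini's actual code.

```python
# Illustrative sketch of MarDini-style frame masking (not the authors' code).
# A single boolean mask over T frame slots selects which frames are generated,
# so one model covers interpolation, image-to-video, and expansion.
import torch

def make_mask(task: str, num_frames: int) -> torch.Tensor:
    """Return a boolean mask; True marks frames the model must generate."""
    mask = torch.zeros(num_frames, dtype=torch.bool)
    if task == "interpolation":        # first and last frames are given
        mask[1:-1] = True
    elif task == "image_to_video":     # only the first frame is given
        mask[1:] = True
    elif task == "expansion":          # first half given, second half generated
        mask[num_frames // 2:] = True
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

# Asymmetric two-model pipeline (sizes are hypothetical placeholders): a heavy
# planner runs on low-resolution frames; a light diffusion model would then
# denoise at high resolution, conditioned on the per-frame planning signals.
T, lo = 16, 32
frames_lo = torch.randn(T, 3, lo, lo)          # low-resolution inputs
mask = make_mask("interpolation", T)
frames_lo[mask] = 0.0                          # hide the frames to be generated

planner = torch.nn.Linear(3 * lo * lo, 128)    # stand-in for the MAR planner
plan = planner(frames_lo.flatten(1))           # one planning signal per frame
print(plan.shape, mask.sum().item(), "frames to generate")
```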


Object-Centric Diffusion for Efficient Video Editing

arXiv.org Artificial Intelligence

Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, or attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion (OCD) to further reduce latency by allocating computations more towards foreground edited regions, which are arguably more important for perceptual quality. We achieve this with two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient regions from those spent on the background, allocating most of the model capacity to the former, and ii) Object-Centric 3D Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10x for a comparable synthesis quality.
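
As a rough illustration of the token-merging idea, the sketch below averages background tokens into a few coarse tokens so that cross-frame attention runs on a much shorter sequence. The saliency mask, the naive chunk-and-average merge (the paper's 3D token merging is more sophisticated), and all names are assumptions for illustration only.

```python
# Illustrative sketch of object-centric token merging (not the OCD code):
# background tokens, marked by a hypothetical saliency mask, are averaged
# into a few coarse tokens, shrinking the attention sequence length.
import torch

def merge_background_tokens(tokens, fg_mask, num_bg_clusters=4):
    """tokens: (N, D); fg_mask: (N,) bool, True = foreground (kept verbatim)."""
    fg = tokens[fg_mask]                           # edited region: keep all tokens
    bg = tokens[~fg_mask]                          # background: compress heavily
    # Naive merge: chunk background tokens and average each chunk.
    chunks = bg.chunk(num_bg_clusters, dim=0)
    merged = torch.stack([c.mean(dim=0) for c in chunks if len(c) > 0])
    return torch.cat([fg, merged], dim=0)

tokens = torch.randn(1024, 64)                     # e.g. 32x32 patch tokens, D=64
fg_mask = torch.zeros(1024, dtype=torch.bool)
fg_mask[:128] = True                               # pretend 128 tokens are foreground
reduced = merge_background_tokens(tokens, fg_mask)
print(tokens.shape, "->", reduced.shape)           # attention now sees 132 tokens
```

Since attention cost grows quadratically with sequence length, even this crude compression of the 896 background tokens into 4 yields a large saving while leaving the foreground untouched.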


Token Turing Machines

arXiv.org Artificial Intelligence

Models for handling longer sequence lengths are themselves often not sufficient, since we do not want to run our entire transformer model for each time step when a new observation (e.g., a new frame) is provided. This necessitates developing models with explicit memories, enabling a model to fuse relevant past history with the current observation to make a prediction at the current time step. Another desideratum for such models, to scale to long sequence lengths, is that the computational cost at each time step should be constant, regardless of the length of the previous history. In this paper, we propose Token Turing Machines (TTMs), a sequential, auto-regressive model with external memory and constant computational time complexity at each step. Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history (i.e., frames). This memory is efficiently addressed, read and written using a Transformer as the processing unit/controller at each step. The model's memory module ensures that a new observation will only be processed with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step. We show that TTM outperforms other alternatives, such as other Transformer models designed for long sequences and recurrent neural networks, on two real-world sequential visual understanding tasks.
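
The per-step loop can be sketched as follows, assuming a TokenLearner-style soft summarization for the read and write operations. The module names, token counts, and the single Transformer layer used as the controller are illustrative stand-ins rather than the paper's implementation.

```python
# Minimal sketch of a Token Turing Machine-style step (illustrative only).
# Memory is a fixed set of m tokens; each step reads a bounded number of
# tokens from [memory ++ new observation], processes them with a Transformer
# controller, and writes a fixed-size summary back as the new memory, so
# per-step compute is constant regardless of how long the history is.
import torch
import torch.nn as nn

class TTMStep(nn.Module):
    def __init__(self, dim=64, mem_tokens=16, read_tokens=8):
        super().__init__()
        self.read = nn.Linear(dim, read_tokens)    # read summarization weights
        self.write = nn.Linear(dim, mem_tokens)    # write summarization weights
        self.process = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def summarize(self, x, proj):
        # TokenLearner-style pooling: soft weights over the input tokens.
        w = proj(x).softmax(dim=1)                 # (B, N, k)
        return torch.einsum("bnk,bnd->bkd", w, x)  # (B, k, d)

    def forward(self, memory, obs):
        all_tokens = torch.cat([memory, obs], dim=1)       # (B, m+n, d)
        read = self.summarize(all_tokens, self.read)       # bounded read set
        out = self.process(read)                           # controller step
        new_memory = self.summarize(
            torch.cat([memory, obs, out], dim=1), self.write)
        return new_memory, out

step = TTMStep()
memory = torch.zeros(1, 16, 64)                            # initial empty memory
for _ in range(5):                                         # stream of observations
    obs = torch.randn(1, 32, 64)                           # e.g. tokens of a frame
    memory, out = step(memory, obs)
print(memory.shape, out.shape)                             # fixed sizes every step
```

Because the read set and the memory have fixed sizes, each step touches at most m + n tokens no matter how many frames have been seen, which is exactly the constant per-step cost the abstract argues for.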