 frame interpolation


Surface Aware Feed Forward Quadratic Gaussian for Frame Interpolation with Large Motion

Neural Information Processing Systems

Large motion poses a critical challenge in the Video Frame Interpolation (VFI) task, as it requires accurate modeling of object correspondences across frames. Existing methods primarily rely on convolutional or attention-based models, which operate at the pixel or patch level. This inherently limits them to local object correspondences, making it difficult to capture frame-level object correspondences and often leading to failure under large motion. Inspired by the fundamental theorem of surfaces, we explore frame-level object correspondences through the lens of differential surface. The core idea is to represent video frames as 3D surfaces and align them by matching their surface properties, thereby achieving global surface alignment and frame-level object alignment.
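As a rough illustration of what "surface properties" of a frame can mean, the sketch below treats a grayscale frame as the height surface z = I(x, y) and computes its Gaussian and mean curvature from the first and second fundamental forms, the quantities that the fundamental theorem of surfaces says determine a surface up to rigid motion. This is a textbook construction and not the method of the paper; the function name `surface_properties`, the height-map representation, and the finite-difference derivatives are all assumptions made for illustration.

```python
import numpy as np

def surface_properties(frame):
    """Hypothetical helper: Gaussian (K) and mean (H) curvature of z = I(x, y)."""
    I = np.asarray(frame, dtype=np.float64)
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (columns, x).
    Iy, Ix = np.gradient(I)
    Ixy, Ixx = np.gradient(Ix)
    Iyy, _ = np.gradient(Iy)

    # First fundamental form coefficients of the height surface.
    E = 1.0 + Ix**2
    F = Ix * Iy
    G = 1.0 + Iy**2

    # Second fundamental form coefficients, using the upward-pointing normal.
    w = np.sqrt(1.0 + Ix**2 + Iy**2)
    L = Ixx / w
    M = Ixy / w
    N = Iyy / w

    det = E * G - F**2  # equals 1 + Ix^2 + Iy^2, so it is always positive
    K = (L * N - M**2) / det
    H = (E * N - 2.0 * F * M + G * L) / (2.0 * det)
    return K, H
```

Under these assumptions, two frames could be compared by their curvature maps rather than by raw intensities; how the paper actually matches surfaces is not specified in the abstract.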


OmniZoom: A Universal Plug-and-Play Paradigm for Cross-Device Smooth Zoom Interpolation

Neural Information Processing Systems

Dual-camera smartphones suffer from geometric and photometric inconsistencies during zoom transitions, primarily due to disparities in intrinsic/extrinsic parameters and divergent image processing pipelines between the two cameras. Existing interpolation methods struggle to effectively address this issue, constrained by the lack of ground-truth datasets and motion ambiguity in dynamic scenarios. To overcome these challenges, we propose OmniZoom, a universal plug-and-play paradigm for cross-device smooth zoom interpolation. Specifically, we present a novel cross-device virtual data generation method utilizing 3D Gaussian Splatting. This method tackles data scarcity by decoupling geometric features via spatial transition modeling and correcting photometric variations with dynamic color adaptation. It is further enhanced by cross-domain consistency learning for device-agnostic semantic alignment.


EPA: Boosting Event-based Video Frame Interpolation with Perceptually Aligned Learning

Neural Information Processing Systems

Event cameras, with their capacity to provide high temporal resolution information between frames, are increasingly utilized for video frame interpolation (VFI) in challenging scenarios characterized by high-speed motion and significant occlusion. However, prevalent issues of blur and distortion within the keyframes and ground truth data used for training and inference in these demanding conditions are frequently overlooked. This oversight impedes the perceptual realism and multi-scene generalization capabilities of existing event-based VFI (E-VFI) methods when generating interpolated frames. Motivated by the observation that semantic-perceptual discrepancies between degraded and pristine images are considerably smaller than their image-level differences, we introduce EPA. This novel E-VFI framework diverges from approaches reliant on direct image-level supervision by constructing multilevel, degradation-insensitive semantic-perceptual supervisory signals to enhance the perceptual realism and multi-scene generalization of the model's predictions. Specifically, EPA operates in two phases: it first employs a DINO-based perceptual extractor, a customized style adapter, and a reconstruction generator to derive multi-layered, degradation-insensitive semantic-perceptual features (S).





Video Dynamics Prior: An Internal Learning Approach for Robust Video Enhancements

Neural Information Processing Systems

In this paper, we present a novel robust framework for low-level vision tasks, including denoising, object removal, frame interpolation, and super-resolution, that does not require any external training data corpus. Our proposed approach directly learns the weights of neural modules by optimizing over the corrupted test sequence, leveraging the spatio-temporal coherence and internal statistics of videos. Furthermore, we introduce a novel spatial pyramid loss that leverages the property of spatio-temporal patch recurrence in a video across the different scales of the video. This loss enhances robustness to unstructured noise in both the spatial and temporal domains. As a result, our framework is highly robust to degradation in the input frames and yields state-of-the-art results on downstream tasks such as denoising, object removal, and frame interpolation. To validate the effectiveness of our approach, we conduct qualitative and quantitative evaluations on standard video datasets such as DAVIS, UCF-101, and VIMEO90K-T.
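The abstract does not define the spatial pyramid loss, so the following is only a generic multi-scale stand-in: an L1 difference averaged over a small image pyramid built by 2x average pooling. The names `downsample2x` and `pyramid_l1_loss`, the choice of L1, and the pooling scheme are assumptions for illustration and should not be read as the paper's formulation.

```python
import numpy as np

def downsample2x(img):
    """Average-pool a 2D array by a factor of 2 (an odd trailing row/column is cropped)."""
    h = img.shape[0] // 2 * 2
    w = img.shape[1] // 2 * 2
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def pyramid_l1_loss(pred, target, levels=3):
    """Hypothetical multi-scale loss: mean absolute error averaged over `levels` scales.

    Assumes both inputs are 2D arrays of the same shape, with each side
    at least 2 ** (levels - 1) pixels so that no scale becomes empty.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    total = 0.0
    for _ in range(levels):
        total += np.abs(pred - target).mean()
        pred, target = downsample2x(pred), downsample2x(target)
    return total / levels
```

In this sketch, pixel-level noise that averages out under pooling contributes only at the finest scale, while a coarse structural mismatch is penalized at every scale, which is one intuition for why multi-scale losses can be less sensitive to unstructured noise.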