Xie, Mingyang
Can Hallucination Correction Improve Video-Language Alignment?
Zhao, Lingjun, Xie, Mingyang, Cascante-Bonilla, Paola, Daumé, Hal III, Lee, Kwonjoon
Large Vision-Language Models often generate hallucinated content that is not grounded in its visual inputs. While prior work focuses on mitigating hallucinations, we instead explore leveraging hallucination correction as a training objective to improve video-language alignment. We introduce HACA, a self-training framework learning to correct hallucinations in descriptions that do not align with the video content. By identifying and correcting inconsistencies, HACA enhances the model's ability to align video and textual representations for spatio-temporal reasoning. Our experimental results show consistent gains in video-caption binding and text-to-video retrieval tasks, demonstrating that hallucination correction-inspired tasks serve as an effective strategy for improving vision and language alignment.
Flash-Split: 2D Reflection Removal with Flash Cues and Latent Diffusion Separation
Wang, Tianfu, Xie, Mingyang, Cai, Haoming, Shah, Sachin, Metzler, Christopher A.
Transparent surfaces, such as glass, create complex reflections that obscure images and challenge downstream computer vision applications. We introduce Flash-Split, a robust framework for separating transmitted and reflected light using a single (potentially misaligned) pair of flash/no-flash images. Our core idea is to perform latent-space reflection separation while leveraging the flash cues. Specifically, Flash-Split consists of two stages. Stage 1 separates apart the reflection latent and transmission latent via a dual-branch diffusion model conditioned on an encoded flash/no-flash latent pair, effectively mitigating the flash/no-flash misalignment issue. Stage 2 restores high-resolution, faithful details to the separated latents, via a cross-latent decoding process conditioned on the original images before separation. By validating Flash-Split on challenging real-world scenes, we demonstrate state-of-the-art reflection separation performance and significantly outperform the baseline methods.
Flash-Splat: 3D Reflection Removal with Flash Cues and Gaussian Splats
Xie, Mingyang, Cai, Haoming, Shah, Sachin, Xu, Yiran, Feng, Brandon Y., Huang, Jia-Bin, Metzler, Christopher A.
We introduce a simple yet effective approach for separating transmitted and reflected light. Our key insight is that the powerful novel view synthesis capabilities provided by modern inverse rendering methods (e.g.,~3D Gaussian splatting) allow one to perform flash/no-flash reflection separation using unpaired measurements -- this relaxation dramatically simplifies image acquisition over conventional paired flash/no-flash reflection separation methods. Through extensive real-world experiments, we demonstrate our method, Flash-Splat, accurately reconstructs both transmitted and reflected scenes in 3D. Our method outperforms existing 3D reflection separation methods, which do not leverage illumination control, by a large margin. Our project webpage is at https://flash-splat.github.io/.
TurbuGAN: An Adversarial Learning Approach to Spatially-Varying Multiframe Blind Deconvolution with Applications to Imaging Through Turbulence
Feng, Brandon Yushan, Xie, Mingyang, Metzler, Christopher A.
We present a self-supervised and self-calibrating multi-shot approach to imaging through atmospheric turbulence, called TurbuGAN. Our approach requires no paired training data, adapts itself to the distribution of the turbulence, leverages domain-specific data priors, and can generalize from tens to thousands of measurements. We achieve such functionality through an adversarial sensing framework adapted from CryoGAN, which uses a discriminator network to match the distributions of captured and simulated measurements. Our framework builds on CryoGAN by (1) generalizing the forward measurement model to incorporate physically accurate and computationally efficient models for light propagation through anisoplanatic turbulence, (2) enabling adaptation to slightly misspecified forward models, and (3) leveraging domain-specific prior knowledge using pretrained generative networks, when available. We validate TurbuGAN on both computationally simulated and experimentally captured images distorted with anisoplanatic turbulence.
PROVES: Establishing Image Provenance using Semantic Signatures
Xie, Mingyang, Kulshrestha, Manav, Wang, Shaojie, Yang, Jinghan, Chakrabarti, Ayan, Zhang, Ning, Vorobeychik, Yevgeniy
Modern AI tools, such as generative adversarial networks, have transformed our ability to create and modify visual data with photorealistic results. However, one of the deleterious side-effects of these advances is the emergence of nefarious uses in manipulating information in visual data, such as through the use of deep fakes. We propose a novel architecture for preserving the provenance of semantic information in images to make them less susceptible to deep fake attacks. Our architecture includes semantic signing and verification steps. We apply this architecture to verifying two types of semantic information: individual identities (faces) and whether the photo was taken indoors or outdoors. Verification accounts for a collection of common image transformation, such as translation, scaling, cropping, and small rotations, and rejects adversarial transformations, such as adversarially perturbed or, in the case of face verification, swapped faces. Experiments demonstrate that in the case of provenance of faces in an image, our approach is robust to black-box adversarial transformations (which are rejected) as well as benign transformations (which are accepted), with few false negatives and false positives. Background verification, on the other hand, is susceptible to black-box adversarial examples, but becomes significantly more robust after adversarial training.