Image Processing


Looks Too Good To Be True: An Information-Theoretic Analysis of Hallucinations in Generative Restoration Models

Neural Information Processing Systems

The pursuit of high perceptual quality in image restoration has driven the development of revolutionary generative models, capable of producing results that are often visually indistinguishable from real data. However, as their perceptual quality continues to improve, these models also exhibit a growing tendency to generate hallucinations: realistic-looking details that do not exist in the ground-truth images. Hallucinations create uncertainty about the reliability of these models, raising major concerns about their practical application. This paper investigates this phenomenon through the lens of information theory, revealing a fundamental tradeoff between uncertainty and perception. We rigorously analyze the relationship between these two factors, proving that the global minimal uncertainty in generative models grows in tandem with perception. In particular, we define the inherent uncertainty of the restoration problem and show that attaining perfect perceptual quality entails at least twice this uncertainty. Additionally, we establish a relation between distortion, uncertainty, and perception, through which we prove that the aforementioned uncertainty-perception tradeoff induces the well-known perception-distortion tradeoff. We demonstrate our theoretical findings through experiments with super-resolution and inpainting algorithms.
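To make the abstract's headline claim concrete, the sketch below writes it in symbols. The notation (X for the ground truth, Y for the degraded measurement, X-hat for the restored output) and the choice of conditional entropy as the uncertainty measure are assumptions for illustration, not necessarily the paper's exact definitions.

```latex
% A plausible formalization of the abstract's claim (assumed notation,
% not verbatim from the paper): X = ground-truth image, Y = degraded
% observation, \hat{X} = restored output; uncertainty is measured by
% the conditional entropy H(. | Y).
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
  U_0 \triangleq H(X \mid Y) \qquad \text{(inherent uncertainty of the problem)}
\]
\[
  p_{\hat{X}} = p_X \;\Longrightarrow\; H(\hat{X} \mid Y) \;\ge\; 2\,U_0 ,
\]
i.e.\ a restoration $\hat{X}$ with perfect perceptual quality (its
distribution matches that of natural images) carries at least twice the
uncertainty inherent to the restoration problem itself.
\end{document}
```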


ParallelEdits: Efficient Multi-Aspect Text-Driven Image Editing with Attention Grouping

Neural Information Processing Systems

Multiple edits in a single image pose a significant challenge for existing models (such as DirectInversion [1] and InfEdit [2]), whose performance degrades as the number of edited aspects increases. In contrast, our ParallelEdits achieves precise multi-aspect image editing in 5 seconds. (In the accompanying figure, distinct symbols denote swap, object-addition, and object-deletion actions.)


TFG: Unified Training-Free Guidance for Diffusion Models

Neural Information Processing Systems

Given an unconditional diffusion model and a predictor for a target property of interest (e.g., a classifier), the goal of training-free guidance is to generate samples with desirable target properties without additional training.
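The sketch below illustrates the general recipe behind training-free guidance: take an unconditional denoising step, then nudge the iterate along the gradient of a differentiable property predictor. The toy denoiser, the predictor, and the guidance_scale parameter are hypothetical placeholders for illustration, not the TFG algorithm or its API.

```python
# Minimal sketch of training-free guidance for a diffusion sampler.
# The toy denoiser, property predictor, and guidance_scale below are
# hypothetical stand-ins, not the paper's actual components.
import torch

def toy_denoiser(x_t: torch.Tensor, t: int) -> torch.Tensor:
    """Hypothetical unconditional denoiser: returns a cleaner x_{t-1}."""
    return 0.95 * x_t  # placeholder dynamics for illustration only

def property_log_prob(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical differentiable predictor: log p(target property | x).
    Here the 'property' is that the sample mean is close to 1.0."""
    return -((x.mean() - 1.0) ** 2)

def guided_step(x_t: torch.Tensor, t: int, guidance_scale: float = 5.0) -> torch.Tensor:
    # 1) Unconditional update from the pretrained diffusion model.
    x_t = x_t.detach().requires_grad_(True)
    x_prev = toy_denoiser(x_t, t)
    # 2) Steer with the gradient of the predictor's log-probability --
    #    no additional training of the diffusion model or predictor.
    grad = torch.autograd.grad(property_log_prob(x_prev), x_t)[0]
    return (x_prev + guidance_scale * grad).detach()

x = torch.randn(4, 8)  # start from Gaussian noise
for t in reversed(range(50)):
    x = guided_step(x, t)
print(f"final mean (guided toward 1.0): {x.mean().item():.3f}")
```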


Learning Loss for Test-Time Augmentation

Neural Information Processing Systems

Data augmentation has been actively studied for building robust neural networks. Most recent data augmentation methods focus on augmenting datasets during the training phase, while at test time simple transformations are still widely used for test-time augmentation. This paper proposes a novel instance-level test-time augmentation method that efficiently selects suitable transformations for a given test input. The proposed method uses an auxiliary module to predict the loss of each candidate transformation for the input; the transformations with the lowest predicted losses are then applied.
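A minimal sketch of the selection mechanism described above, assuming a hypothetical classifier `model` and auxiliary `loss_predictor`; the candidate transform list and top-k logit averaging are illustrative choices, not the paper's exact setup.

```python
# Sketch of loss-predicted test-time augmentation. `model` and
# `loss_predictor` are toy stand-ins for the paper's networks.
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))          # toy classifier
loss_predictor = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))  # toy auxiliary module

# Candidate test-time transformations.
candidates = [
    lambda x: x,                             # identity
    lambda x: TF.hflip(x),                   # horizontal flip
    lambda x: TF.adjust_brightness(x, 1.2),  # brighten
    lambda x: TF.rotate(x, 10),              # small rotation
]

def tta_predict(x: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Score every candidate transform with the loss predictor, keep the
    k transforms with the lowest predicted loss, and average the logits."""
    views = torch.stack([t(x) for t in candidates])           # (T, C, H, W)
    with torch.no_grad():
        predicted_loss = loss_predictor(views).squeeze(-1)    # (T,)
        keep = predicted_loss.topk(k, largest=False).indices  # lowest-loss views
        logits = model(views[keep])                           # (k, num_classes)
    return logits.mean(dim=0)

print(tta_predict(torch.rand(3, 32, 32)).shape)  # torch.Size([10])
```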


Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Neural Information Processing Systems

The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length, which is fixed at pretraining. The problem is especially prominent in the multimodal domain, where both text and images must be processed, requiring additional tokens.


OneActor: Consistent Subject Generation via Cluster-Conditioned Guidance

Neural Information Processing Systems

Text-to-image diffusion models provide artists with high-quality image generation, yet their stochastic nature makes it difficult to create consistent images of the same subject. Existing methods attempt to tackle this challenge in various ways, but they either depend on external, restricted data or require expensive tuning of the diffusion model. To address this issue, we propose a novel one-shot tuning paradigm, termed OneActor.


Spatio-Temporal Interactive Learning for Efficient Image Reconstruction of Spiking Cameras

Neural Information Processing Systems

The spiking camera is an emerging neuromorphic vision sensor that records high-speed motion scenes by asynchronously firing continuous binary spike streams. Prevailing image reconstruction methods, which generate intermediate frames from these spike streams, often rely on complex step-by-step network architectures that overlook the intrinsic collaboration of spatio-temporal complementary information. In this paper, we propose an efficient spatio-temporal interactive reconstruction network that jointly performs inter-frame feature alignment and intra-frame feature filtering in a coarse-to-fine manner. Specifically, it starts by extracting hierarchical features from a concise hybrid spike representation, then refines the motion fields and target frames scale by scale, ultimately producing a full-resolution output. Meanwhile, we introduce a symmetric interactive attention block and a multi-motion field estimation block to further enhance the interaction capability of the overall network. Experiments on synthetic and real-captured data show that our approach achieves excellent performance while maintaining low model complexity. The code is available at https://github.com/GitCVfb/STIR.
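As a rough illustration of the coarse-to-fine idea, the schematic sketch below refines a frame estimate scale by scale from a binary spike stream. All module names, channel sizes, and the firing-rate initialization are assumptions made for illustration; see the linked repository for the actual STIR architecture.

```python
# Schematic sketch of a coarse-to-fine spike-stream reconstruction loop.
# Shapes and modules are hypothetical, not the STIR network itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineReconstructor(nn.Module):
    def __init__(self, in_ch: int = 32, feat_ch: int = 16, scales: int = 3):
        super().__init__()
        self.scales = scales
        self.encoder = nn.Conv2d(in_ch, feat_ch, 3, padding=1)  # spike repr. -> features
        self.refiner = nn.Conv2d(feat_ch + 1, 1, 3, padding=1)  # per-scale frame refinement

    def forward(self, spikes: torch.Tensor) -> torch.Tensor:
        # spikes: (B, T, H, W) binary spike stream; T time bins act as channels.
        feat = self.encoder(spikes.float())
        # Coarse initial estimate: mean firing rate, downsampled to the lowest scale.
        frame = spikes.float().mean(dim=1, keepdim=True)
        frame = F.interpolate(frame, scale_factor=1 / 2 ** (self.scales - 1))
        for s in reversed(range(self.scales)):          # coarse -> fine
            f_s = F.interpolate(feat, size=frame.shape[-2:])
            frame = frame + self.refiner(torch.cat([f_s, frame], dim=1))
            if s > 0:                                   # upsample for the next, finer scale
                frame = F.interpolate(frame, scale_factor=2)
        return frame                                    # (B, 1, H, W) full-resolution frame

out = CoarseToFineReconstructor()(torch.randint(0, 2, (2, 32, 64, 64)))
print(out.shape)  # torch.Size([2, 1, 64, 64])
```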