On the Generalization Limits of Quantum Generative Adversarial Networks with Pure State Generators
Frkatovic, Jasmin, Malemath, Akash, Kankeu, Ivan, Werner, Yannick, Tschöpe, Matthias, Rey, Vitor Fortes, Suh, Sungho, Lukowicz, Paul, Palaiodimopoulos, Nikolaos, Kiefer-Emmanouilidis, Maximilian
Over the past decade, advancements in model architectures, the availability of larger datasets, and improvements in hardware--among other factors--have significantly enhanced the capabilities of generative machine learning models [1-3]. At the same time, ongoing progress toward scalable quantum hardware has sparked growing interest in the development of quantum machine learning (QML) algorithms [4, 5], which aim to leverage quantum properties--such as superposition and entanglement--to enhance the efficiency and expressivity of classical machine learning approaches. Although large-scale fault-tolerant quantum hardware is not yet realizable, many QML algorithms are specifically designed to operate within the constraints of the noisy intermediate-scale quantum (NISQ) era [6-8]. In image generation tasks, several classical deep learning architectures have demonstrated notable effectiveness. Variational Autoencoders (VAEs) are particularly useful for tasks like image denoising [9] and anomaly detection [10] due to their structured latent spaces.
Thinking with Generated Images
Chern, Ethan, Hu, Zhulin, Chern, Steffi, Kou, Siqi, Su, Jiadi, Ma, Yan, Deng, Zhijie, Liu, Pengfei
We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to 50% (from 38% to 57%) relative improvement in handling complex multi-object scenarios. From biochemists exploring novel protein structures, and architects iterating on spatial designs, to forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.
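The self-critique mechanism described above can be sketched as a simple control loop: generate a visual hypothesis, critique it textually, and regenerate with the critique folded in. The functions below are toy stand-ins of our own devising (not the paper's API); the "quality" scoring is a placeholder for the model's actual multimodal judgment.

```python
from typing import Optional

def generate_image(prompt: str) -> dict:
    # Stand-in generator: "quality" grows with the amount of critique
    # feedback folded back into the prompt.
    return {"prompt": prompt, "quality": prompt.count("[fix]")}

def critique_image(image: dict, target_quality: int) -> Optional[str]:
    # Stand-in textual critique: None means the visual hypothesis is accepted.
    if image["quality"] >= target_quality:
        return None
    return "objects overlap; separate them"

def refine_prompt(prompt: str, critique: str) -> str:
    # Fold the textual critique into the next generation step.
    return f"{prompt} [fix] {critique}"

def think_with_generated_images(prompt: str,
                                target_quality: int = 2,
                                max_steps: int = 5) -> dict:
    image = generate_image(prompt)
    for _ in range(max_steps):
        critique = critique_image(image, target_quality)
        if critique is None:
            break
        prompt = refine_prompt(prompt, critique)
        image = generate_image(prompt)
    return image

result = think_with_generated_images("a cat under a table")
```

The loop terminates either when the critique step accepts the hypothesis or when the step budget is exhausted, mirroring the iterative refinement the abstract describes.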
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess models' fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language bias and holistic image gist are insufficient. We benchmark state-of-the-art models on JourneyBench and analyze performance along a number of fine-grained dimensions. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models, indicating that models' visual reasoning abilities are not as strong as they first appear. We discuss the implications of our findings and propose avenues for further research.
Efficient Differentially Private Fine-Tuning of Diffusion Models
Liu, Jing, Lowy, Andrew, Koike-Akino, Toshiaki, Parsons, Kieran, Wang, Ye
The recent developments of Diffusion Models (DMs) enable generation of astonishingly high-quality synthetic samples. Recent work showed that the synthetic samples generated by the diffusion model, which is pre-trained on public data and fully fine-tuned with differential privacy on private data, can train a downstream classifier, while achieving a good privacy-utility tradeoff. However, fully fine-tuning such large diffusion models with DP-SGD can be very resource-demanding in terms of memory usage and computation. In this work, we investigate Parameter-Efficient Fine-Tuning (PEFT) of diffusion models using Low-Dimensional Adaptation (LoDA) with Differential Privacy. We evaluate the proposed method with the MNIST and CIFAR-10 datasets and demonstrate that such efficient fine-tuning can also generate useful synthetic samples for training downstream classifiers, with guaranteed privacy protection of fine-tuning data. Our source code will be made available on GitHub.
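The core mechanics of DP parameter-efficient fine-tuning can be sketched in a few lines: freeze the pre-trained weight, train only a low-rank adapter, and apply DP-SGD (per-example gradient clipping plus calibrated Gaussian noise) to the adapter parameters. The toy model, dimensions, clip norm, and noise multiplier below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 4, 2            # input dim, output dim, adapter rank
W = rng.normal(size=(d, k))  # frozen pre-trained weight
A = np.zeros((d, r))         # low-rank adapter factors: the only
B = rng.normal(size=(r, k)) * 0.01  # parameters that are fine-tuned

X = rng.normal(size=(32, d))                    # "private" fine-tuning data
Y = X @ (W + rng.normal(size=(d, k)) * 0.1)

clip_norm, noise_mult, lr = 1.0, 1.0, 0.05

for _ in range(50):
    grads_A, grads_B = [], []
    for x, y in zip(X, Y):
        err = x @ (W + A @ B) - y        # per-example residual (squared loss)
        gA = np.outer(x, err) @ B.T      # dL/dA
        gB = A.T @ np.outer(x, err)      # dL/dB
        g = np.concatenate([gA.ravel(), gB.ravel()])
        g *= min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))  # clip to C
        grads_A.append(g[: d * r].reshape(d, r))
        grads_B.append(g[d * r:].reshape(r, k))
    # DP-SGD step: average clipped per-example gradients, then add
    # Gaussian noise with std sigma * C, scaled by the batch size.
    noise = noise_mult * clip_norm / len(X)
    A -= lr * (np.mean(grads_A, axis=0) + rng.normal(size=A.shape) * noise)
    B -= lr * (np.mean(grads_B, axis=0) + rng.normal(size=B.shape) * noise)
```

In practice one would use an accounting library (e.g. Opacus in PyTorch) to track the (epsilon, delta) budget rather than hand-rolling the mechanism; the sketch only shows why PEFT helps: the clipped-and-noised gradient vector covers only the small adapter, not the full model.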
Erasing Concepts from Text-to-Image Diffusion Models with Few-shot Unlearning
Fuchi, Masane, Takagi, Tomohiro
Generating images from text has become easier because of the scaling of diffusion models and advancements in the field of vision and language. These models are trained on vast amounts of data from the Internet, so they often contain undesirable content such as copyrighted material. As it is challenging to remove such data and retrain the models, methods for erasing specific concepts from pre-trained models have been investigated. We propose a novel concept-erasure method that updates the text encoder using few-shot unlearning with only a few real images. Discussion of what the model generates after a concept is erased has been lacking: while some methods specify a transition destination for an erased concept, the validity of the specified destination is unclear. Our method handles this implicitly by transitioning to the latent concepts inherent in the model or the images, and it can erase a concept within 10 seconds, making concept erasure more accessible than ever before. Implicitly transitioning to related concepts also leads to more natural concept erasure. We applied the proposed method to various concepts and confirmed that concept erasure can be achieved tens to hundreds of times faster than with current methods. By varying the parameters to be updated, we obtained results suggesting that, as in previous research, knowledge is primarily accumulated in the feed-forward networks of the text encoder.
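The few-shot unlearning idea can be illustrated on a drastically simplified stand-in for the text encoder: a single embedding table. For a handful of image features depicting the target concept, we step the concept token's embedding *down* the cosine similarity toward those features, so prompts for the concept no longer point at it. The encoder, the "image features," and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16
# Toy "text encoder": one embedding per token.
text_embeddings = {"cat": rng.normal(size=dim), "dog": rng.normal(size=dim)}
concept = "cat"
# Features of a few real images of the concept (few-shot set).
few_shot_feats = [text_embeddings["cat"] + rng.normal(size=dim) * 0.1
                  for _ in range(4)]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

before = np.mean([cos(text_embeddings[concept], f) for f in few_shot_feats])

lr = 0.5
e = text_embeddings[concept]
for _ in range(100):
    for f in few_shot_feats:
        # Gradient of cos(e, f) w.r.t. e; step against it to unlearn.
        c = cos(e, f)
        grad = f / (np.linalg.norm(e) * np.linalg.norm(f)) - c * e / (e @ e)
        e = e - lr * grad
text_embeddings[concept] = e

after = np.mean([cos(e, f) for f in few_shot_feats])
```

After the update the concept embedding has drifted away from the few-shot image features (`after < before`), while every other entry in the table is untouched, which is the appeal of editing only the text encoder.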
Annotated Hands for Generative Models
Yang, Yue, Gandhi, Atith N, Turk, Greg
Generative models such as GANs and diffusion models have demonstrated impressive image generation capabilities. Despite these successes, these systems are surprisingly poor at creating images with hands. We propose a novel training framework for generative models that substantially improves the ability of such systems to create hand images. Our approach is to augment the training images with three additional channels that provide annotations for the hands in the image. These annotations provide additional structure that coaxes the generative model to produce higher-quality hand images. We demonstrate this approach on two different generative models, a generative adversarial network and a diffusion model, evaluating both on a new synthetic dataset of hand images and on real photographs that contain hands. We measure the improved quality of the generated hands through higher confidence in finger joint identification using an off-the-shelf hand detector.
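The data augmentation described above amounts to stacking three extra annotation planes onto each RGB training image. The specific channel semantics in this sketch (binary hand mask, joint heatmap, skeleton map) are our assumption about what such annotations might look like, not the paper's exact definition.

```python
import numpy as np

def annotate_hands(rgb: np.ndarray,
                   hand_mask: np.ndarray,
                   joint_heatmap: np.ndarray,
                   skeleton_map: np.ndarray) -> np.ndarray:
    """Stack an HxWx3 image with three HxW annotation channels -> HxWx6."""
    assert rgb.ndim == 3 and rgb.shape[2] == 3
    extra = np.stack([hand_mask, joint_heatmap, skeleton_map], axis=-1)
    assert extra.shape[:2] == rgb.shape[:2]
    return np.concatenate([rgb, extra.astype(rgb.dtype)], axis=-1)

h, w = 64, 64
img = np.zeros((h, w, 3), dtype=np.float32)
mask = np.zeros((h, w), dtype=np.float32)
mask[20:40, 20:40] = 1.0            # hand region
heat = np.zeros((h, w), dtype=np.float32)
heat[30, 30] = 1.0                  # a finger joint location
skel = np.zeros((h, w), dtype=np.float32)
augmented = annotate_hands(img, mask, heat, skel)
```

The generator and discriminator (or the diffusion U-Net) would then take 6-channel inputs during training; at sampling time, the annotation channels can be dropped or synthesized.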
Qualitative Failures of Image Generation Models and Their Application in Detecting Deepfakes
The ability of image and video generation models to create photorealistic images has reached unprecedented heights, making it difficult to distinguish between real and fake images in many cases. However, despite this progress, a gap remains between the quality of generated images and those found in the real world. To address this, we have reviewed a vast body of literature from both academic publications and social media to identify qualitative shortcomings in image generation models, which we have classified into five categories. By understanding these failures, we can identify areas where these models need improvement, as well as develop strategies for detecting deepfakes. The prevalence of deepfakes in today's society is a serious concern, and our findings can help mitigate their negative impact.