Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts
Esfandiari, Yasin, Bauer, Stefan, Stich, Sebastian U., Dittadi, Andrea
Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pre-trained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning, only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.

Diffusion models are a class of probabilistic generative models that learn to approximate a data distribution by reversing a forward noising process through a learned denoising procedure (Sohl-Dickstein et al., 2015; Ho et al., 2020; Nichol & Dhariwal, 2021).
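The switching rule itself is simple enough to sketch. Below is a minimal, hypothetical PyTorch implementation of ancestral DDPM sampling with two eps-prediction experts; the names (quality_expert, likelihood_expert, t_switch) are illustrative assumptions, and the paper's exact sampler and parameterization may differ.

```python
import torch

@torch.no_grad()
def merged_ddpm_sample(quality_expert, likelihood_expert, shape, alphas_cumprod,
                       t_switch, device="cpu"):
    """Ancestral DDPM sampling that switches experts at step t_switch.

    quality_expert / likelihood_expert: eps-prediction networks (x_t, t) -> eps.
    alphas_cumprod: 1-D tensor of cumulative alpha-bar products, length T.
    """
    T = alphas_cumprod.shape[0]
    x = torch.randn(shape, device=device)
    for t in reversed(range(T)):
        # High noise levels: shape global structure with the quality expert;
        # low noise levels: refine pixel statistics with the likelihood expert.
        model = quality_expert if t >= t_switch else likelihood_expert
        eps = model(x, torch.full((shape[0],), t, device=device, dtype=torch.long))

        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0, device=device)
        alpha_t = a_t / a_prev                       # per-step alpha
        beta_t = 1.0 - alpha_t

        # Standard DDPM posterior mean and variance.
        mean = (x - beta_t / torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(alpha_t)
        if t > 0:
            sigma = torch.sqrt(beta_t * (1.0 - a_prev) / (1.0 - a_t))
            x = mean + sigma * torch.randn_like(x)
        else:
            x = mean
    return x
```

The only new hyperparameter is t_switch; setting it to 0 or T recovers each expert's own sampler.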
A Derivations
In this appendix, we provide the derivations for i-DenseNets: the Lipschitz constant of the concatenation in DenseNets and a bound on the Lipschitz constant of the activation functions.

A.1 Derivation of Lipschitz constant K for the concatenation
We know that a function $f$ is $K$-Lipschitz if for all points $v$ and $w$ the following holds: $d_Y(f(v), f(w)) \le K \, d_X(v, w)$.

A.2 Derivation of the Lipschitz bound for the Concatenated ReLU
We define the function $\phi: \mathbb{R} \to \mathbb{R}^2$ as $\phi(x) = [\mathrm{ReLU}(x), \mathrm{ReLU}(-x)]$. We have four different situations that can happen, depending on the signs of the points $v$ and $w$.

Regarding parameter budgets: for CIFAR10, the full i-DenseNets utilize 24.9M parameters, compared to the 25.2M of Residual Flows; for ImageNet32, i-DenseNet utilizes 47.0M parameters, compared to the 47.1M of the Residual Flow.
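As a reconstruction of the concatenation argument in A.1 (a sketch assuming the Euclidean norm, not necessarily the paper's exact derivation): for $h(x) = [f_1(x), f_2(x)]$ with $K_1$- and $K_2$-Lipschitz components,
\[
\|h(v) - h(w)\|_2^2 = \|f_1(v) - f_1(w)\|_2^2 + \|f_2(v) - f_2(w)\|_2^2
\le \left(K_1^2 + K_2^2\right)\|v - w\|_2^2,
\]
so the concatenation is $\sqrt{K_1^2 + K_2^2}$-Lipschitz; for two 1-Lipschitz components this gives $K = \sqrt{2}$.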
OOD Detection with Immature Models
Montazeran, Behrooz, Köthe, Ullrich
Likelihood-based deep generative models (DGMs) have gained significant attention for their ability to approximate the distributions of high-dimensional data. However, these models lack a performance guarantee in assigning higher likelihood values to in-distribution (ID) inputs, the data the models are trained on, compared to out-of-distribution (OOD) inputs. This counter-intuitive behaviour is particularly pronounced when ID inputs are more complex than OOD data points. One potential approach to address this challenge involves leveraging the gradient of a data point with respect to the parameters of the DGM. A recent OOD detection framework proposed estimating the joint density of layer-wise gradient norms for a given data point as a model-agnostic method, demonstrating superior performance compared to the Typicality Test across likelihood-based DGMs and image dataset pairs. However, most existing methods presuppose access to fully converged models, the training of which is both time-intensive and computationally demanding. In this work, we demonstrate that using immature models, stopped at early stages of training, can mostly achieve equivalent or even superior results on this downstream task compared to mature models capable of generating high-quality samples that closely resemble ID data. This novel finding enhances our understanding of how DGMs learn the distribution of ID data and highlights the potential of leveraging partially trained models for downstream tasks. Furthermore, we offer a possible explanation for this unexpected behaviour through the concept of support overlap.
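To make the gradient-norm idea concrete, here is a minimal sketch of computing layer-wise gradient norms for a likelihood-based model; nll_fn and the density fit are illustrative assumptions, not the cited framework's exact procedure.

```python
import torch

def layerwise_grad_norms(model, nll_fn, x):
    """L2 norm of d(-log p(x))/d(theta) for each parameter tensor ("layer").

    model: a likelihood-based DGM; nll_fn(model, x) returns a scalar NLL.
    """
    model.zero_grad(set_to_none=True)
    nll = nll_fn(model, x)                      # scalar negative log-likelihood
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(nll, params)
    return torch.stack([g.norm() for g in grads])

# Score a test point by the density of its (log) gradient-norm vector under a
# simple joint model fitted on ID training data (e.g. a multivariate Gaussian);
# low density => flag as OOD.
```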
On How Iterative Magnitude Pruning Discovers Local Receptive Fields in Fully Connected Neural Networks
Redman, William T., Wang, Zhangyang, Ingrosso, Alessandro, Goldt, Sebastian
Iterative magnitude pruning (IMP) [1] has emerged as a powerful tool for identifying sparse subnetworks ("winning tickets") that can be trained to perform as well as the dense model they are extracted from [2, 3]. That IMP, despite its simplicity, is more robust in discovering such winning tickets than other, more complex pruning schemes [4] suggests that its iterative coarse-graining [5] is especially capable of extracting and maintaining strong inductive biases. This perspective is strengthened by observations that winning tickets discovered by IMP: 1) have properties that make them transferable across related tasks [6-13] and architectures [14]; 2) can outperform dense models on classes with limited data [15]; 3) have less overconfident predictions [16]. The first direct evidence for IMP discovering good inductive biases came from studying the winning tickets extracted by IMP in fully connected neural networks (FCNs) [17]. Pellegrini and Biroli (2022) [17] found that the sparse subnetworks identified by IMP had local receptive field (RF) structure (Figure 1A), an architectural feature found in visual cortex [18] and convolutional neural networks (CNNs) [19]. Comparing IMP-derived winning tickets with the sparse subnetworks found by one-shot pruning (Figure 1B), Pellegrini and Biroli (2022) [17] argued that the iterative nature of IMP was essential for refining the local RF structure. However, to date, how IMP, a pruning method based purely on the magnitude of the network parameters, is able to "sift out" non-localized weights has remained unknown. Resolving this will not only shed light on the effect of IMP on FCNs, but will also provide new insight into the success of IMP more broadly.
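For background, one common formulation of IMP with weight rewinding can be sketched as follows; train_fn, the global pruning threshold, and pruning all parameter tensors are simplifying assumptions rather than the exact protocol of the cited works.

```python
import copy
import torch

def iterative_magnitude_prune(model, train_fn, prune_frac=0.2, rounds=10):
    """IMP sketch: train, prune the smallest-magnitude surviving weights
    globally, rewind remaining weights to initialization, repeat."""
    init_state = copy.deepcopy(model.state_dict())       # theta_0 for rewinding
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, masks)                           # train with masks applied
        # Global magnitude threshold over surviving weights only.
        alive = torch.cat([p.abs().flatten()[masks[n].flatten() > 0]
                           for n, p in model.named_parameters()])
        thresh = alive.quantile(prune_frac)
        with torch.no_grad():
            for n, p in model.named_parameters():
                masks[n][(p.abs() < thresh) & (masks[n] > 0)] = 0.0
            model.load_state_dict(init_state)            # rewind to initialization
            for n, p in model.named_parameters():
                p.mul_(masks[n])                         # re-apply the mask
    return masks
```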
Self-Guided Diffusion Models
Hu, Vincent Tao, Zhang, David W, Asano, Yuki M., Burghouts, Gertjan J., Snoek, Cees G. M.
Diffusion models have demonstrated remarkable progress in image generation quality, especially when guidance is used to control the generative process. However, guidance requires a large amount of image-annotation pairs for training and is thus dependent on their availability, correctness, and unbiasedness. In this paper, we eliminate the need for such annotation by instead leveraging the flexibility of self-supervision signals to design a framework for self-guided diffusion models. By leveraging a feature extraction function and a self-annotation function, our method provides guidance signals at various image granularities: from the level of holistic images to object boxes and even segmentation masks. Our experiments on single-label and multi-label image datasets demonstrate that self-labeled guidance always outperforms diffusion models without guidance and may even surpass guidance based on ground-truth labels, especially on unbalanced data. When equipped with self-supervised box or mask proposals, our method further generates visually diverse yet semantically consistent images, without the need for any class, box, or segment label annotation. Self-guided diffusion is simple, flexible, and expected to profit from deployment at scale. Source code will be available at: https://taohu.me/sgdm/
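A minimal sketch of the holistic-image variant, assuming a frozen self-supervised feature extractor and k-means pseudo-labels as the self-annotation function; all names below are hypothetical, and the guidance step follows the usual classifier-free form rather than the paper's exact recipe.

```python
import torch
from sklearn.cluster import KMeans

def self_annotate(images, feature_fn, k=100):
    """Cluster self-supervised features into k pseudo-classes
    (hypothetical stand-in for the paper's self-annotation function)."""
    with torch.no_grad():
        feats = feature_fn(images)              # e.g. a DINO/SimCLR-style encoder
    labels = KMeans(n_clusters=k).fit_predict(feats.cpu().numpy())
    return torch.as_tensor(labels)

def guided_eps(model, x_t, t, y, null_y, w=2.0):
    """Classifier-free-style guidance with pseudo-labels y; null_y is the
    'no condition' token used during conditional training."""
    eps_cond = model(x_t, t, y)
    eps_uncond = model(x_t, t, null_y)
    return eps_uncond + w * (eps_cond - eps_uncond)
```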
The Trade-off between Universality and Label Efficiency of Representations from Contrastive Learning
Shi, Zhenmei, Chen, Jiefeng, Li, Kunyang, Raghuram, Jayaram, Wu, Xi, Liang, Yingyu, Jha, Somesh
Pre-training representations (a.k.a. foundation models) has recently become a prevalent learning paradigm, where one first pre-trains a representation using large-scale unlabeled data, and then learns simple predictors on top of the representation using small labeled data from the downstream tasks. There are two key desiderata for the representation: label efficiency (the ability to learn an accurate classifier on top of the representation with a small amount of labeled data) and universality (usefulness across a wide range of downstream tasks). In this paper, we focus on one of the most popular instantiations of this paradigm: contrastive learning with linear probing, i.e., learning a linear predictor on the representation pre-trained by contrastive learning. We show that there exists a trade-off between the two desiderata, so that one may not be able to achieve both simultaneously. Specifically, we provide analysis using a theoretical data model and show that, while more diverse pre-training data result in more diverse features for different tasks (improving universality), they put less emphasis on task-specific features, giving rise to larger sample complexity for downstream supervised tasks, and thus worse prediction performance. Guided by this analysis, we propose a contrastive regularization method to improve the trade-off. We validate our analysis and method empirically with systematic experiments using real-world datasets and foundation models.
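For concreteness, linear probing just trains a linear classifier on frozen pre-trained features; a minimal PyTorch sketch (encoder interface and hyperparameters assumed):

```python
import torch
import torch.nn as nn

def linear_probe(encoder, loader, num_classes, feat_dim, epochs=10, lr=1e-3):
    """Fit a linear head on top of a frozen pre-trained representation."""
    encoder.eval()
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                z = encoder(x)                  # frozen features
            loss = ce(head(z), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```

Label efficiency is then measured by how well this head performs when loader contains only a small labeled sample.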
Transfer Learning with Kernel Methods
Radhakrishnan, Adityanarayanan, Luyten, Max Ruiz, Prasad, Neha, Uhler, Caroline
Transfer learning refers to the machine learning problem of utilizing knowledge from a source task to improve performance on a target task. Recent approaches to transfer learning have achieved tremendous empirical success in many applications, including computer vision [17, 45], natural language processing [16, 40, 43], and the biomedical field [15, 19]. Since transfer learning approaches generally rely on complex deep neural networks, it can be difficult to characterize when and why they work [44]. Kernel methods [46] are conceptually and computationally simple machine learning models that have been found to be competitive with neural networks on a variety of tasks, including image classification [3, 29, 42] and drug screening [42]. Their simplicity stems from the fact that training a kernel method involves performing linear regression after transforming the data. There has been renewed interest in kernels due to a recently established equivalence between wide neural networks and kernel methods [2, 25], which has led to the development of modern neural tangent kernels (NTKs) that are competitive with neural networks. Given their simplicity and effectiveness, kernel methods could provide a powerful approach for transfer learning and also help characterize when transfer learning between a source and target task would be beneficial. However, developing an algorithm for transfer learning with kernel methods for general source and target tasks has been an open problem. In particular, while there is a standard transfer learning approach for neural networks that involves replacing and re-training the last layer of a pre-trained network, there is no known corresponding operation for kernels.
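To unpack "linear regression after transforming the data": kernel ridge regression fits coefficients by solving a single linear system in the kernel feature space. A minimal NumPy sketch with an RBF kernel (an illustration of kernel methods generally, not the paper's transfer algorithm):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2), computed pairwise."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=1e-3, gamma=1.0):
    """Solve (K + lam*I) alpha = y: linear regression in kernel space."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_test, gamma=1.0):
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```

The open problem the paragraph describes is precisely that there is no obvious analogue, in terms of K and alpha, of swapping out a network's last layer.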
Hybrid Generative-Contrastive Representation Learning
Kim, Saehoon, Kim, Sungwoong, Lee, Juho
Unsupervised representation learning has recently received significant interest due to its powerful generalizability through effectively leveraging large-scale unlabeled data. There are two prevalent approaches for this: contrastive learning and generative pre-training, where the former learns representations from instance-wise discrimination tasks and the latter learns them by estimating the likelihood. These seemingly orthogonal approaches have their own strengths and weaknesses. Contrastive learning tends to extract semantic information and discard details irrelevant for classifying objects, making the representations effective for discriminative tasks while degrading robustness to out-of-distribution data. On the other hand, generative pre-training directly estimates the data distribution, so the representations tend to be robust but not optimal for discriminative tasks. In this paper, we show that we can achieve the best of both worlds through a hybrid training scheme. Specifically, we demonstrate that a transformer-based encoder-decoder architecture trained with both contrastive and generative losses can learn highly discriminative and robust representations without hurting the generative performance. We extensively validate our approach on various tasks.
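The hybrid scheme can be sketched as a weighted sum of an instance-discrimination loss and a likelihood term; the interfaces below (notably decoder.neg_log_likelihood) are hypothetical stand-ins for the paper's transformer-based architecture.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """InfoNCE between two augmented views, using in-batch negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau
    targets = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, targets)

def hybrid_loss(encoder, decoder, x1, x2, lam=1.0):
    """Joint objective: discriminate instances AND model the likelihood."""
    z1, z2 = encoder(x1), encoder(x2)
    nll = decoder.neg_log_likelihood(x1, z1)    # hypothetical generative head
    return info_nce(z1, z2) + lam * nll
```

The weight lam trades off discriminative sharpness against likelihood quality, which is the knob the hybrid scheme tunes.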
Invertible DenseNets with Concatenated LipSwish
Perugachi-Diaz, Yura, Tomczak, Jakub M., Bhulai, Sandjai
We introduce Invertible Dense Networks (i-DenseNets), a more parameter-efficient alternative to Residual Flows. The method relies on an analysis of the Lipschitz continuity of the concatenation in DenseNets, where we enforce invertibility of the network by satisfying the Lipschitz constant. We extend this method by proposing a learnable concatenation, which not only improves the model performance but also indicates the importance of the concatenated representation. Additionally, we introduce the Concatenated LipSwish as an activation function, for which we show how to enforce the Lipschitz condition and which boosts performance. The new architecture, i-DenseNet, outperforms Residual Flows and other flow-based models on density estimation evaluated in bits per dimension, where we utilize an equal parameter budget. Moreover, we show that the proposed model outperforms Residual Flows when trained as a hybrid model, where the model is both a generative and a discriminative model.
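A sketch of the Concatenated LipSwish idea, under the assumption that the activation concatenates Swish(x) and Swish(-x) channel-wise and rescales by a numerically estimated Lipschitz bound; the paper's exact normalization constant may differ.

```python
import torch

def swish(x, beta=1.0):
    return x * torch.sigmoid(beta * x)

def swish_prime(x, beta=1.0):
    """Analytic derivative of Swish: s(bx) + b*x*s(bx)*(1 - s(bx))."""
    s = torch.sigmoid(beta * x)
    return s + beta * x * s * (1 - s)

def concat_lip_bound(beta=1.0):
    """Grid estimate of sup_x sqrt(s'(x)^2 + s'(-x)^2) for the concatenation."""
    x = torch.linspace(-20.0, 20.0, 40001)
    return torch.sqrt(swish_prime(x, beta) ** 2 + swish_prime(-x, beta) ** 2).max()

class CLipSwish(torch.nn.Module):
    """Concatenated LipSwish sketch: channel-wise concat of Swish(x) and
    Swish(-x), rescaled so the activation is (at most) 1-Lipschitz."""
    def __init__(self, beta=1.0):
        super().__init__()
        self.beta = beta
        self.bound = concat_lip_bound(beta).item()

    def forward(self, x):                       # assumes NCHW input
        return torch.cat([swish(x, self.beta),
                          swish(-x, self.beta)], dim=1) / self.bound
```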
Super-resolution Variational Auto-Encoders
Gatopoulos, Ioannis, Stol, Maarten, Tomczak, Jakub M.
The framework of variational autoencoders (VAEs) provides a principled method for jointly learning latent-variable models and corresponding inference models. However, the main drawback of this approach is the blurriness of the generated images. Some studies link this effect to the objective function, namely, the (negative) log-likelihood. Here, we propose to enhance VAEs by adding a random variable that is a downscaled version of the original image, while still using the log-likelihood function as the learning objective. Further, by providing the downscaled image as an input to the decoder, it can be used in a manner similar to super-resolution. We show empirically that the proposed approach performs comparably to VAEs in terms of the negative log-likelihood, while obtaining a better FID score in data synthesis.
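A minimal sketch of the construction, with hypothetical encoder/decoder interfaces: the downscaled image y is produced by average pooling and fed to the decoder alongside the latent z (the term for modeling y itself is omitted here, and a Bernoulli likelihood over pixels in [0, 1] is assumed).

```python
import torch
import torch.nn.functional as F

def srvae_loss(encoder, decoder, x, scale=2):
    """Negative ELBO for a VAE whose decoder also sees a downscaled image."""
    y = F.avg_pool2d(x, kernel_size=scale)      # downscaled version of x
    mu, logvar = encoder(x, y)                  # amortized posterior q(z | x, y)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    x_logits = decoder(z, y)                    # decoder conditioned on y, like SR
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```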