

Adversarial Style Mining for One-Shot Unsupervised Domain Adaptation

Neural Information Processing Systems

The introduction of Domain Adaptation (DA) techniques aims to mitigate the performance drop that occurs when a trained agent encounters a different environment. By bridging the distribution gap between source and target domains, DA methods have shown their effectiveness in many cross-domain tasks such as classification [27, 18], segmentation [19, 22, 23] and detection [3].




Spatial Conditioning Without Bubble Artifacts

Neural Information Processing Systems

Let us begin by recalling how SPADE works and studying where its defects come from. Its statistics are calculated via averages over examples and all spatial dimensions. In Figure 4, we can see that SPADE exhibits these droplet artifacts as well. Despite the rationale behind this idea, we could not find settings where a decrease in distortion was not accompanied by a drastic decrease in quality. SSNs are trained on FFHQ at 256×256 resolution.
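The normalization step described above can be made concrete with a minimal sketch. This is not the paper's implementation: in real SPADE the per-pixel scale and shift are predicted from the segmentation map by a small convolutional network, whereas here they are simply looked up per class (`gamma_maps`, `beta_maps` and the function name are our own illustrative choices). The key point it shows is the batch-norm-style statistics, averaged over both examples and all spatial dimensions.

```python
import numpy as np

def spade_modulate(x, segmap, gamma_maps, beta_maps, eps=1e-5):
    """Sketch of SPADE-style spatial conditioning (illustrative names).

    x:          (N, C, H, W) feature maps
    segmap:     (N, H, W) integer label map
    gamma_maps, beta_maps: (num_classes, C) per-class modulation params
    """
    # Normalize with statistics averaged over examples AND spatial
    # dimensions, as described in the text (batch-norm style).
    mu = x.mean(axis=(0, 2, 3), keepdims=True)        # (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)

    # Spatially varying affine parameters driven by the segmentation map.
    gamma = gamma_maps[segmap].transpose(0, 3, 1, 2)  # (N, C, H, W)
    beta = beta_maps[segmap].transpose(0, 3, 1, 2)
    return gamma * x_hat + beta
```

With identity modulation (gamma = 1, beta = 0) the output is simply the normalized feature map, which makes the role of the segmentation-driven affine parameters easy to isolate.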



Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation

Chigot, Estelle, Wilson, Dennis G., Ghrib, Meriem, Oberlin, Thomas

arXiv.org Artificial Intelligence

Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet, recent foundation models make it possible to generate realistic images without any additional training. This paper proposes to leverage such diffusion models to improve the performance of vision models trained on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: https://github.com/echigot/cactif.
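The class-wise normalization idea behind CACTI can be sketched in a few lines. This is a simplification under our own assumptions (single-channel feature maps, hard class masks; function names are ours, not from the released code): standard AdaIN matches global feature statistics, while the class-wise variant matches statistics only between regions of the same semantic class in the content and style images.

```python
import numpy as np

def adain(x, mu_t, sigma_t, eps=1e-5):
    """Shift the statistics of x toward a target mean and std (AdaIN)."""
    mu, sigma = x.mean(), x.std()
    return sigma_t * (x - mu) / (sigma + eps) + mu_t

def classwise_adain(content, style, content_seg, style_seg, eps=1e-5):
    """Per-class AdaIN over a single-channel feature map (sketch).

    content, style:          (H, W) feature maps
    content_seg, style_seg:  (H, W) integer class maps
    Classes present in both maps get their own style statistics;
    classes missing from either map are left unchanged.
    """
    out = content.copy()
    shared = np.intersect1d(np.unique(content_seg), np.unique(style_seg))
    for c in shared:
        c_mask = content_seg == c
        s_mask = style_seg == c
        out[c_mask] = adain(content[c_mask],
                            style[s_mask].mean(), style[s_mask].std(), eps)
    return out
```

Restricting the statistics to matching classes is what keeps, say, road pixels from inheriting the color statistics of sky pixels, which is the failure mode of global AdaIN that the abstract alludes to.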


Reviewers found our performance remarkable (R1, R2, R3, R4, R5) and identified our contribution to this challenging OSUDA problem

Neural Information Processing Systems

We thank the reviewers for their thoughtful feedback! We are pleased to receive a positive average score, with R2, R4 and R5 giving positive feedback. We will incorporate all feedback in the revision. Here we'd like to emphasize our motivation for ASM again. One concern raised was that RAIN seems to be only a complex version of AdaIN, and is therefore not very attractive.


StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Li, Yinghao Aaron, Han, Cong, Mesgarani, Nima

arXiv.org Artificial Intelligence

Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles and emotional tones remains challenging. Moreover, since duration and speech are generated separately, parallel TTS models still have problems finding the best monotonic alignments that are crucial for naturalistic speech synthesis. Here, we propose StyleTTS, a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance. With novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation schemes, our method significantly outperforms state-of-the-art models on both single and multi-speaker datasets in subjective tests of speech naturalness and speaker similarity. Through self-supervised learning of the speaking styles, our model can synthesize speech with the same prosodic and emotional tone as any given reference speech without the need for explicitly labeling these categories.
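The abstract's point that "duration and speech are generated separately" in parallel TTS can be illustrated with the standard length-regulation step such systems share. This is a generic sketch, not StyleTTS's aligner (the TMA and the augmentation scheme are not reproduced here; the function name is ours): per-phoneme features are upsampled to frame level by repeating each one for its predicted duration, which by construction yields a hard monotonic alignment.

```python
import numpy as np

def length_regulate(phoneme_feats, durations):
    """Generic parallel-TTS duration upsampling (sketch).

    phoneme_feats: (T_phon, D) per-phoneme features
    durations:     (T_phon,) integer frame counts per phoneme
    Returns (sum(durations), D) frame-level features. Each phoneme maps
    to a contiguous run of frames, so the alignment is monotonic.
    """
    return np.repeat(phoneme_feats, durations, axis=0)
```

The difficulty the abstract highlights is that the durations themselves must be predicted well for this expansion to sound natural, which is where alignment quality becomes crucial.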


Inherently Interpretable Multi-Label Classification Using Class-Specific Counterfactuals

Sun, Susu, Woerner, Stefano, Maier, Andreas, Koch, Lisa M., Baumgartner, Christian F.

arXiv.org Artificial Intelligence

Interpretability is essential for machine learning algorithms in high-stakes application fields such as medical image analysis. However, high-performing black-box neural networks do not provide explanations for their predictions, which can lead to mistrust and suboptimal human-ML collaboration. Post-hoc explanation techniques, which are widely used in practice, have been shown to suffer from severe conceptual problems. Furthermore, as we show in this paper, current explanation techniques do not perform adequately in the multi-label scenario, in which multiple medical findings may co-occur in a single image. We propose Attri-Net, an inherently interpretable model for multi-label classification. Attri-Net is a powerful classifier that provides transparent, trustworthy, and human-understandable explanations. The model first generates class-specific attribution maps based on counterfactuals to identify which image regions correspond to certain medical findings. Then a simple logistic regression classifier is used to make predictions based solely on these attribution maps. We compare Attri-Net to five post-hoc explanation techniques and one inherently interpretable classifier on three chest X-ray datasets. We find that Attri-Net produces high-quality multi-label explanations consistent with clinical knowledge and has comparable classification performance to state-of-the-art classification models.
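The two-stage structure described above (class-specific attribution maps, then a simple logistic regression on them) can be sketched as follows. Only the second stage is shown, under our own simplifying assumptions: the counterfactual generator that produces the attribution maps is treated as given, and all names are illustrative rather than taken from the Attri-Net code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_from_attributions(attr_maps, weights, biases):
    """Second stage of an Attri-Net-style classifier (sketch).

    attr_maps: (num_classes, H, W) class-specific attribution maps
               produced by the (omitted) counterfactual generator
    weights:   (num_classes, H*W) logistic-regression weights, one row
               per class
    biases:    (num_classes,)
    Returns per-class probabilities for multi-label prediction. Each
    prediction depends only on its own attribution map, which is what
    keeps the decision transparent.
    """
    k = attr_maps.shape[0]
    flat = attr_maps.reshape(k, -1)          # (num_classes, H*W)
    logits = (flat * weights).sum(axis=1) + biases
    return sigmoid(logits)
```

Because the classifier sees nothing but the attribution maps, the explanation is the input to the decision rather than a post-hoc rationalization, which is the distinction the abstract draws against post-hoc techniques.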


Factor Decomposed Generative Adversarial Networks for Text-to-Image Synthesis

Li, Jiguo, Liu, Xiaobin, Zheng, Lirong

arXiv.org Artificial Intelligence

Prior work on text-to-image synthesis typically concatenated the sentence embedding with the noise vector, yet the sentence embedding and the noise vector are two different factors that control different aspects of the generation. Simply concatenating them entangles the latent factors and encumbers the generative model. In this paper, we attempt to decompose these two factors and propose Factor Decomposed Generative Adversarial Networks (FDGAN). To achieve this, we first generate images from the noise vector and then apply the sentence embedding in the normalization layer for both the generator and the discriminators. We also design an additive norm layer to align and fuse the text-image features. The experimental results show that decomposing the noise and the sentence embedding can disentangle latent factors in text-to-image synthesis and make the generative model more efficient. Compared with the baseline, FDGAN achieves better performance while using fewer parameters.
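The core idea of applying the sentence embedding in a normalization layer rather than concatenating it with the noise can be sketched as conditional instance normalization. This is our own illustrative rendering, not FDGAN's exact layer (the additive norm layer is not reproduced, and the parameter names are assumptions): features are generated from the noise factor alone, then the text factor enters only as a learned per-channel scale and shift.

```python
import numpy as np

def conditional_norm(x, sent_emb, w_gamma, w_beta, eps=1e-5):
    """Conditional-normalization sketch for text conditioning.

    x:        (N, C, H, W) generator features produced from noise alone
    sent_emb: (N, D) sentence embedding
    w_gamma, w_beta: (D, C) linear projections of the embedding to
                     per-channel scale and shift
    """
    # Instance-normalize the features computed from the noise factor.
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = x.std(axis=(2, 3), keepdims=True)
    x_hat = (x - mu) / (sigma + eps)

    # Inject the text factor only as an affine modulation, keeping the
    # two factors in separate pathways instead of one concatenated input.
    gamma = (sent_emb @ w_gamma)[:, :, None, None]   # (N, C, 1, 1)
    beta = (sent_emb @ w_beta)[:, :, None, None]
    return (1.0 + gamma) * x_hat + beta
```

Keeping the noise in the input pathway and the text in the normalization pathway is one way to realize the factor decomposition the abstract argues for: neither factor has to be disentangled from the other after the fact.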