Yang, Fu-En
MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching
Wu, Yen-Siang; Huang, Chi-Pin; Yang, Fu-En; Wang, Yu-Chiang Frank
To control the pacing and flow of AI-generated videos, users should have control over the dynamics and composition of videos produced by generative models. To this end, numerous motion control methods [25, 33, 57, 59, 61, 63, 72] have been proposed to control moving object trajectories in videos generated by text-to-video (T2V) diffusion models [4, 17]. Motion customization, in particular, aims to control T2V diffusion models with the motion of a reference video [26, 31, 36, 71, 76]. With the assistance of the reference video, users are able to specify the desired object movements and camera framing in detail. Formally speaking, given a reference video, motion customization aims to adjust a pre-trained T2V diffusion model so that the output videos sampled from the adjusted model follow the object movements and camera framing of the reference video (see Figure 1 for an example). Since motion is a high-level concept involving both spatial and temporal dimensions [65, 71], motion customization is considered a non-trivial task. Recently, many motion customization methods have been proposed to eliminate the influence of visual appearance in the reference video. Among them, a standout strategy is fine-tuning the pre-trained T2V diffusion model to reconstruct the frame differences of the reference video.
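As a concrete illustration of this frame-difference fine-tuning strategy, a minimal PyTorch-style sketch follows. It assumes an epsilon-prediction T2V diffusion model and a precomputed DDPM noise schedule; all names (t2v_model, ref_latents, text_emb, alphas_cumprod) are hypothetical and not taken from the MotionMatcher implementation.

    import torch
    import torch.nn.functional as F

    def frame_difference_loss(t2v_model, ref_latents, text_emb, alphas_cumprod, t):
        # Frame differences along the temporal axis emphasize motion
        # over static appearance in the reference video.
        ref_diff = ref_latents[:, 1:] - ref_latents[:, :-1]   # (B, T-1, C, H, W)

        # Standard DDPM forward process applied to the difference signal.
        noise = torch.randn_like(ref_diff)
        a = alphas_cumprod[t].view(-1, 1, 1, 1, 1)            # per-sample noise level
        noisy = a.sqrt() * ref_diff + (1.0 - a).sqrt() * noise

        # Epsilon-prediction objective: the model is fine-tuned to denoise
        # frame differences rather than raw frames.
        pred = t2v_model(noisy, t, text_emb)
        return F.mse_loss(pred, noise)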
Language-Guided Transformer for Federated Multi-Label Classification
Liu, I-Jieh; Lin, Ci-Siang; Yang, Fu-En; Wang, Yu-Chiang Frank
Federated Learning (FL) is an emerging paradigm that enables multiple users to collaboratively train a robust model in a privacy-preserving manner without sharing their private data. Most existing FL approaches consider only traditional single-label image classification and ignore the impact of transferring the task to multi-label image classification. Moreover, it remains challenging for FL to handle user heterogeneity in local data distributions in real-world scenarios, and this issue becomes even more severe in multi-label image classification. Since only partial label correlations may be observed by each local client during training, direct aggregation of locally updated models would not produce satisfactory performance. Inspired by the recent success of Transformers in centralized settings, we thus propose a novel FL framework, Language-Guided Transformer (FedLGT), to tackle this challenging task by exploiting and transferring knowledge across different clients for learning a robust global model. Through extensive experiments on various multi-label datasets (e.g., FLAIR and MS-COCO), we show that our FedLGT achieves satisfactory performance and outperforms standard FL techniques under multi-label FL scenarios. Code is available at https://github.com/Jack24658735/FedLGT.
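For context, the direct aggregation mentioned above can be sketched as standard FedAvg. The snippet below is a generic illustration of that baseline, not the FedLGT code; the names (fedavg, client_states, client_sizes) are hypothetical.

    import copy

    def fedavg(global_model, client_states, client_sizes):
        # Weighted average of client parameters by local dataset size.
        # This plain aggregation is the baseline that suffers when each
        # client only observes partial label correlations.
        total = float(sum(client_sizes))
        avg = copy.deepcopy(client_states[0])
        for key in avg:
            avg[key] = sum(
                (n / total) * state[key].float()
                for state, n in zip(client_states, client_sizes)
            )
        global_model.load_state_dict(avg)
        return global_model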
TAX: Tendency-and-Assignment Explainer for Semantic Segmentation with Multi-Annotators
Cheng, Yuan-Chia; Shiau, Zu-Yun; Yang, Fu-En; Wang, Yu-Chiang Frank
To understand how deep neural networks make classification predictions, recent research has focused on developing techniques that offer desirable explanations. However, most existing methods cannot be easily applied to semantic segmentation; moreover, they are not designed to offer interpretability under the multi-annotator setting. Rather than assuming that ground-truth pixel-level labels are produced by a single annotator with a consistent labeling tendency, we aim at providing interpretable semantic segmentation and answer two critical yet practical questions: "who" contributes to the resulting segmentation, and "why" such an assignment is determined. In this paper, we present a learning framework of Tendency-and-Assignment Explainer (TAX), designed to offer interpretability at the annotator and assignment levels. More specifically, we learn convolution kernel subsets to model the labeling tendency of each type of annotation, while a prototype bank is jointly learned to offer visual guidance for learning these kernels. For evaluation, we consider both synthetic and real-world datasets with multiple annotators. We show that our TAX can be applied to state-of-the-art network architectures with comparable performance, while offering segmentation interpretability at both levels.
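The annotator-level modeling can be pictured with a small PyTorch sketch: one convolution kernel subset per annotator over shared backbone features. This is an illustrative simplification with hypothetical names (AnnotatorTendencyHead, feats), not the released TAX implementation.

    import torch.nn as nn

    class AnnotatorTendencyHead(nn.Module):
        # One convolution kernel subset per annotator; each subset models
        # that annotator's labeling tendency over shared backbone features.
        def __init__(self, in_ch, n_classes, n_annotators, k=3):
            super().__init__()
            self.heads = nn.ModuleList(
                nn.Conv2d(in_ch, n_classes, kernel_size=k, padding=k // 2)
                for _ in range(n_annotators)
            )

        def forward(self, feats, annotator_id):
            # Route features through the kernel subset of the given
            # annotator to obtain annotator-specific segmentation logits,
            # exposing "who" labeled a region and how that tendency differs.
            return self.heads[annotator_id](feats)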
Self-Supervised Pyramid Representation Learning for Multi-Label Visual Analysis and Beyond
Hsieh, Cheng-Yen; Chang, Chih-Jung; Yang, Fu-En; Wang, Yu-Chiang Frank
While self-supervised learning (SSL) has been shown to benefit a number of vision tasks, existing techniques mainly focus on image-level manipulation, which may not generalize well to downstream tasks at the patch or pixel level. Moreover, existing SSL methods might not sufficiently describe and associate such representations within and across image scales. In this paper, we propose a Self-Supervised Pyramid Representation Learning (SS-PRL) framework. The proposed SS-PRL is designed to derive pyramid representations at patch levels via learning proper prototypes, with additional learners to observe and relate inherent semantic information within an image. In particular, we present cross-scale patch-level correlation learning in SS-PRL, which allows the model to aggregate and associate information learned across patch scales. We show that, with our proposed SS-PRL for model pre-training, one can easily adapt and fine-tune the models for a variety of applications, including multi-label classification, object detection, and instance segmentation.
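A minimal sketch of the cross-scale correlation idea follows, assuming patch features from two scales and a shared prototype bank. The names and the exact objective are hypothetical; SS-PRL's actual losses differ in detail.

    import torch.nn.functional as F

    def cross_scale_correlation(patch_feats_a, patch_feats_b, prototypes):
        # Soft-assign patch features from two scales to a shared prototype
        # bank and encourage the pooled assignment distributions to agree.
        a = F.normalize(patch_feats_a, dim=-1)         # (Na, D) patches, scale A
        b = F.normalize(patch_feats_b, dim=-1)         # (Nb, D) patches, scale B
        c = F.normalize(prototypes, dim=-1)            # (K, D) prototype bank

        p_a = (a @ c.t()).softmax(dim=-1).mean(dim=0)  # pooled assignment, scale A
        p_b = (b @ c.t()).softmax(dim=-1).mean(dim=0)  # pooled assignment, scale B

        # Cross-entropy between the two scales' assignment distributions.
        return -(p_a * p_b.clamp_min(1e-8).log()).sum()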