Xu, Dan
DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation
Hong, Fa-Ting, Shen, Li, Xu, Dan
Predominant techniques for talking head generation largely depend on 2D information, including facial appearance and motion from input face images. Nevertheless, dense 3D facial geometry, such as pixel-wise depth, plays a critical role in constructing accurate 3D facial structures and suppressing complex background noise during generation. However, dense 3D annotations for facial videos are prohibitively costly to obtain. In this work, firstly, we present a novel self-supervised method for learning dense 3D facial geometry (i.e., depth) from face videos, without requiring camera parameters or 3D geometry annotations in training. We further propose a strategy to learn pixel-level uncertainties to perceive more reliable rigid-motion pixels for geometry learning. Secondly, we design an effective geometry-guided facial keypoint estimation module, providing accurate keypoints for generating motion fields. Lastly, we develop a 3D-aware cross-modal (i.e., appearance and depth) attention mechanism, which can be applied to each generation layer to capture facial geometries in a coarse-to-fine manner. Extensive experiments are conducted on three challenging benchmarks (i.e., VoxCeleb1, VoxCeleb2, and HDTF). The results demonstrate that our proposed framework generates highly realistic reenacted talking videos and establishes new state-of-the-art performance on these benchmarks. The code and trained models are publicly available on the GitHub project page at https://github.com/harlanhong/CVPR2022-DaGAN.
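To make the cross-modal attention idea concrete, the following PyTorch sketch shows one way a depth feature map could attend over an appearance feature map at a single generation layer. It is a minimal illustration under the assumption that depth features form the queries and appearance features form the keys and values; the exact DaGAN++ design (feature dimensions, fusion rule, coarse-to-fine application) is not reproduced here.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal (appearance-depth) attention block.

    Assumption for this sketch: depth features provide queries, appearance
    features provide keys/values, and the attended output refines the
    appearance feature map with a residual connection.
    """

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, appearance: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # appearance, depth: (B, C, H, W) feature maps from one generation layer.
        b, c, h, w = appearance.shape
        app = appearance.flatten(2).transpose(1, 2)   # (B, HW, C)
        dep = depth.flatten(2).transpose(1, 2)        # (B, HW, C)
        attended, _ = self.attn(query=dep, key=app, value=app)
        out = self.norm(app + attended)               # residual refinement
        return out.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    block = CrossModalAttention(channels=64)
    appearance, depth = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
    print(block(appearance, depth).shape)  # torch.Size([2, 64, 32, 32])
```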
Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors
Ding, Lihe, Dong, Shaocong, Huang, Zhanpeng, Wang, Zibin, Zhang, Yiyuan, Gong, Kaixiong, Xu, Dan, Xue, Tianfan
Most 3D generation research focuses on up-projecting 2D foundation models into 3D space, either by minimizing the 2D Score Distillation Sampling (SDS) loss or by fine-tuning on multi-view datasets. Without explicit 3D priors, these methods often lead to geometric anomalies and multi-view inconsistency. Recently, researchers have attempted to improve the genuineness of 3D objects by training directly on 3D datasets, albeit at the cost of low-quality texture generation due to the limited texture diversity in 3D datasets. To harness the advantages of both approaches, we propose Bidirectional Diffusion (BiDiff), a unified framework that incorporates both a 3D and a 2D diffusion process, preserving 3D fidelity and 2D texture richness, respectively. Moreover, since a simple combination can yield inconsistent generation results, we further bridge the two processes with novel bidirectional guidance. In addition, our method can be used to initialize optimization-based models, further improving the quality of the 3D models and the efficiency of optimization, and reducing the generation process from 3.4 hours to 20 minutes. Experimental results show that our model achieves high-quality, diverse, and scalable 3D generation. Project website: https://bidiff.github.io/.
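A conceptual sketch of one bidirectional denoising step is given below. The callables denoise_3d, denoise_2d, render, and lift are hypothetical placeholders introduced only for illustration; they do not correspond to the released BiDiff interfaces.

```python
# Conceptual sketch of one bidirectional denoising step, assuming two
# hypothetical denoisers `denoise_3d(x3d_t, t, cond)` and `denoise_2d(x2d_t, t, cond)`
# plus `render(x3d)` / `lift(x2d)` mappings between the two representations.

def bidirectional_step(x3d_t, x2d_t, t, denoise_3d, denoise_2d, render, lift):
    # Each branch is conditioned on the other branch's current estimate, so the
    # 3D geometry constrains the 2D views and the 2D texture guides the 3D shape.
    x3d_prev = denoise_3d(x3d_t, t, cond=lift(x2d_t))
    x2d_prev = denoise_2d(x2d_t, t, cond=render(x3d_prev))
    return x3d_prev, x2d_prev
```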
SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis
Ye, Hanrong, Kuen, Jason, Liu, Qing, Lin, Zhe, Price, Brian, Xu, Dan
We propose SegGen, a highly effective training data generation method for image segmentation, which pushes the performance limits of state-of-the-art segmentation models to a significant extent. On the highly competitive ADE20K and COCO benchmarks, our data generation method markedly improves the performance of state-of-the-art segmentation models in semantic segmentation, panoptic segmentation, and instance segmentation. Notably, in terms of ADE20K mIoU, Mask2Former R50 is largely boosted from 47.2 to 49.9 (+2.7); Mask2Former Swin-L is also significantly increased from 56.1 to 57.4 (+1.3). These promising results strongly suggest the effectiveness of our SegGen even when abundant human-annotated training data is utilized. Moreover, training with our synthetic data makes the segmentation models more robust towards unseen domains.
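The overall synthesis workflow can be summarized by the short sketch below, where text_to_mask and mask_to_image stand in for the Text2Mask and Mask2Img generators. Both callables (and their seed parameter) are hypothetical placeholders for illustration, not a released API.

```python
import random

def synthesize_training_set(prompts, text_to_mask, mask_to_image, per_prompt=4):
    """Generate (image, mask) training pairs from text prompts (illustrative)."""
    synthetic_pairs = []
    for prompt in prompts:
        for _ in range(per_prompt):
            mask = text_to_mask(prompt, seed=random.randint(0, 2**31 - 1))  # new layout per draw
            image = mask_to_image(mask)                                     # image matching the layout
            synthetic_pairs.append((image, mask))
    return synthetic_pairs

# The synthetic pairs would then be mixed with human-annotated data when
# training the segmentation model (e.g., Mask2Former).
```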
Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation
Hong, Fa-Ting, Xu, Dan
Talking head video generation aims to animate a human face in a still image with dynamic poses and expressions using motion information derived from a target driving video, while maintaining the person's identity in the source image. However, dramatic and complex motions in the driving video cause ambiguous generation, because the still source image cannot provide sufficient appearance information for occluded regions or delicate expression variations, which produces severe artifacts and significantly degrades the generation quality. To tackle this problem, we propose to learn a global facial representation space, and design a novel implicit identity representation conditioned memory compensation network, coined as MCNet, for high-fidelity talking head generation. Specifically, we devise a network module to learn a unified spatial facial meta-memory bank from all training samples, which can provide rich facial structure and appearance priors to compensate warped source facial features for the generation. Furthermore, we propose an effective query mechanism based on implicit identity representations learned from the discrete keypoints of the source image. It can greatly facilitate the retrieval of more correlated information from the memory bank for the compensation. Extensive experiments demonstrate that MCNet can learn representative and complementary facial memory, and can clearly outperform previous state-of-the-art talking head generation methods on the VoxCeleb1 and CelebV datasets. Project page: https://github.com/harlanhong/ICCV2023-MCNET.
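The following PyTorch sketch illustrates, under simplifying assumptions, how a globally learned meta-memory could be queried with an implicit identity code derived from source keypoints to compensate warped features. The dimensions, query construction, and fusion rule here are illustrative assumptions, not MCNet's exact design.

```python
import torch
import torch.nn as nn

class MemoryCompensation(nn.Module):
    """Illustrative memory-bank query, loosely following the MCNet idea."""

    def __init__(self, num_slots: int = 256, dim: int = 64, kp: int = 10):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim))  # shared meta-memory bank
        self.id_encoder = nn.Linear(kp * 2, dim)                  # keypoints -> implicit identity code

    def forward(self, warped_feat: torch.Tensor, keypoints: torch.Tensor) -> torch.Tensor:
        # warped_feat: (B, C=dim, H, W) warped source features; keypoints: (B, K, 2)
        b, c, h, w = warped_feat.shape
        identity = self.id_encoder(keypoints.flatten(1))                          # (B, dim)
        query = warped_feat.flatten(2).transpose(1, 2) + identity.unsqueeze(1)    # (B, HW, dim)
        attn = torch.softmax(query @ self.memory.t() / c ** 0.5, dim=-1)          # (B, HW, slots)
        retrieved = attn @ self.memory                                            # (B, HW, dim)
        return warped_feat + retrieved.transpose(1, 2).reshape(b, c, h, w)        # compensation
```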
LP-BFGS attack: An adversarial attack based on the Hessian with limited pixels
Zhang, Jiebao, Qian, Wenhua, Nie, Rencan, Cao, Jinde, Xu, Dan
Deep neural networks are vulnerable to adversarial attacks. Most $L_{0}$-norm based white-box attacks craft perturbations using the gradient of the model with respect to the input. Due to the computational cost and memory limitations of calculating the Hessian matrix, the application of the Hessian or an approximate Hessian in white-box attacks has gradually been shelved. In this work, we note that the sparsity requirement on perturbations naturally lends itself to the use of Hessian information. We study the attack performance and computational cost of an attack method based on the Hessian with a limited number of perturbation pixels. Specifically, we propose the Limited Pixel BFGS (LP-BFGS) attack by combining a perturbation pixel selection strategy with the BFGS algorithm. Pixels with top-k attribution scores calculated by the Integrated Gradient method are treated as the optimization variables of the LP-BFGS attack. Experimental results across different networks and datasets demonstrate that our approach achieves attack performance comparable to existing solutions with reasonable computational cost across different numbers of perturbation pixels.
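A minimal PyTorch sketch of the limited-pixel optimization idea is shown below. For brevity, a single gradient-saliency pass stands in for the Integrated Gradient attribution, and torch.optim.LBFGS optimizes only the selected pixels; the paper's exact objective and constraints are omitted.

```python
import torch
import torch.nn.functional as F

def lp_bfgs_style_attack(model, x, label, k=20, steps=10):
    """Sketch of a limited-pixel, BFGS-style attack (saliency proxy + L-BFGS)."""
    x = x.clone().detach()
    x.requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    grad, = torch.autograd.grad(loss, x)
    saliency = grad.abs().sum(dim=1)                      # (B, H, W), aggregated over channels
    flat = saliency.flatten(1)
    idx = flat.topk(k, dim=1).indices                     # top-k pixel locations per image

    delta = torch.zeros_like(flat, requires_grad=True)    # perturbation over flattened pixels
    mask = torch.zeros_like(flat).scatter_(1, idx, 1.0)   # restrict updates to selected pixels
    opt = torch.optim.LBFGS([delta], max_iter=steps)

    def closure():
        opt.zero_grad()
        pert = (delta * mask).view(x.shape[0], 1, *x.shape[2:])   # broadcast over channels
        adv = (x.detach() + pert).clamp(0, 1)
        obj = -F.cross_entropy(model(adv), label)                  # maximize classification loss
        obj.backward()
        return obj

    opt.step(closure)
    pert = (delta.detach() * mask).view(x.shape[0], 1, *x.shape[2:])
    return (x.detach() + pert).clamp(0, 1)
```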
Joint 2D-3D Multi-Task Learning on Cityscapes-3D: 3D Detection, Segmentation, and Depth Estimation
Ye, Hanrong, Xu, Dan
This report serves as a supplementary document for TaskPrompter, detailing its implementation on a new joint 2D-3D multi-task learning benchmark based on the Cityscapes-3D dataset. TaskPrompter presents an innovative multi-task prompting framework that unifies the learning of (i) task-generic representations, (ii) task-specific representations, and (iii) cross-task interactions, as opposed to previous approaches that separate these learning objectives into different network modules. This unified approach not only reduces the need for meticulous empirical structure design but also significantly enhances the multi-task network's representation learning capability, as the entire model capacity is devoted to optimizing the three objectives simultaneously. TaskPrompter introduces a new multi-task benchmark based on the Cityscapes-3D dataset, which requires the multi-task model to concurrently generate predictions for monocular 3D vehicle detection, semantic segmentation, and monocular depth estimation. These tasks are essential for achieving a joint 2D-3D understanding of visual scenes, particularly in the development of autonomous driving systems. On this challenging benchmark, our multi-task model demonstrates strong performance compared to single-task state-of-the-art methods and establishes new state-of-the-art results on the 3D detection and depth estimation tasks.
Edge Guided GANs with Contrastive Learning for Semantic Image Synthesis
Tang, Hao, Qi, Xiaojuan, Sun, Guolei, Xu, Dan, Sebe, Nicu, Timofte, Radu, Van Gool, Luc
We propose a novel edge-guided generative adversarial network with contrastive learning (ECGAN) for the challenging semantic image synthesis task. Although considerable improvement has been achieved, the quality of synthesized images is far from satisfactory due to three largely unresolved challenges. 1) The semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures. 2) Widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss and thus cannot fully preserve the original semantic information, leading to semantically inconsistent results. 3) Existing methods focus on modeling "local" semantic information from a single input semantic layout; however, they ignore "global" semantic information of multiple input semantic layouts, i.e., semantic cross-relations between pixels across different input layouts. To tackle 1), we propose to use edges as an intermediate representation, which is further adopted to guide image generation via a proposed attention-guided edge transfer module. Edge information is produced by a convolutional generator and introduces detailed structure information. To tackle 2), we design an effective module to selectively highlight class-dependent feature maps according to the original semantic layout to preserve the semantic information. To tackle 3), inspired by current methods in contrastive learning, we propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content than those from different classes. Doing so captures more semantic relations by explicitly exploring the structures of labeled pixels from multiple input semantic layouts. Experiments on three challenging datasets show that our ECGAN achieves significantly better results than state-of-the-art methods.
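The pixel-level contrastive objective can be sketched as a standard InfoNCE-style loss over sampled pixel embeddings, as below. The sampling strategy and the exact ECGAN formulation are simplified assumptions; the sketch only illustrates pulling same-class pixels together and pushing different-class pixels apart.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, temperature=0.1, max_pixels=512):
    """Class-wise pixel contrastive loss over sampled pixel embeddings.

    embeddings: (N, D) pixel features gathered from one or more layouts;
    labels: (N,) semantic class index per pixel.
    """
    if embeddings.shape[0] > max_pixels:                   # subsample for tractability
        sel = torch.randperm(embeddings.shape[0])[:max_pixels]
        embeddings, labels = embeddings[sel], labels[sel]
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                          # (N, N) scaled cosine similarities
    n = z.shape[0]
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye   # same-class pairs, excluding self
    logits = sim.masked_fill(eye, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos.sum(1).clamp(min=1)
    loss = -(log_prob * pos).sum(1) / pos_counts
    return loss[pos.sum(1) > 0].mean()                     # ignore pixels without positives
```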
Exploring Adversarial Examples and Adversarial Robustness of Convolutional Neural Networks by Mutual Information
Zhang, Jiebao, Qian, Wenhua, Nie, Rencan, Cao, Jinde, Xu, Dan
A counter-intuitive property of convolutional neural networks (CNNs) is their inherent susceptibility to adversarial examples, which severely hinders the application of CNNs in security-critical fields. Adversarial examples are similar to original examples but contain malicious perturbations. Adversarial training is a simple and effective defense method to improve the robustness of CNNs to adversarial examples. The mechanisms behind adversarial examples and adversarial training are worth exploring. Therefore, this work investigates the similarities and differences between normally trained CNNs (NT-CNNs) and adversarially trained CNNs (AT-CNNs) in information extraction from the mutual information perspective. We show that 1) for both NT-CNNs and AT-CNNs, the mutual information trends for original and adversarial examples are almost identical throughout training; 2) compared with normal training, adversarial training is more difficult, and AT-CNNs extract less information from the input; 3) CNNs trained with different methods have different preferences for certain types of information: NT-CNNs tend to extract texture-based information from the input, while AT-CNNs prefer shape-based information. The reason why adversarial examples mislead CNNs may be that they contain more texture-based information about other classes. Furthermore, we also analyze the mutual information estimators used in this work and find that they outline the geometric properties of the middle layers' outputs.
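As a point of reference, a basic histogram-based mutual information estimate for one-dimensional variables is sketched below. The estimators analyzed in the paper may differ (e.g., in binning scheme or how high-dimensional layer outputs are handled), so this is only an illustration of the quantity being tracked.

```python
import numpy as np

def mutual_information(x: np.ndarray, y: np.ndarray, bins: int = 30) -> float:
    """Histogram-based estimate of I(X; Y) in nats for 1-D samples x and y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                      # joint distribution estimate
    px = pxy.sum(axis=1, keepdims=True)            # marginal of X
    py = pxy.sum(axis=0, keepdims=True)            # marginal of Y
    nz = pxy > 0                                   # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```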
Lipschitz Continuity Retained Binary Neural Network
Shang, Yuzhang, Xu, Dan, Duan, Bin, Zong, Ziliang, Nie, Liqiang, Yan, Yan
Relying on the premise that the performance of a binary neural network (BNN) can be largely restored by eliminating the quantization error between full-precision weight vectors and their corresponding binary vectors, existing works on network binarization frequently adopt the idea of model robustness to reach this objective. However, robustness remains an ill-defined concept without solid theoretical support. In this work, we introduce Lipschitz continuity, a well-defined functional property, as the rigorous criterion for defining model robustness for BNNs. We then propose to retain the Lipschitz continuity as a regularization term to improve model robustness. In particular, while popular Lipschitz-involved regularization methods often collapse in BNNs due to their extreme sparsity, we design Retention Matrices to approximate the spectral norms of the targeted weight matrices, which can serve as an approximation of the Lipschitz constant of BNNs without the exact (NP-hard) Lipschitz constant computation. Our experiments show that our BNN-specific regularization method effectively enhances the robustness of BNNs (verified on ImageNet-C), achieving state-of-the-art results on CIFAR-10 and ImageNet. Our code is available at https://github.com/42Shawn/LCR_BNN. Keywords: Neural Network Compression, Network Binarization, Lipschitz Continuity
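The sketch below shows a generic spectral-norm-based Lipschitz penalty obtained by power iteration and added to the training loss. It is not the paper's Retention-Matrix approximation; it only illustrates how such a regularization term can be attached to training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spectral_norm_estimate(weight: torch.Tensor, iters: int = 5) -> torch.Tensor:
    """Power-iteration estimate of the largest singular value of a weight tensor."""
    w = weight.reshape(weight.shape[0], -1)        # conv kernels flattened to (out, in*k*k)
    v = torch.randn(w.shape[1], device=w.device)
    for _ in range(iters):
        u = F.normalize(w @ v, dim=0)
        v = F.normalize(w.t() @ u, dim=0)
    return torch.dot(u, w @ v)

def lipschitz_regularizer(model: nn.Module) -> torch.Tensor:
    """Sum of per-layer spectral-norm estimates, usable as a regularization term."""
    terms = [spectral_norm_estimate(m.weight)
             for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    return torch.stack(terms).sum() if terms else torch.tensor(0.0)

# Illustrative usage: total_loss = task_loss + lambda_lip * lipschitz_regularizer(model)
```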
NAST: Non-Autoregressive Spatial-Temporal Transformer for Time Series Forecasting
Chen, Kai, Chen, Guang, Xu, Dan, Zhang, Lijun, Huang, Yuyao, Knoll, Alois
Although the Transformer has achieved breakthrough success in widespread domains, especially in Natural Language Processing (NLP), applying it to time series forecasting remains a great challenge. In time series forecasting, the autoregressive decoding of canonical Transformer models inevitably introduces large accumulative errors. Besides, utilizing the Transformer to handle spatial-temporal dependencies in this problem remains difficult. To tackle these limitations, this work is the first attempt to propose a Non-Autoregressive Transformer architecture for time series forecasting, aiming at overcoming the time delay and accumulative error issues of the canonical Transformer. Moreover, we present a novel spatial-temporal attention mechanism, building a bridge via a learned temporal influence map to fill the gap between spatial and temporal attention, so that spatial and temporal dependencies can be processed integrally. Empirically, we evaluate our model on diversified ego-centric future localization datasets and demonstrate state-of-the-art performance in terms of both real-time efficiency and accuracy.
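To illustrate the non-autoregressive decoding idea, the PyTorch sketch below predicts all horizon steps in parallel from learned future-step queries attending to the encoded history. NAST's spatial-temporal attention and temporal influence map are not modeled here, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class NonAutoregressiveForecaster(nn.Module):
    """Illustrative non-autoregressive decoder for time-series forecasting."""

    def __init__(self, dim: int = 64, horizon: int = 12, in_dim: int = 2, heads: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.queries = nn.Parameter(torch.randn(horizon, dim))     # one learned query per future step
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, in_dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (B, T, in_dim) observed sequence -> (B, horizon, in_dim) forecast
        memory = self.encoder(history)
        q = self.queries.unsqueeze(0).expand(history.shape[0], -1, -1)
        decoded, _ = self.attn(q, memory, memory)                  # all horizon steps at once,
        return self.head(decoded)                                  # no step-by-step feedback
```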