Kang, Wenxiong
Pseudo-label Refinement for Improving Self-Supervised Learning Systems
Zia-ur-Rehman, null, Mahmood, Arif, Kang, Wenxiong
Self-supervised learning systems have gained significant attention in recent years by leveraging clustering-based pseudo-labels to provide supervision without the need for human annotations. However, the noise in these pseudo-labels caused by the clustering methods poses a challenge to the learning process leading to degraded performance. In this work, we propose a pseudo-label refinement (SLR) algorithm to address this issue. The cluster labels from the previous epoch are projected to the current epoch cluster-labels space and a linear combination of the new label and the projected label is computed as a soft refined label containing the information from the previous epoch clusters as well as from the current epoch. In contrast to the common practice of using the maximum value as a cluster/class indicator, we employ hierarchical clustering on these soft pseudo-labels to generate refined hard-labels. This approach better utilizes the information embedded in the soft labels, outperforming the simple maximum value approach for hard label generation. The effectiveness of the proposed SLR algorithm is evaluated in the context of person re-identification (Re-ID) using unsupervised domain adaptation (UDA). Experimental results demonstrate that the modified Re-ID baseline, incorporating the SLR algorithm, achieves significantly improved mean Average Precision (mAP) performance in various UDA tasks, including real-to-synthetic, synthetic-to-real, and different real-to-real scenarios. These findings highlight the efficacy of the SLR algorithm in enhancing the performance of self-supervised learning systems.
Improving 3D Finger Traits Recognition via Generalizable Neural Rendering
Xu, Hongbin, Huang, Junduan, Ma, Yuer, Li, Zifeng, Kang, Wenxiong
3D biometric techniques on finger traits have become a new trend and have demonstrated a powerful ability for recognition and anti-counterfeiting. Existing methods follow an explicit 3D pipeline that reconstructs the models first and then extracts features from 3D models. However, these explicit 3D methods suffer from the following problems: 1) Inevitable information dropping during 3D reconstruction; 2) Tight coupling between specific hardware and algorithm for 3D reconstruction. It leads us to a question: Is it indispensable to reconstruct 3D information explicitly in recognition tasks? Hence, we consider this problem in an implicit manner, leaving the nerve-wracking 3D reconstruction problem for learnable neural networks with the help of neural radiance fields (NeRFs). We propose FingerNeRF, a novel generalizable NeRF for 3D finger biometrics. To handle the shape-radiance ambiguity problem that may result in incorrect 3D geometry, we aim to involve extra geometric priors based on the correspondence of binary finger traits like fingerprints or finger veins. First, we propose a novel Trait Guided Transformer (TGT) module to enhance the feature correspondence with the guidance of finger traits. Second, we involve extra geometric constraints on the volume rendering loss with the proposed Depth Distillation Loss and Trait Guided Rendering Loss. To evaluate the performance of the proposed method on different modalities, we collect two new datasets: SCUT-Finger-3D with finger images and SCUT-FingerVein-3D with finger vein images. Moreover, we also utilize the UNSW-3D dataset with fingerprint images for evaluation. In experiments, our FingerNeRF can achieve 4.37% EER on SCUT-Finger-3D dataset, 8.12% EER on SCUT-FingerVein-3D dataset, and 2.90% EER on UNSW-3D dataset, showing the superiority of the proposed implicit method in 3D finger biometrics.
ControLRM: Fast and Controllable 3D Generation via Large Reconstruction Model
Xu, Hongbin, Chen, Weitao, Zhou, Zhipeng, Xiao, Feng, Sun, Baigui, Shou, Mike Zheng, Kang, Wenxiong
Despite recent advancements in 3D generation methods, achieving controllability still remains a challenging issue. Current approaches utilizing score-distillation sampling are hindered by laborious procedures that consume a significant amount of time. Furthermore, the process of first generating 2D representations and then mapping them to 3D lacks internal alignment between the two forms of representation. To address these challenges, we introduce ControLRM, an end-to-end feed-forward model designed for rapid and controllable 3D generation using a large reconstruction model (LRM). ControLRM comprises a 2D condition generator, a condition encoding transformer, and a triplane decoder transformer. Instead of training our model from scratch, we advocate for a joint training framework. In the condition training branch, we lock the triplane decoder and reuses the deep and robust encoding layers pretrained with millions of 3D data in LRM. In the image training branch, we unlock the triplane decoder to establish an implicit alignment between the 2D and 3D representations. To ensure unbiased evaluation, we curate evaluation samples from three distinct datasets (G-OBJ, GSO, ABO) rather than relying on cherry-picking manual generation. The comprehensive experiments conducted on quantitative and qualitative comparisons of 3D controllability and generation quality demonstrate the strong generalization capacity of our proposed approach.
Semi-supervised Deep Multi-view Stereo
Xu, Hongbin, Chen, Weitao, Liu, Yang, Zhou, Zhipeng, Xiao, Haihong, Sun, Baigui, Xie, Xuansong, Kang, Wenxiong
Significant progress has been witnessed in learning-based Multi-view Stereo (MVS) under supervised and unsupervised settings. To combine their respective merits in accuracy and completeness, meantime reducing the demand for expensive labeled data, this paper explores the problem of learning-based MVS in a semi-supervised setting that only a tiny part of the MVS data is attached with dense depth ground truth. However, due to huge variation of scenarios and flexible settings in views, it may break the basic assumption in classic semi-supervised learning, that unlabeled data and labeled data share the same label space and data distribution, named as semi-supervised distribution-gap ambiguity in the MVS problem. To handle these issues, we propose a novel semi-supervised distribution-augmented MVS framework, namely SDA-MVS. For the simple case that the basic assumption works in MVS data, consistency regularization encourages the model predictions to be consistent between original sample and randomly augmented sample. For further troublesome case that the basic assumption is conflicted in MVS data, we propose a novel style consistency loss to alleviate the negative effect caused by the distribution gap. The visual style of unlabeled sample is transferred to labeled sample to shrink the gap, and the model prediction of generated sample is further supervised with the label in original labeled sample. The experimental results in semi-supervised settings of multiple MVS datasets show the superior performance of the proposed method. With the same settings in backbone network, our proposed SDA-MVS outperforms its fully-supervised and unsupervised baselines.
A Novel Transfer Learning Method Utilizing Acoustic and Vibration Signals for Rotating Machinery Fault Diagnosis
Chen, Zhongliang, Huang, Zhuofei, Kang, Wenxiong
Fault diagnosis of rotating machinery plays a important role for the safety and stability of modern industrial systems. However, there is a distribution discrepancy between training data and data of real-world operation scenarios, which causing the decrease of performance of existing systems. This paper proposed a transfer learning based method utilizing acoustic and vibration signal to address this distribution discrepancy. We designed the acoustic and vibration feature fusion MAVgram to offer richer and more reliable information of faults, coordinating with a DNN-based classifier to obtain more effective diagnosis representation. The backbone was pre-trained and then fine-tuned to obtained excellent performance of the target task. Experimental results demonstrate the effectiveness of the proposed method, and achieved improved performance compared to STgram-MFN.
Self-supervised Multi-view Stereo via Effective Co-Segmentation and Data-Augmentation
Xu, Hongbin, Zhou, Zhipeng, Qiao, Yu, Kang, Wenxiong, Wu, Qiuxia
Recent studies have witnessed that self-supervised methods based on view synthesis obtain clear progress on multi-view stereo (MVS). However, existing methods rely on the assumption that the corresponding points among different views share the same color, which may not always be true in practice. This may lead to unreliable self-supervised signal and harm the final reconstruction performance. To address the issue, we propose a framework integrated with more reliable supervision guided by semantic co-segmentation and data-augmentation. Specially, we excavate mutual semantic from multi-view images to guide the semantic consistency. And we devise effective data-augmentation mechanism which ensures the transformation robustness by treating the prediction of regular samples as pseudo ground truth to regularize the prediction of augmented samples. Experimental results on DTU dataset show that our proposed methods achieve the state-of-the-art performance among unsupervised methods, and even compete on par with supervised methods. Furthermore, extensive experiments on Tanks&Temples dataset demonstrate the effective generalization ability of the proposed method.