TPC-ViT: Token Propagation Controller for Efficient Vision Transformer
Zhu, Wentao
Vision transformers (ViTs) have achieved promising results on a variety of computer vision tasks; however, their quadratic complexity in the number of input tokens has limited their application, especially in resource-constrained settings. Previous approaches that employ gradual token reduction to address this challenge assume that token redundancy in one layer implies redundancy in all following layers. We empirically demonstrate that this assumption is often incorrect, i.e., tokens that are redundant in one layer can be useful in later layers. We employ this key insight to propose a novel token propagation controller (TPC) that incorporates two different token distributions, i.e., pause probability and restart probability, to control the reduction and reuse of tokens respectively, which results in more efficient token utilization. To improve the estimates of the token distributions, we propose a smoothing mechanism that acts as a regularizer and helps remove noisy outliers. Furthermore, to improve the training stability of the proposed TPC, we introduce a model stabilizer that implicitly encodes local image structures and minimizes accuracy fluctuations during training. We present extensive experimental results on the ImageNet-1K dataset using DeiT, LV-ViT and Swin models to demonstrate the effectiveness of the proposed method. For example, compared to the baseline models, our method improves the inference speed of DeiT-S by 250% while increasing classification accuracy by 1.0%.
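To make the pause/restart mechanism concrete, below is a minimal PyTorch sketch of the idea described in the abstract, assuming the two token distributions are produced by small linear heads and thresholded at 0.5; the class name, heads, and threshold are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class TokenPropagationController(nn.Module):
    # Hypothetical sketch: per-token pause/restart probabilities decide which
    # tokens the next block processes and which paused tokens are brought back.
    def __init__(self, dim):
        super().__init__()
        self.pause_head = nn.Linear(dim, 1)    # probability of pausing an active token
        self.restart_head = nn.Linear(dim, 1)  # probability of restarting a paused token

    def forward(self, tokens, paused_mask):
        # tokens: (B, N, D); paused_mask: (B, N) bool, True = currently paused
        p_pause = torch.sigmoid(self.pause_head(tokens)).squeeze(-1)
        p_restart = torch.sigmoid(self.restart_head(tokens)).squeeze(-1)
        # paused tokens stay paused unless restarted; active tokens may be paused
        new_paused = torch.where(paused_mask, p_restart < 0.5, p_pause > 0.5)
        return new_paused

A downstream transformer block would then attend only over tokens whose mask is False, while paused tokens are kept in memory so they can be reused in later layers.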
Deformable Audio Transformer for Audio Event Detection
Zhu, Wentao
Transformers have achieved promising results on a variety of tasks. However, the quadratic complexity of self-attention computation has limited their applications, especially in low-resource settings and on mobile or edge devices. Existing works have proposed to exploit hand-crafted attention patterns to reduce computational complexity. However, such hand-crafted patterns are data-agnostic and may not be optimal; hence, relevant keys or values may be discarded while less important ones are preserved. Based on this key insight, we propose a novel deformable audio Transformer for audio recognition, named DATAR, in which a learnable deformable attention mechanism is paired with a pyramid transformer backbone. Such an architecture has been proven effective in prediction tasks, e.g., event classification. Moreover, we identify that the deformable attention map computation may over-simplify the input feature, which can be further enhanced. Hence, we introduce a learnable input adaptor to alleviate this issue, and DATAR achieves state-of-the-art performance.
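As a rough illustration of data-dependent (rather than hand-crafted) attention patterns, the sketch below predicts sampling offsets from the queries and bilinearly samples the feature map at the shifted locations; the module name, offset scale, and final pooling are assumptions made for illustration only, not DATAR's actual design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    # Hypothetical sketch: learn where to sample keys/values instead of using a fixed grid.
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.offset_net = nn.Linear(dim, 2 * n_points)  # (dx, dy) per sampling point
        self.n_points = n_points

    def forward(self, queries, feat, ref_points):
        # queries: (B, N, D); feat: (B, D, H, W) spectrogram features; ref_points: (B, N, 2) in [-1, 1]
        B, N, _ = queries.shape
        offsets = self.offset_net(queries).view(B, N, self.n_points, 2).tanh()
        loc = (ref_points.unsqueeze(2) + 0.1 * offsets).clamp(-1, 1)   # (B, N, P, 2)
        sampled = F.grid_sample(feat, loc, align_corners=False)        # (B, D, N, P)
        return sampled.mean(-1).transpose(1, 2)                        # (B, N, D)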
A Multi-Scale Spatial Transformer U-Net for Simultaneously Automatic Reorientation and Segmentation of 3D Nuclear Cardiac Images
Ni, Yangfan, Zhang, Duo, Ma, Gege, Lu, Lijun, Huang, Zhongke, Zhu, Wentao
Accurate reorientation and segmentation of the left ventricle (LV) are essential for the quantitative analysis of myocardial perfusion imaging (MPI), in which one critical step is to reorient the reconstructed transaxial nuclear cardiac images into standard short-axis slices for subsequent image processing. Small-scale LV myocardium (LV-MY) region detection and the diverse cardiac structures of individual patients pose challenges to the LV segmentation operation. To mitigate these issues, we propose an end-to-end model, named multi-scale spatial transformer UNet (MS-ST-UNet), which combines a multi-scale spatial transformer network (MSSTN) and a multi-scale UNet (MSUNet) module to perform simultaneous reorientation and segmentation of the LV region from nuclear cardiac images. The proposed method is trained and tested on two different nuclear cardiac image modalities: 13N-ammonia PET and 99mTc-sestamibi SPECT. We use a multi-scale strategy to generate and extract image features at different scales. Our experimental results demonstrate that the proposed method significantly improves reorientation and segmentation performance. This joint learning framework promotes mutual enhancement between the reorientation and segmentation tasks, leading to cutting-edge performance and an efficient image processing workflow. The proposed end-to-end deep network has the potential to reduce the burden of manual delineation for cardiac images, thereby providing multimodal quantitative analysis assistance for physicists.
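A minimal sketch of the spatial-transformer reorientation step, assuming a small 3D CNN regresses a 3x4 affine matrix (initialized to the identity) that resamples the transaxial volume before it is passed to a segmentation UNet; the layer sizes and module name are hypothetical and not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RigidReorienter(nn.Module):
    # Hypothetical sketch: regress a 3D affine transform from the transaxial
    # volume and resample it toward short-axis orientation.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(16, 12)  # 3x4 affine parameters
        nn.init.zeros_(self.fc.weight)  # start from the identity transform
        self.fc.bias.data = torch.tensor([1., 0, 0, 0, 0, 1., 0, 0, 0, 0, 1., 0])

    def forward(self, vol):
        # vol: (B, 1, D, H, W) transaxial input volume
        theta = self.fc(self.encoder(vol)).view(-1, 3, 4)
        grid = F.affine_grid(theta, vol.size(), align_corners=False)
        return F.grid_sample(vol, grid, align_corners=False)  # reoriented volume

The reoriented volume would then be fed to the segmentation network, so both tasks can be trained jointly end to end.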
Classification of lung cancer subtypes on CT images with synthetic pathological priors
Zhu, Wentao, Jin, Yuan, Ma, Gege, Chen, Geng, Egger, Jan, Zhang, Shaoting, Metaxas, Dimitris N.
Accurate diagnosis of the pathological subtypes of lung cancer is of significant importance for follow-up treatment and prognosis management. In this paper, we propose a self-generating hybrid feature network (SGHF-Net) for accurately classifying lung cancer subtypes on computed tomography (CT) images. Inspired by studies showing that cross-scale associations exist between the image patterns of a case's CT images and its pathological images, we developed a pathological feature synthetic module (PFSM), which quantitatively maps cross-modality associations through deep neural networks to derive, from CT images, the "gold standard" information contained in the corresponding pathological images. Additionally, we designed a radiological feature extraction module (RFEM) to directly acquire CT image information and integrated it with the pathological priors under an effective feature fusion framework, enabling the entire classification model to generate more indicative and specific pathology-related features and eventually output more accurate predictions. The superiority of the proposed model lies in its ability to self-generate hybrid features that contain multi-modality image information based on a single-modality input. To evaluate the effectiveness, adaptability, and generalization ability of our model, we performed extensive experiments on a large-scale multi-center dataset (i.e., 829 cases from three hospitals) to compare our model with a series of state-of-the-art (SOTA) classification models. The experimental results demonstrate the superiority of our model for lung cancer subtype classification, with significant improvements in accuracy (ACC), area under the curve (AUC), and F1 score.
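The sketch below illustrates the hybrid-feature idea under simple assumptions: one small CNN branch stands in for the radiological feature extraction module (RFEM), another for the pathological feature synthetic module (PFSM), and the two streams are fused by concatenation for classification. The paper additionally supervises the PFSM with paired pathological images, which is omitted here; all layer choices and names are illustrative.

import torch
import torch.nn as nn

class HybridFeatureClassifier(nn.Module):
    # Hypothetical sketch: fuse radiological features and synthesized
    # pathology-like features derived from the same CT input.
    def __init__(self, feat_dim=256, n_classes=3):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
            )
        self.rfem = branch()   # radiological feature extraction
        self.pfsm = branch()   # synthesizes pathology-like features from CT
        self.head = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, ct):
        # ct: (B, 1, H, W); concatenate the two feature streams before the classifier
        return self.head(torch.cat([self.rfem(ct), self.pfsm(ct)], dim=1))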
AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection
Zhu, Wentao, Huang, Yufang, Xie, Xiufeng, Liu, Wenxian, Deng, Jincan, Zhang, Debing, Wang, Zhangyang, Liu, Ji
Short-form videos have exploded in popularity and dominate new social media trends. Prevailing short-video platforms, e.g., Kuaishou (Kwai), TikTok, Instagram Reels, and YouTube Shorts, have changed the way we consume and create content. For video content creation and understanding, shot boundary detection (SBD) is one of the most essential components in various scenarios. In this work, we release a new public Short video sHot bOundary deTection dataset, named SHOT, consisting of 853 complete short videos and 11,606 shot annotations, with 2,716 high-quality shot boundary annotations in 200 test videos. Leveraging this new data wealth, we propose to optimize the model design for video SBD by conducting neural architecture search in a search space encapsulating various advanced 3D ConvNets and Transformers. Our proposed approach, named AutoShot, achieves higher F1 scores than previous state-of-the-art approaches, e.g., outperforming TransNetV2 by 4.2%, when derived and evaluated on our newly constructed SHOT dataset. Moreover, to validate the generalizability of the AutoShot architecture, we directly evaluate it on three other public datasets: ClipShots, BBC and RAI, where the F1 scores of AutoShot outperform previous state-of-the-art approaches by 1.1%, 0.9% and 1.2%, respectively. The SHOT dataset and code can be found at https://github.com/wentaozhu/AutoShot.git .
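For context on the reported numbers, shot boundary detection is commonly scored with an F1 that tolerates small frame offsets between predicted and annotated boundaries; the function below is a generic sketch of such an evaluation, and the tolerance value and matching rule are assumptions rather than the SHOT benchmark's exact protocol.

def shot_boundary_f1(pred, gt, tolerance=1):
    # Hypothetical sketch: a predicted boundary frame counts as correct if it
    # lies within `tolerance` frames of an unmatched ground-truth boundary.
    gt = sorted(gt)
    matched = set()
    tp = 0
    for p in sorted(pred):
        for i, g in enumerate(gt):
            if i not in matched and abs(p - g) <= tolerance:
                matched.add(i)
                tp += 1
                break
    precision = tp / max(len(pred), 1)
    recall = tp / max(len(gt), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)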
Multiscale Audio Spectrogram Transformer for Efficient Audio Classification
Zhu, Wentao, Omar, Mohamed
Audio events have a hierarchical structure in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along the time (and frequency) dimensions at different stages, progressively reducing the number of tokens while increasing the feature dimensions. MAST significantly outperforms AST (Gong et al., 2021) by 22.2%, 4.4% and 4.7% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound in terms of top-1 accuracy without external training data. On the downloaded AudioSet dataset, which has over 20% missing audio, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x more efficient in terms of multiply-accumulates (MACs), with a 42% reduction in the number of parameters compared to AST. Through clustering metrics and visualizations, we demonstrate that the proposed MAST learns semantically more separable feature representations from audio signals.
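A minimal sketch of the stage transition described above, assuming the tokens lie on a (time, frequency) grid: tokens are pooled along the time axis and projected to a wider embedding, so later stages see fewer, higher-dimensional tokens. The class name, pooling operator, and stride are illustrative assumptions rather than MAST's exact configuration.

import torch
import torch.nn as nn

class StageTransition(nn.Module):
    # Hypothetical sketch: reduce token count along time and expand feature width.
    def __init__(self, dim_in, dim_out, time_stride=2):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=(time_stride, 1), stride=(time_stride, 1))
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, x, t, f):
        # x: (B, T*F, C) tokens laid out on a (time, frequency) grid
        B, _, C = x.shape
        x = x.view(B, t, f, C).permute(0, 3, 1, 2)     # (B, C, T, F)
        x = self.pool(x)                               # (B, C, T//stride, F)
        t = x.shape[2]
        x = x.permute(0, 2, 3, 1).reshape(B, t * f, C)
        return self.proj(x), t, f                      # fewer tokens, wider features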
Intelligent Computing: The Latest Advances, Challenges and Future
Zhu, Shiqiang, Yu, Ting, Xu, Tao, Chen, Hongyang, Dustdar, Schahram, Gigan, Sylvain, Gunduz, Deniz, Hossain, Ekram, Jin, Yaochu, Lin, Feng, Liu, Bo, Wan, Zhiguo, Zhang, Ji, Zhao, Zhifeng, Zhu, Wentao, Chen, Zuoning, Durrani, Tariq, Wang, Huaimin, Wu, Jiangxing, Zhang, Tongyi, Pan, Yunhe
Computing is a critical driving force in the development of human civilization. In recent years, we have witnessed the emergence of intelligent computing, a new computing paradigm that is reshaping traditional computing and promoting the digital revolution in the era of big data, artificial intelligence, and the Internet of Things with new computing theories, architectures, methods, systems, and applications. Intelligent computing has greatly broadened the scope of computing, extending it from traditional computing on data to increasingly diverse computing paradigms such as perceptual intelligence, cognitive intelligence, autonomous intelligence, and human-computer fusion intelligence. Intelligence and computing have long followed different paths of evolution and development but have become increasingly intertwined in recent years: intelligent computing is not only intelligence-oriented but also intelligence-driven. Such cross-fertilization has prompted the emergence and rapid advancement of intelligent computing. Intelligent computing is still in its infancy, and an abundance of innovations in the theories, systems, and applications of intelligent computing is expected to occur soon. We present the first comprehensive survey of the literature on intelligent computing, covering its theoretical fundamentals, the technological fusion of intelligence and computing, important applications, challenges, and future perspectives. We believe that this survey is highly timely and will provide a comprehensive reference and cast valuable insights into intelligent computing for academic and industrial researchers and practitioners.
SpeechNAS: Towards Better Trade-off between Latency and Accuracy for Large-Scale Speaker Verification
Zhu, Wentao, Kong, Tianlong, Lu, Shun, Li, Jixiang, Zhang, Dawei, Deng, Feng, Wang, Xiaorui, Yang, Sen, Liu, Ji
Recently, the x-vector has been a successful and popular approach for speaker verification, which employs a time delay neural network (TDNN) and statistics pooling to extract a speaker-characterizing embedding from variable-length utterances. Improving upon the x-vector has been an active research area, and numerous neural networks have been elaborately designed based on the x-vector, e.g., extended TDNN (E-TDNN), factorized TDNN (F-TDNN), and densely connected TDNN (D-TDNN). In this work, we identify optimal architectures from a TDNN-based search space via neural architecture search (NAS); the resulting approach is named SpeechNAS. Leveraging recent advances in speaker recognition, such as high-order statistics pooling, multi-branch mechanisms, D-TDNN, and the additive angular margin softmax (AAM) loss with minimum hyper-spherical energy (MHE), SpeechNAS automatically discovers five network architectures, SpeechNAS-1 to SpeechNAS-5, with various numbers of parameters and GFLOPs, on the large-scale text-independent speaker recognition dataset VoxCeleb1. Our best derived neural network achieves an equal error rate (EER) of 1.02% on the standard test set of VoxCeleb1, which surpasses previous TDNN-based state-of-the-art approaches by a large margin. Code and trained weights are available at https://github.com/wentaozhu/speechnas.git
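As a small illustration of the x-vector pipeline mentioned above, the sketch below shows plain mean-and-standard-deviation statistics pooling over variable-length TDNN frame features; SpeechNAS builds on higher-order statistics, which are not reproduced in this simplified version.

import torch
import torch.nn as nn

class StatisticsPooling(nn.Module):
    # Hypothetical sketch: aggregate frame-level TDNN features into a fixed
    # utterance-level embedding via per-dimension mean and standard deviation.
    def forward(self, frames):
        # frames: (B, C, T) frame-level features from the TDNN trunk
        mean = frames.mean(dim=2)
        std = frames.std(dim=2)
        return torch.cat([mean, std], dim=1)  # (B, 2C) utterance embedding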
Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks
Zhu, Wentao (University of California, Irvine) | Lan, Cuiling (Microsoft Research Asia) | Xing, Junliang (Institute of Automation, Chinese Academy of Sciences) | Zeng, Wenjun (Microsoft Research Asia) | Li, Yanghao (Peking University) | Shen, Li (University of Chinese Academy of Sciences) | Xie, Xiaohui (University of California, Irvine)
Skeleton-based action recognition distinguishes human actions using the trajectories of skeleton joints, which provide a very good representation for describing actions. Considering that recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM) can learn feature representations and model long-term temporal dependencies automatically, we propose an end-to-end fully connected deep LSTM network for skeleton-based action recognition. Inspired by the observation that the co-occurrences of the joints intrinsically characterize human actions, we take the skeleton as the input at each time slot and introduce a novel regularization scheme to learn the co-occurrence features of skeleton joints. To train the deep LSTM network effectively, we propose a new dropout algorithm that simultaneously operates on the gates, cells, and output responses of the LSTM neurons. Experimental results on three human action recognition datasets consistently demonstrate the effectiveness of the proposed model.
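A rough sketch of a co-occurrence style regularizer under simple assumptions: the LSTM input-to-hidden weights are grouped per skeleton joint and penalized with a group-sparsity (L2,1-like) norm so that each neuron focuses on a small subset of co-occurring joints; the grouping and norm here are illustrative, not the paper's exact formulation.

import torch

def cooccurrence_penalty(w_ih, n_joints, coords_per_joint=3):
    # Hypothetical sketch: w_ih is the LSTM input-to-hidden weight matrix with
    # shape (4*hidden, n_joints*coords_per_joint) in PyTorch's layout.
    groups = w_ih.view(w_ih.shape[0], n_joints, coords_per_joint)
    # L2 norm within each (neuron, joint) group, summed (L1) across groups
    return groups.norm(dim=2).sum()

The penalty would be added to the classification loss with a small weight, encouraging each LSTM neuron to respond to only a few joints that tend to move together.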