Zhong, Guo
Layer-Wise Feature Metric of Semantic-Pixel Matching for Few-Shot Learning
Tang, Hao, Lu, Junhao, Huang, Guoheng, Li, Ming, Chen, Xuhang, Zhong, Guo, Tan, Zhengguang, Li, Zinuo
Humans have the ability to abstract and generalize low-level visual elements, such as contours, edges, colors, textures, and shapes, into high-level semantic features that aid in recognizing and understanding the similarities and differences between objects. This capability is particularly crucial in few-shot classification tasks, as it allows models to accurately identify and distinguish between categories by contrasting critical high-level semantic features, even when only a limited number of samples from new categories is available. In contrast, traditional deep learning methods [1, 2] typically rely on large amounts of labeled training data to recognize and classify specific objects or concepts. Such models may struggle in few-shot scenarios because they are not designed to learn quickly from a small amount of data. Recently, few-shot learning methods have been introduced to address this limitation; they typically require only a few images to capture the characteristics of a class and generalize these features to unseen images for inductive reasoning.
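The inductive, metric-style reasoning described above can be made concrete with a minimal episode sketch: a generic nearest-prototype classifier over pre-extracted features. This is an illustrative assumption only; the paper's layer-wise semantic-pixel matching metric is not specified in this abstract, and the feature dimensions, the Euclidean distance, and the `fewshot_episode` helper below are hypothetical.

```python
# A minimal, generic metric-based few-shot episode (nearest-prototype classification).
# Illustrative sketch only: the backbone features are faked with random tensors, and the
# distance is plain Euclidean rather than the paper's layer-wise semantic-pixel metric.
import torch

def fewshot_episode(support_feats, support_labels, query_feats, n_way):
    """support_feats: [n_way*k_shot, d], support_labels: values in {0..n_way-1},
    query_feats: [n_query, d]. Returns predicted class indices for the queries."""
    prototypes = torch.stack([
        support_feats[support_labels == c].mean(dim=0) for c in range(n_way)
    ])                                             # [n_way, d] class centroids
    dists = torch.cdist(query_feats, prototypes)   # [n_query, n_way] distances
    return dists.argmin(dim=1)                     # nearest prototype wins

# Toy usage with random "features" standing in for a backbone's output.
torch.manual_seed(0)
n_way, k_shot, d = 5, 3, 64
support = torch.randn(n_way * k_shot, d)
labels = torch.arange(n_way).repeat_interleave(k_shot)
queries = torch.randn(10, d)
print(fewshot_episode(support, labels, queries, n_way))
```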
IMAN: An Adaptive Network for Robust NPC Mortality Prediction with Missing Modalities
Huo, Yejing, Huang, Guoheng, Cheng, Lianglun, He, Jianbin, Chen, Xuhang, Yuan, Xiaochen, Zhong, Guo, Pun, Chi-Man
Accurate prediction of mortality in nasopharyngeal carcinoma (NPC), a complex malignancy particularly challenging in advanced stages, is crucial for optimizing treatment strategies and improving patient outcomes. However, this predictive process is often compromised by the high-dimensional and heterogeneous nature of NPC-related data, coupled with the pervasive issue of incomplete multi-modal data, manifesting as missing radiological images or incomplete diagnostic reports. Traditional machine learning approaches suffer significant performance degradation when faced with such incomplete data, as they fail to effectively handle the high dimensionality and intricate correlations across modalities. Even advanced multi-modal learning techniques like Transformers struggle to maintain robust performance in the presence of missing modalities, as they lack specialized mechanisms to adaptively integrate and align the diverse data types, while also capturing nuanced patterns and contextual relationships within the complex NPC data. To address these problems, we introduce IMAN: an adaptive network for robust NPC mortality prediction with missing modalities.
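As a rough illustration of how a multi-modal predictor can stay usable when a modality is absent, the sketch below encodes each available modality, substitutes a learned placeholder for missing ones, and fuses the tokens with attention. This is not the IMAN architecture, whose internals the abstract does not detail; the `MaskedFusionClassifier` class, the modality names, and the feature sizes are assumptions for illustration.

```python
# Hypothetical masked-fusion baseline for missing modalities (NOT IMAN).
import torch
import torch.nn as nn

class MaskedFusionClassifier(nn.Module):
    def __init__(self, dims=None, hidden=256):
        super().__init__()
        dims = dims or {"image": 512, "report": 768}      # assumed per-modality feature sizes
        self.encoders = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        # A learned placeholder embedding stands in for each modality when it is missing.
        self.missing = nn.ParameterDict({m: nn.Parameter(torch.zeros(hidden)) for m in dims})
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, 1)                  # mortality-risk logit

    def forward(self, inputs):
        # inputs: dict of modality -> [B, dim] tensor, or None when that modality is missing
        batch = next(x for x in inputs.values() if x is not None).shape[0]
        tokens = []
        for m, enc in self.encoders.items():
            x = inputs.get(m)
            tokens.append(self.missing[m].expand(batch, -1) if x is None else enc(x))
        seq = torch.stack(tokens, dim=1)       # [B, n_modalities, hidden]
        fused, _ = self.attn(seq, seq, seq)    # let present and placeholder tokens interact
        return self.head(fused.mean(dim=1))    # [B, 1]

model = MaskedFusionClassifier()
print(model({"image": torch.randn(2, 512), "report": None}).shape)  # torch.Size([2, 1])
```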
TAGE: Trustworthy Attribute Group Editing for Stable Few-shot Image Generation
Zhang, Ruicheng, Huang, Guoheng, Huo, Yejing, Yuan, Xiaochen, Zhou, Zhizhen, Chen, Xuhang, Zhong, Guo
Generative Adversarial Networks (GANs) have emerged as a prominent research focus for image editing tasks, leveraging the powerful image generation capabilities of the GAN framework to produce remarkable results. However, prevailing approaches are contingent upon extensive training datasets and explicit supervision, presenting a significant challenge in manipulating the diverse attributes of new image classes with limited sample availability. To surmount this hurdle, we introduce TAGE, an innovative image generation network comprising three integral modules: the Codebook Learning Module (CLM), the Code Prediction Module (CPM), and the Prompt-driven Semantic Module (PSM). The CLM delves into the semantic dimensions of category-agnostic attributes, encapsulating them within a discrete codebook. It is predicated on the concept that images are assemblages of attributes, so by editing these category-independent attributes it is theoretically possible to generate images from unseen categories. Subsequently, the CPM facilitates naturalistic image editing by predicting indices of category-independent attribute vectors within the codebook. Additionally, the PSM generates semantic cues that are seamlessly integrated into the Transformer architecture of the CPM, enhancing the model's comprehension of the targeted attributes for editing. With these semantic cues, the model can generate images that accentuate desired attributes more prominently while maintaining the integrity of the original category, even with a limited number of samples. We have conducted extensive experiments utilizing the Animal Faces, Flowers, and VGGFaces datasets. The results demonstrate that our proposed method not only achieves superior performance but also exhibits a high degree of stability compared with other few-shot image generation techniques.
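The codebook idea, that an image is an assemblage of attribute codes and an edit is a swap of one code index, can be sketched in a few lines. The codebook size, the nearest-neighbor quantization, and the `quantize`/`edit` helpers are illustrative assumptions rather than TAGE's exact CLM/CPM.

```python
# Minimal sketch of codebook-based attribute editing: continuous attribute features are
# snapped to their nearest entries in a discrete codebook, and an "edit" replaces one
# code index. Shapes and the codebook size are assumptions for illustration.
import torch

torch.manual_seed(0)
codebook = torch.randn(32, 64)          # 32 candidate attribute vectors, 64-dim each

def quantize(attr_feats):
    """attr_feats: [n_attrs, 64] continuous attribute features -> discrete code indices."""
    dists = torch.cdist(attr_feats, codebook)     # [n_attrs, 32]
    return dists.argmin(dim=1)                    # index of the nearest codebook entry

def edit(indices, slot, new_index):
    """'Edit' an image representation by replacing one attribute code."""
    out = indices.clone()
    out[slot] = new_index
    return out

feats = torch.randn(6, 64)              # 6 category-independent attribute slots
codes = quantize(feats)                 # e.g. tensor([ 3, 17,  5, ...])
edited = edit(codes, slot=2, new_index=11)
recon = codebook[edited]                # [6, 64] edited attribute vectors for a decoder
print(codes.tolist(), edited.tolist(), recon.shape)
```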
Medical Visual Prompting (MVP): A Unified Framework for Versatile and High-Quality Medical Image Segmentation
Chen, Yulin, Huang, Guoheng, Huang, Kai, Lin, Zijin, Zhong, Guo, Luo, Shenghong, Deng, Jie, Zhou, Jian
Accurate segmentation of lesion regions is crucial for clinical diagnosis and treatment across various diseases. While deep convolutional networks have achieved satisfactory results in medical image segmentation, they face challenges such as loss of lesion shape information due to repeated convolution and downsampling, as well as the high cost of manually labeling lesions with varying shapes and sizes. To address these issues, we propose a novel medical visual prompting (MVP) framework that leverages pre-training and prompting concepts from natural language processing (NLP). The framework utilizes three key components: Super-Pixel Guided Prompting (SPGP) for superpixelating the input image, Image Embedding Guided Prompting (IEGP) for freezing the patch embedding and merging it with superpixels to provide visual prompts, and Adaptive Attention Mechanism Guided Prompting (AAGP) for pinpointing prompt content and efficiently adapting all layers. By integrating SPGP, IEGP, and AAGP, MVP enables the segmentation network to better learn shape prompting information and facilitates mutual learning across different tasks. Extensive experiments conducted on five datasets demonstrate the superior performance of this method on various challenging medical imaging tasks, while simplifying single-task medical segmentation models. This novel framework offers improved performance with fewer parameters and holds significant potential for accurate segmentation of lesion regions in various medical tasks, making it clinically valuable.
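To make the superpixel-as-prompt idea tangible, the sketch below over-segments an image with SLIC and summarizes each region as a coarse prompt that could be handed to a frozen patch-embedding backbone. The mean-color pooling and the ViT-style patch grid mentioned in the comments are assumptions; MVP's exact SPGP/IEGP wiring is not reproduced here.

```python
# Rough sketch of "superpixels as visual prompts": over-segment the input, summarize each
# region, and treat the summaries as extra prompt tokens for a frozen patch embedding.
import numpy as np
from skimage.segmentation import slic

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                      # stand-in for a medical image (RGB)

segments = slic(image, n_segments=64, compactness=10)  # [224, 224] superpixel labels

# Summarize every superpixel by its mean color: one coarse "shape prompt" per region.
prompts = np.stack([image[segments == s].mean(axis=0) for s in np.unique(segments)])
print(len(np.unique(segments)), prompts.shape)         # ~64 regions, prompts: [n_regions, 3]

# These region summaries would then be projected and concatenated with the frozen patch
# embeddings of a ViT-style encoder (16x16 patches for a 224x224 input -> 196 tokens).
```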
RBA-GCN: Relational Bilevel Aggregation Graph Convolutional Network for Emotion Recognition
Yuan, Lin, Huang, Guoheng, Li, Fenghuan, Yuan, Xiaochen, Pun, Chi-Man, Zhong, Guo
Emotion recognition in conversation (ERC) has received increasing attention from researchers due to its wide range of applications. As conversation has a natural graph structure, numerous approaches that model ERC with graph convolutional networks (GCNs) have yielded significant results. However, the aggregation approach of traditional GCNs suffers from the node information redundancy problem, leading to a loss of node discriminant information. Additionally, single-layer GCNs lack the capacity to capture long-range contextual information from the graph. Furthermore, the majority of approaches are based on the textual modality or simply stitch different modalities together, resulting in a weak ability to capture interactions between modalities. To address these problems, we present the relational bilevel aggregation graph convolutional network (RBA-GCN), which consists of three modules: the graph generation module (GGM), the similarity-based cluster building module (SCBM), and the bilevel aggregation module (BiAM). First, GGM constructs a novel graph to reduce the redundancy of target node information. Then, SCBM calculates the similarity between the target node and its structural neighborhood, and noisy information with low similarity is filtered out to preserve the discriminant information of the node. Meanwhile, BiAM is a novel aggregation method that preserves node information during the aggregation process; it can construct interactions between different modalities and capture long-range contextual information based on similarity clusters. On both the IEMOCAP and MELD datasets, the weighted average F1 score of RBA-GCN shows a 2.17-5.21% improvement over that of the most advanced method. Our code is available at https://github.com/luftmenscher/RBA-GCN, and our article of the same name has been published in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, 2023.
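The two ideas the abstract emphasizes, filtering a node's neighborhood by similarity and then aggregating only over the surviving cluster, can be sketched in a few lines. The cosine similarity, the fixed threshold, and the mean aggregator below are illustrative choices, not RBA-GCN's actual SCBM/BiAM.

```python
# Toy sketch: (1) drop neighbors whose features are too dissimilar to the target node,
# (2) aggregate over the surviving similarity cluster instead of the full neighborhood.
import torch
import torch.nn.functional as F

def similarity_filtered_aggregate(node_feats, adj, tau=0.5):
    """node_feats: [N, d]; adj: [N, N] binary adjacency; returns aggregated feats [N, d]."""
    sim = F.cosine_similarity(node_feats.unsqueeze(1), node_feats.unsqueeze(0), dim=-1)
    keep = (sim >= tau) & adj.bool()                              # similarity cluster per node
    keep = keep | torch.eye(len(node_feats), dtype=torch.bool)    # always keep the self-loop
    weights = keep.float() / keep.float().sum(dim=1, keepdim=True)
    return weights @ node_feats                                   # mean over retained cluster

torch.manual_seed(0)
x = torch.randn(6, 16)                          # 6 utterance nodes, 16-dim features
adj = (torch.rand(6, 6) > 0.5).float()          # random conversation graph
print(similarity_filtered_aggregate(x, adj).shape)   # torch.Size([6, 16])
```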