Wu, Xiao-Ming
Structure-aware Semantic Node Identifiers for Learning on Graphs
Luo, Yuankai, Liu, Qijiong, Shi, Lei, Wu, Xiao-Ming
We present a novel graph tokenization framework that generates structure-aware, semantic node identifiers (IDs) in the form of a short sequence of discrete codes, serving as symbolic representations of nodes. We employs vector quantization to compress continuous node embeddings from multiple layers of a graph neural network (GNN), into compact, meaningful codes, under both self-supervised and supervised learning paradigms. The resulting node IDs capture a high-level abstraction of graph data, enhancing the efficiency and interpretability of GNNs. Through extensive experiments on 34 datasets, including node classification, graph classification, link prediction, and attributed graph clustering tasks, we demonstrate that our generated node IDs not only improve computational efficiency but also achieve competitive performance compared to current state-of-the-art methods.
Dexterous Grasp Transformer
Xu, Guo-Hao, Wei, Yi-Lin, Zheng, Dian, Wu, Xiao-Ming, Zheng, Wei-Shi
In this work, we propose a novel discriminative framework for dexterous grasp generation, named Dexterous Grasp TRansformer (DGTR), capable of predicting a diverse set of feasible grasp poses by processing the object point cloud with only one forward pass. We formulate dexterous grasp generation as a set prediction task and design a transformer-based grasping model for it. However, we identify that this set prediction paradigm encounters several optimization challenges in the field of dexterous grasping and results in restricted performance. To address these issues, we propose progressive strategies for both the training and testing phases. First, the dynamic-static matching training (DSMT) strategy is presented to enhance the optimization stability during the training phase. Second, we introduce the adversarial-balanced test-time adaptation (AB-TTA) with a pair of adversarial losses to improve grasping quality during the testing phase. Experimental results on the DexGraspNet dataset demonstrate the capability of DGTR to predict dexterous grasp poses with both high quality and diversity. Notably, while keeping high quality, the diversity of grasp poses predicted by DGTR significantly outperforms previous works in multiple metrics without any data pre-processing. Codes are available at https://github.com/iSEE-Laboratory/DGTR .
VI-OOD: A Unified Representation Learning Framework for Textual Out-of-distribution Detection
Zhan, Li-Ming, Liu, Bo, Wu, Xiao-Ming
Out-of-distribution (OOD) detection plays a crucial role in ensuring the safety and reliability of deep neural networks in various applications. While there has been a growing focus on OOD detection in visual data, the field of textual OOD detection has received less attention. Only a few attempts have been made to directly apply general OOD detection methods to natural language processing (NLP) tasks, without adequately considering the characteristics of textual data. In this paper, we delve into textual OOD detection with Transformers. We first identify a key problem prevalent in existing OOD detection methods: the biased representation learned through the maximization of the conditional likelihood $p(y\mid x)$ can potentially result in subpar performance. We then propose a novel variational inference framework for OOD detection (VI-OOD), which maximizes the likelihood of the joint distribution $p(x, y)$ instead of $p(y\mid x)$. VI-OOD is tailored for textual OOD detection by efficiently exploiting the representations of pre-trained Transformers. Through comprehensive experiments on various text classification tasks, VI-OOD demonstrates its effectiveness and wide applicability. Our code has been released at \url{https://github.com/liam0949/LLM-OOD}.
Towards LLM-driven Dialogue State Tracking
Feng, Yujie, Lu, Zexin, Liu, Bo, Zhan, Liming, Wu, Xiao-Ming
Dialogue State Tracking (DST) is of paramount importance in ensuring accurate tracking of user goals and system actions within task-oriented dialogue systems. The emergence of large language models (LLMs) such as GPT3 and ChatGPT has sparked considerable interest in assessing their efficacy across diverse applications. In this study, we conduct an initial examination of ChatGPT's capabilities in DST. Our evaluation uncovers the exceptional performance of ChatGPT in this task, offering valuable insights to researchers regarding its capabilities and providing useful directions for designing and enhancing dialogue systems. Despite its impressive performance, ChatGPT has significant limitations including its closed-source nature, request restrictions, raising data privacy concerns, and lacking local deployment capabilities. To address these concerns, we present LDST, an LLM-driven DST framework based on smaller, open-source foundation models. By utilizing a novel domain-slot instruction tuning method, LDST achieves performance on par with ChatGPT. Comprehensive evaluations across three distinct experimental settings, we find that LDST exhibits remarkable performance improvements in both zero-shot and few-shot setting compared to previous SOTA methods. The source code is provided for reproducibility.
Making Multimodal Generation Easier: When Diffusion Models Meet LLMs
Zhao, Xiangyu, Liu, Bo, Liu, Qijiong, Shi, Guangyuan, Wu, Xiao-Ming
We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominately depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge the gap between modalities, EasyGen is built upon a bidirectional conditional diffusion model named BiDiffuser, which promotes more efficient interactions between modalities. EasyGen handles image-to-text generation by integrating BiDiffuser and an LLM via a simple projection layer. Unlike most existing multimodal models that are limited to generating text responses, EasyGen can also facilitate text-to-image generation by leveraging the LLM to create textual descriptions, which can be interpreted by BiDiffuser to generate appropriate visual responses. Extensive quantitative and qualitative experiments demonstrate the effectiveness of EasyGen, whose training can be easily achieved in a lab setting. The source code is available at https://github.com/zxy556677/EasyGen.
ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models
Liu, Qijiong, Chen, Nuo, Sakai, Tetsuya, Wu, Xiao-Ming
Personalized content-based recommender systems have become indispensable tools for users to navigate through the vast amount of content available on platforms like daily news websites and book recommendation services. However, existing recommenders face significant challenges in understanding the content of items. Large language models (LLMs), which possess deep semantic comprehension and extensive knowledge from pretraining, have proven to be effective in various natural language processing tasks. In this study, we explore the potential of leveraging both open- and closed-source LLMs to enhance content-based recommendation. With open-source LLMs, we utilize their deep layers as content encoders, enriching the representation of content at the embedding level. For closed-source LLMs, we employ prompting techniques to enrich the training data at the token level. Through comprehensive experiments, we demonstrate the high effectiveness of both types of LLMs and show the synergistic relationship between them. Notably, we observed a significant relative improvement of up to 19.32% compared to existing state-of-the-art recommendation models. These findings highlight the immense potential of both open- and closed-source of LLMs in enhancing content-based recommendation systems. We will make our code and LLM-generated data available for other researchers to reproduce our results.
Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
Xu, Li, Liu, Bo, Khan, Ameer Hamza, Fan, Lu, Wu, Xiao-Ming
With the availability of large-scale, comprehensive, and general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and proven to be effective for various VL tasks such as visual-question answering. However, studies on VLP in the medical domain have so far been scanty. To provide a comprehensive perspective on VLP for medical VL tasks, we conduct a thorough experimental analysis to study key factors that may affect the performance of VLP with a unified vision-language Transformer. To allow making sound and quick pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from an open-access online database MedPix. RGC can be used as a pre-training dataset or a new benchmark for medical report generation and medical image-text retrieval. By utilizing RGC and other available datasets for pre-training, we develop several key insights that can guide future medical VLP research and new strong baselines for various medical VL tasks.
How Good Are Large Language Models at Out-of-Distribution Detection?
Liu, Bo, Zhan, Liming, Lu, Zexin, Feng, Yujie, Xue, Lei, Wu, Xiao-Ming
Out-of-distribution (OOD) detection plays a vital role in enhancing the reliability of machine learning (ML) models. The emergence of large language models (LLMs) has catalyzed a paradigm shift within the ML community, showcasing their exceptional capabilities across diverse natural language processing tasks. While existing research has probed OOD detection with relative small-scale Transformers like BERT, RoBERTa and GPT-2, the stark differences in scales, pre-training objectives, and inference paradigms call into question the applicability of these findings to LLMs. This paper embarks on a pioneering empirical investigation of OOD detection in the domain of LLMs, focusing on LLaMA series ranging from 7B to 65B in size. We thoroughly evaluate commonly-used OOD detectors, scrutinizing their performance in both zero-grad and fine-tuning scenarios. Notably, we alter previous discriminative in-distribution fine-tuning into generative fine-tuning, aligning the pre-training objective of LLMs with downstream tasks. Our findings unveil that a simple cosine distance OOD detector demonstrates superior efficacy, outperforming other OOD detectors. We provide an intriguing explanation for this phenomenon by highlighting the isotropic nature of the embedding spaces of LLMs, which distinctly contrasts with the anisotropic property observed in smaller BERT family models. The new insight enhances our understanding of how LLMs detect OOD data, thereby enhancing their adaptability and reliability in dynamic environments.
Neighborhood-based Hard Negative Mining for Sequential Recommendation
Fan, Lu, Pu, Jiashu, Zhang, Rongsheng, Wu, Xiao-Ming
Negative sampling plays a crucial role in training successful sequential recommendation models. Instead of merely employing random negative sample selection, numerous strategies have been proposed to mine informative negative samples to enhance training and performance. However, few of these approaches utilize structural information. In this work, we observe that as training progresses, the distributions of node-pair similarities in different groups with varying degrees of neighborhood overlap change significantly, suggesting that item pairs in distinct groups may possess different negative relationships. Motivated by this observation, we propose a Graph-based Negative sampling approach based on Neighborhood Overlap (GNNO) to exploit structural information hidden in user behaviors for negative mining. GNNO first constructs a global weighted item transition graph using training sequences. Subsequently, it mines hard negative samples based on the degree of overlap with the target item on the graph. Furthermore, GNNO employs curriculum learning to control the hardness of negative samples, progressing from easy to difficult. Extensive experiments on three Amazon benchmarks demonstrate GNNO's effectiveness in consistently enhancing the performance of various state-of-the-art models and surpassing existing negative sampling strategies. The code will be released at \url{https://github.com/floatSDSDS/GNNO}.
Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training
Zhang, Haode, Liang, Haowen, Zhan, Liming, Wu, Xiao-Ming, Lam, Albert Y. S.
We consider the task of few-shot intent detection, which involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data. The current approach to address this problem is through continual pre-training, i.e., fine-tuning pre-trained language models (PLMs) on external resources (e.g., conversational corpora, public intent detection datasets, or natural language understanding datasets) before using them as utterance encoders for training an intent classifier. In this paper, we show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected. Specifically, we find that directly fine-tuning PLMs on only a handful of labeled examples already yields decent results compared to methods that employ continual pre-training, and the performance gap diminishes rapidly as the number of labeled data increases. To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance. Comprehensive experiments on real-world benchmarks show that given only two or more labeled samples per class, direct fine-tuning outperforms many strong baselines that utilize external data sources for continual pre-training. The code can be found at https://github.com/hdzhang-code/DFTPlus.