King, Irwin
Progress and Opportunities of Foundation Models in Bioinformatics
Li, Qing, Hu, Zhihang, Wang, Yixuan, Li, Lei, Fan, Yimin, King, Irwin, Song, Le, Li, Yu
Bioinformatics has witnessed a paradigm shift with the increasing integration of artificial intelligence (AI), particularly through the adoption of foundation models (FMs). These AI techniques have rapidly advanced, addressing historical challenges in bioinformatics such as the scarcity of annotated data and the presence of data noise. FMs are particularly adept at handling large-scale, unlabeled data, a common scenario in biological contexts due to the time-consuming and costly nature of experimentally determining labels. This characteristic has allowed FMs to excel and achieve notable results in various downstream validation tasks, demonstrating their ability to represent diverse biological entities effectively. Undoubtedly, FMs have ushered in a new era in computational biology, especially in the realm of deep learning. The primary goal of this survey is to conduct a systematic investigation and summary of FMs in bioinformatics, tracing their evolution, current research status, and the methodologies employed. Central to our focus is the application of FMs to specific biological problems, aiming to guide the research community in choosing appropriate FMs for their research needs. We delve into the specifics of each problem, including sequence analysis, structure prediction, function annotation, and multimodal integration, comparing the structures and advances of FMs against traditional methods. Furthermore, this review analyzes the challenges and limitations FMs face in biology, such as data noise, limited model explainability, and potential biases. Finally, we outline potential development paths and strategies for FMs in future biological research, setting the stage for continued innovation and application in this rapidly evolving field. This comprehensive review serves not only as an academic resource but also as a roadmap for future explorations and applications of FMs in biology.
A Survey of Text Watermarking in the Era of Large Language Models
Liu, Aiwei, Pan, Leyi, Lu, Yijian, Li, Jingjing, Hu, Xuming, Zhang, Xi, Wen, Lijie, King, Irwin, Xiong, Hui, Yu, Philip S.
Text watermarking algorithms play a crucial role in the copyright protection of textual content, yet their capabilities and application scenarios have been limited historically. The recent developments in large language models (LLMs) have opened new opportunities for the advancement of text watermarking techniques. LLMs not only enhance the capabilities of text watermarking algorithms through their text understanding and generation abilities but also necessitate the use of text watermarking algorithms for their own copyright protection. This paper conducts a comprehensive survey of the current state of text watermarking technology, covering four main aspects: (1) an overview and comparison of different text watermarking techniques; (2) evaluation methods for text watermarking algorithms, including their success rates, impact on text quality, robustness, and unforgeability; (3) potential application scenarios for text watermarking technology; (4) current challenges and future directions for development. This survey aims to provide researchers with a thorough understanding of text watermarking technology, thereby promoting its further advancement.
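Many of the generation-time schemes covered by such surveys share a common statistical core: bias generation toward a key-dependent "green" subset of the vocabulary, then detect by testing how often tokens land in that subset. A minimal sketch of the detection side (function names, the hashing scheme, and parameters are illustrative, not taken from any specific paper):

```python
import hashlib
import random

def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5, key: str = "secret") -> set:
    # Seed a PRNG from the secret key and the previous token, then sample
    # a fraction gamma of the vocabulary as the "green" list for this position.
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def detect_z_score(tokens: list, vocab_size: int, gamma: float = 0.5, key: str = "secret") -> float:
    # Count how many tokens fall in their predecessor's green list, then
    # compare against the fraction gamma expected by chance (one-proportion z-test).
    hits = sum(
        1 for prev, tok in zip(tokens, tokens[1:])
        if tok in green_list(prev, vocab_size, gamma, key)
    )
    n = len(tokens) - 1
    return (hits - gamma * n) / (gamma * (1.0 - gamma) * n) ** 0.5
```

A text whose tokens were steered into the green lists yields a large z-score, while unwatermarked text scores near zero; the detector needs the same `key` used at generation, which is exactly the limitation the unforgeable-watermark line of work targets.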
An Unforgeable Publicly Verifiable Watermark for Large Language Models
Liu, Aiwei, Pan, Leyi, Hu, Xuming, Li, Shu'ang, Wen, Lijie, King, Irwin, Yu, Philip S.
Texts generated by large language models (LLMs) increasingly need to be detected and tagged. At present, some watermarking algorithms for LLMs have proved successful in making machine-generated text detectable by adding implicit features during the text generation process that are difficult for humans to discover but easily detected by a specially designed method (Christ et al., 2023; Kirchenbauer et al., 2023). However, current watermark detection algorithms require the secret key used in the watermark generation process, making them susceptible to security breaches and counterfeiting during public detection: these algorithms work well when detection access is restricted to the watermark owner, but in many situations where third-party detection is required, exposing the shared key would enable others to forge the watermark. Preventing watermark forgery in the public detection setting is therefore of great importance. To address this limitation, we propose the first unforgeable publicly verifiable watermarking algorithm for LLMs, which uses two different neural networks for watermark generation and detection instead of using the same key at both stages. Meanwhile, the token embedding parameters are shared between the generation and detection networks, which allows the detection network to achieve high accuracy very efficiently. Experiments demonstrate that our algorithm attains high detection accuracy and computational efficiency with a minimal number of network parameters. Subsequent analysis confirms the high complexity involved in forging the watermark from the detection network.
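The split between a generation network and a detection network that share only the token-embedding table can be sketched with a toy stand-in (this is not the paper's architecture: the real detector is a trained network so the generation parameters stay hidden, whereas the illustrative detector below simply re-scores tokens; all class and parameter names are hypothetical):

```python
import random

class SharedEmbedding:
    # Token-embedding table shared by the generation and detection networks;
    # this weight sharing is what lets the detector converge efficiently.
    def __init__(self, vocab_size: int, dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.table = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def lookup(self, token: int) -> list:
        return self.table[token]

class WatermarkGenerator:
    # Stand-in for the generation network: labels a candidate token
    # green (1) or red (0) from the embeddings of its context window.
    def __init__(self, emb: SharedEmbedding):
        self.emb = emb

    def bit(self, context: list, token: int) -> int:
        ctx = [sum(vals) for vals in zip(*(self.emb.lookup(t) for t in context))]
        vec = self.emb.lookup(token)
        return int(sum(c * v for c, v in zip(ctx, vec)) > 0.0)

class WatermarkDetector:
    # Stand-in for the detection network: re-scores tokens through the shared
    # embeddings and reports a z-statistic against the 50% chance rate.
    def __init__(self, gen: WatermarkGenerator, window: int = 2):
        self.gen, self.window = gen, window

    def z_score(self, tokens: list) -> float:
        hits = n = 0
        for i in range(self.window, len(tokens)):
            hits += self.gen.bit(tokens[i - self.window:i], tokens[i])
            n += 1
        return (hits - 0.5 * n) / (0.25 * n) ** 0.5
```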
MECCH: Metapath Context Convolution-based Heterogeneous Graph Neural Networks
Fu, Xinyu, King, Irwin
Heterogeneous graph neural networks (HGNNs) were proposed for representation learning on structured data with multiple types of nodes and edges. To deal with the performance degradation issue when HGNNs become deep, researchers combine metapaths into HGNNs to associate nodes closely related in semantics but far apart in the graph. However, existing metapath-based models suffer from either information loss or high computation costs. To address these problems, we present a novel Metapath Context Convolution-based Heterogeneous Graph Neural Network (MECCH). MECCH leverages metapath contexts, a new kind of graph structure that facilitates lossless node information aggregation while avoiding any redundancy. Specifically, MECCH applies three novel components after feature preprocessing to extract comprehensive information from the input graph efficiently: (1) metapath context construction, (2) metapath context encoder, and (3) convolutional metapath fusion. Experiments on five real-world heterogeneous graph datasets for node classification and link prediction show that MECCH achieves superior prediction accuracy compared with state-of-the-art baselines with improved computational efficiency.
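The idea of following a metapath to reach semantically related but distant nodes can be made concrete with a small sketch. The function below is a generic metapath-instance enumeration on a heterogeneous graph, not MECCH's own metapath-context construction, and all names are illustrative:

```python
def metapath_instances(edges, node_type, metapath, start):
    # edges: adjacency dict node -> list of neighbors.
    # node_type: dict node -> type label (e.g. 'A' for author, 'P' for paper).
    # Returns every path from `start` whose node types match `metapath`.
    if node_type[start] != metapath[0]:
        return []
    paths = [[start]]
    for t in metapath[1:]:
        # Extend each partial path by neighbors of the required type.
        paths = [p + [n] for p in paths for n in edges[p[-1]] if node_type[n] == t]
    return paths
```

The union of nodes on these instances forms a "context" for the start node; aggregating over it keeps the intermediate nodes that metapath-neighbor-only models discard.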
Alignment and Outer Shell Isotropy for Hyperbolic Graph Contrastive Learning
Zhang, Yifei, Zhu, Hao, Liu, Jiahong, Koniusz, Piotr, King, Irwin
Learning good self-supervised graph representations that benefit downstream tasks is challenging. Among a variety of methods, contrastive learning enjoys competitive performance. The embeddings produced by contrastive learning are arranged on a hypersphere, which enables cosine distance measurement in Euclidean space. However, the underlying structure of many domains, such as graphs, exhibits highly non-Euclidean latent geometry. To this end, we propose a novel contrastive learning framework to learn high-quality graph embeddings. Specifically, we design an alignment metric that effectively captures hierarchical data-invariant information, and we propose a substitute for the uniformity metric to prevent so-called dimensional collapse. We show that in hyperbolic space one has to address leaf- and height-level uniformity, which are related to properties of trees, whereas in the ambient space of the hyperbolic manifold these notions translate into imposing an isotropic ring density towards the boundary of the Poincar\'e ball. This ring density can be easily imposed by promoting an isotropic feature distribution on the tangent space of the manifold. In the experiments, we demonstrate the efficacy of our proposed method across different hyperbolic graph embedding techniques in both supervised and self-supervised learning settings.
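The tangent-space view can be illustrated with a small sketch: pull points on the Poincar\'e ball back to the tangent space at the origin via the logarithmic map, then compare per-dimension variances as a crude isotropy proxy. This is an illustrative measure under simplified assumptions, not the paper's exact uniformity objective:

```python
import math

def log_map_origin(x):
    # Logarithmic map at the origin of the Poincare ball: pulls a point on
    # the manifold back to the Euclidean tangent space at 0.
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return list(x)
    scale = math.atanh(norm) / norm
    return [scale * v for v in x]

def ring_isotropy(points):
    # Crude isotropy proxy: after mapping to the tangent space, compare
    # per-dimension variances; 1.0 means a perfectly isotropic spread,
    # values near 0 indicate collapse onto a few dimensions.
    tangent = [log_map_origin(p) for p in points]
    variances = []
    for dim in zip(*tangent):
        mean = sum(dim) / len(dim)
        variances.append(sum((v - mean) ** 2 for v in dim) / len(dim))
    return min(variances) / max(variances)
```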
Hyperbolic Graph Neural Networks: A Review of Methods and Applications
Yang, Menglin, Zhou, Min, Li, Zhihao, Liu, Jiahong, Pan, Lujia, Xiong, Hui, King, Irwin
Graphs are data structures that exist extensively in real-world complex systems, varying from social networks [15, 62], protein interaction networks [52], recommender systems [9, 65, 64], and knowledge graphs [56] to financial transaction systems [40]. They form the basis of innumerable systems owing to their widespread utilization, allowing relational knowledge about interacting entities to be stored and accessed rapidly. Consequently, graph-related learning tasks have gained increasing attention in machine learning and network science research. Many researchers have applied Graph Neural Networks (GNNs) to a variety of tasks, including node classification [23, 53, 59], link prediction [22, 71], and graph classification [61, 11], by embedding nodes in low-dimensional vector spaces that encode topological and semantic information simultaneously. Many GNNs are built in Euclidean space because it features a vectorial structure and closed-form distance and inner-product formulae, and because it is a natural extension of our intuitively appealing visual three-dimensional space [14]. Despite the effectiveness of Euclidean space for graph-related learning tasks, its ability to encode complex patterns is intrinsically limited by its polynomially expanding capacity. Although nonlinear techniques [3] help mitigate this issue, complex graph patterns may still require an embedding dimensionality that is computationally intractable. As revealed by recent research [4], many kinds of complex data exhibit a non-Euclidean underlying geometry. For example, tree-like structures exist extensively in many real-world networks, such as the hypernym structure in natural languages, the subordinate structure of entities in knowledge graphs, the organizational structure of financial fraud, and the power-law distribution in recommender systems.
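The appeal of hyperbolic space for such tree-like data comes from its geodesic distance, which grows rapidly toward the boundary of the Poincar\'e ball, so near-boundary points can host exponentially many well-separated leaves. A minimal sketch of the standard Poincar\'e-ball distance, d(x, y) = arcosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2))):

```python
import math

def poincare_distance(x, y):
    # Geodesic distance in the Poincare ball model of hyperbolic space;
    # inputs are points with Euclidean norm strictly less than 1.
    sq = lambda v: sum(c * c for c in v)
    diff = [a - b for a, b in zip(x, y)]
    return math.acosh(1.0 + 2.0 * sq(diff) / ((1.0 - sq(x)) * (1.0 - sq(y))))
```

For example, the distance from the origin to (r, 0) is 2 artanh(r) = log((1+r)/(1-r)), which blows up as r approaches 1; two points pushed toward opposite sides of the boundary are vastly farther apart than the same Euclidean configuration suggests.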
Momentum Contrastive Pre-training for Question Answering
Hu, Minda, Li, Muzhi, Wang, Yasheng, King, Irwin
Existing pre-training methods for extractive Question Answering (QA) generate cloze-like queries that differ from natural questions in syntactic structure, which could overfit pre-trained models to simple keyword matching. To address this problem, we propose a novel Momentum Contrastive pRe-training fOr queStion anSwering (MCROSS) method for extractive QA. Specifically, MCROSS introduces a momentum contrastive learning framework to align the answer probabilities between cloze-like and natural query-passage sample pairs. Hence, the pre-trained models can better transfer the knowledge learned from cloze-like samples to answering natural questions. Experimental results on three benchmark QA datasets show that our method achieves noticeable improvements over all baselines in both supervised and zero-shot scenarios.
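The momentum-contrastive machinery such methods build on can be sketched generically: a key encoder updated as an exponential moving average of the query encoder, plus an InfoNCE-style loss pulling a positive pair (here, a cloze-like and a natural query for the same passage would form one) together against negatives. Both functions below are generic MoCo-style components, not MCROSS's exact formulation:

```python
import math

def momentum_update(query_params, key_params, m=0.999):
    # EMA update of the momentum (key) encoder from the query encoder:
    # key <- m * key + (1 - m) * query, keeping the key encoder slowly moving.
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

def info_nce(query, positive, negatives, tau=0.1):
    # InfoNCE loss on raw vectors: the positive's similarity should dominate
    # the negatives' similarities at temperature tau.
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    logits = [dot(query, positive) / tau] + [dot(query, n) / tau for n in negatives]
    mx = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - mx) for l in logits]
    return -math.log(exps[0] / sum(exps))
```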
Large Language Models as Source Planner for Personalized Knowledge-grounded Dialogue
Wang, Hongru, Hu, Minda, Deng, Yang, Wang, Rui, Mi, Fei, Wang, Weichao, Wang, Yasheng, Kwan, Wai-Chung, King, Irwin, Wong, Kam-Fai
Open-domain dialogue systems usually require different sources of knowledge to generate more informative and evidence-grounded responses. However, existing knowledge-grounded dialogue systems either focus on a single knowledge source or overlook the dependencies between multiple sources of knowledge, which may result in inconsistent or even paradoxical responses. To incorporate multiple knowledge sources and the dependencies between them, we propose SAFARI, a novel framework that leverages the exceptional capabilities of large language models (LLMs) in planning, understanding, and incorporating knowledge under both supervised and unsupervised settings. Specifically, SAFARI decouples knowledge grounding over multiple sources from response generation, which allows easy extension to various knowledge sources, including the possibility of using no source at all. To study the problem, we construct a personalized knowledge-grounded dialogue dataset, \textit{\textbf{K}nowledge \textbf{B}ehind \textbf{P}ersona}~(\textbf{KBP}), which is the first to consider the dependency between persona and implicit knowledge. Experimental results on the KBP dataset demonstrate that the SAFARI framework can effectively produce persona-consistent and knowledge-enhanced responses.
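The plan-then-retrieve-then-generate decoupling can be sketched as a tiny pipeline. This is an illustrative stub only; the source names, the planner, and the retrievers are hypothetical placeholders, not SAFARI's actual components:

```python
def plan_retrieve_generate(context, planner, retrievers, generator):
    # 1) The planner decides which knowledge sources to use and in what
    #    order (possibly none at all, mirroring the "no source" case).
    sources = planner(context)
    # 2) Each source is retrieved in the planned order; later retrievers see
    #    earlier results, which is how inter-source dependency is modeled.
    knowledge = []
    for src in sources:
        knowledge.append(retrievers[src](context, knowledge))
    # 3) The generator conditions on the context plus all retrieved knowledge.
    return generator(context, knowledge)
```

For instance, a planner returning `["PERSONA", "DOCUMENT"]` lets the document retriever condition on the already-retrieved persona entry, so the final response stays consistent with both.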
ResNorm: Tackling Long-tailed Degree Distribution Issue in Graph Neural Networks via Normalization
Liang, Langzhang, Xu, Zenglin, Song, Zixing, King, Irwin, Qi, Yuan, Ye, Jieping
Graph Neural Networks (GNNs) have attracted much attention due to their ability to learn representations from graph-structured data. Despite the successful applications of GNNs in many domains, the optimization of GNNs is less well studied, and performance on node classification heavily suffers from the long-tailed node degree distribution. This paper focuses on improving the performance of GNNs via normalization. In detail, by studying the long-tailed distribution of node degrees in the graph, we propose a novel normalization method for GNNs termed ResNorm (\textbf{Res}haping the long-tailed distribution into a normal-like distribution via \textbf{norm}alization). The $scale$ operation of ResNorm reshapes the node-wise standard deviation (NStd) distribution so as to improve the accuracy of tail nodes (\textit{i}.\textit{e}., low-degree nodes). We provide a theoretical interpretation and empirical evidence for the mechanism behind this $scale$ operation. In addition to the long-tailed distribution issue, over-smoothing is also a fundamental problem plaguing the community. To this end, we analyze the behavior of the standard shift and prove that it serves as a preconditioner on the weight matrix, increasing the risk of over-smoothing. With the over-smoothing issue in mind, we design a $shift$ operation for ResNorm that simulates the degree-specific parameter strategy in a low-cost manner. Extensive experiments validate the effectiveness of ResNorm on several node classification benchmark datasets.
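The idea of reshaping the NStd distribution can be sketched with a simple power-law rescaling: multiplying each node's feature vector by its standard deviation raised to the power p - 1 maps that node's std sigma to sigma^p, compressing the spread of NStd values across nodes. The exponent and exact formulation here are illustrative, not necessarily the paper's:

```python
import math

def scale_node_std(features, p=0.5):
    # Illustrative ResNorm-style scale: for each node (row), compute the
    # feature-wise standard deviation sigma and rescale the row by
    # sigma ** (p - 1), so the rescaled row has std sigma ** p.
    scaled = []
    for h in features:
        mean = sum(h) / len(h)
        std = math.sqrt(sum((v - mean) ** 2 for v in h) / len(h))
        factor = std ** (p - 1.0) if std > 0.0 else 1.0
        scaled.append([v * factor for v in h])
    return scaled
```

With p in (0, 1), nodes with large NStd are shrunk and nodes with small NStd are amplified, pulling a long-tailed NStd distribution toward a more normal-like one.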
A Survey on Graph Neural Networks for Time Series: Forecasting, Classification, Imputation, and Anomaly Detection
Jin, Ming, Koh, Huan Yee, Wen, Qingsong, Zambon, Daniele, Alippi, Cesare, Webb, Geoffrey I., King, Irwin, Pan, Shirui
Time series are the primary data type used to record dynamic system measurements, and they are generated in great volume by both physical sensors and online processes (virtual sensors). Time series analytics is therefore crucial to unlocking the wealth of information implicit in available data. With recent advancements in graph neural networks (GNNs), there has been a surge in GNN-based approaches for time series analysis. These approaches can explicitly model inter-temporal and inter-variable relationships, which traditional and other deep neural network-based methods struggle to do. In this survey, we provide a comprehensive review of graph neural networks for time series analysis (GNN4TS), encompassing four fundamental dimensions: forecasting, classification, anomaly detection, and imputation. Our aim is to guide designers and practitioners in understanding, building applications with, and advancing research on GNN4TS. First, we provide a comprehensive task-oriented taxonomy of GNN4TS. Then, we present and discuss representative research works and introduce mainstream applications of GNN4TS. A comprehensive discussion of potential future research directions completes the survey. This survey, for the first time, brings together a vast array of knowledge on GNN-based time series research, highlighting foundations, practical applications, and opportunities of graph neural networks for time series analysis.
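A common first step in pipelines of this kind is to induce a variable-level graph from the series themselves so that a GNN can model inter-variable relationships, for example by thresholding pairwise Pearson correlation. A generic preprocessing sketch (the function names and the threshold are illustrative, not from any specific method in the survey):

```python
import math

def pearson(a, b):
    # Pearson correlation coefficient between two equal-length series.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def correlation_adjacency(series, threshold=0.8):
    # Build a symmetric variable-level adjacency matrix: connect two series
    # when the absolute correlation between them reaches the threshold.
    n = len(series)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(series[i], series[j])) >= threshold:
                adj[i][j] = adj[j][i] = 1
    return adj
```

The resulting adjacency matrix then feeds a standard GNN, with the temporal dimension handled by the model's sequence component.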