Ding, Kaize
Unifying Unsupervised Graph-Level Anomaly Detection and Out-of-Distribution Detection: A Benchmark
Wang, Yili, Liu, Yixin, Shen, Xu, Li, Chenyu, Ding, Kaize, Miao, Rui, Wang, Ying, Pan, Shirui, Wang, Xin
To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Though those two lines of research indeed share the same objective, they have been studied independently in the community due to distinct evaluation setups, creating a gap that hinders the application and evaluation of methods from one to the other. To bridge the gap, in this work, we present a Unified Benchmark for unsupervised Graph-level OOD and anomaly Detection (our method), a comprehensive evaluation framework that unifies GLAD and GLOD under the concept of generalized graph-level OOD detection. Our benchmark encompasses 35 datasets spanning four practical anomaly and OOD detection scenarios, facilitating the comparison of 16 representative GLAD/GLOD methods. We conduct multi-dimensional analyses to explore the effectiveness, generalizability, robustness, and efficiency of existing methods, shedding light on their strengths and limitations. Furthermore, we provide an open-source codebase (https://github.com/UB-GOLD/UB-GOLD) of our method to foster reproducible research and outline potential directions for future investigations based on our insights.
TSI-Bench: Benchmarking Time Series Imputation
Du, Wenjie, Wang, Jun, Qian, Linglong, Yang, Yiyuan, Liu, Fanxing, Wang, Zepu, Ibrahim, Zina, Liu, Haoxin, Zhao, Zhiyuan, Zhou, Yingjie, Wang, Wenjia, Ding, Kaize, Liang, Yuxuan, Prakash, B. Aditya, Wen, Qingsong
Effective imputation is a crucial preprocessing step for time series analysis. Despite the development of numerous deep learning algorithms for time series imputation, the community lacks standardized and comprehensive benchmark platforms to effectively evaluate imputation performance across different settings. Moreover, although many deep learning forecasting algorithms have demonstrated excellent performance, whether their modeling achievements can be transferred to time series imputation tasks remains unexplored. To bridge these gaps, we develop TSI-Bench, the first (to our knowledge) comprehensive benchmark suite for time series imputation utilizing deep learning techniques. The TSI-Bench pipeline standardizes experimental settings to enable fair evaluation of imputation algorithms and identification of meaningful insights into the influence of domain-appropriate missingness ratios and patterns on model performance. Furthermore, TSI-Bench innovatively provides a systematic paradigm to tailor time series forecasting algorithms for imputation purposes. Our extensive study across 34,804 experiments, 28 algorithms, and 8 datasets with diverse missingness scenarios demonstrates TSI-Bench's effectiveness in diverse downstream tasks and potential to unlock future directions in time series imputation research and analysis.
Avoiding Copyright Infringement via Machine Unlearning
Dou, Guangyao, Liu, Zheyuan, Lyu, Qing, Ding, Kaize, Wong, Eric
This scenario involves unlearning specific books over time, followed by subsequent Large Language Models (LLMs) (Brown et al., unlearning requests. An effective algorithm 2020; Chowdhery et al., 2023; Touvron et al., 2023) should be stable, meaning it should ensure unlearning have made significant progress through pre-training efficacy--removing unwanted knowledge effectively--while on extensive transformer-based architectures and maintaining locality, preserving learning from diverse text data (Ouyang et al., 2022; non-targeted knowledge and the model's reasoning Kojima et al., 2022; Qin et al., 2023; Lewkowycz ability. Few works have studied this setting, et al., 2022; Roziere et al., 2023; Lyu et al., 2023; leaving it unclear if existing methods are suitable.
Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?
Jin, Mingyu, Yu, Qinkai, Huang, Jingyuan, Zeng, Qingcheng, Wang, Zhenting, Hua, Wenyue, Zhao, Haiyan, Mei, Kai, Meng, Yanda, Ding, Kaize, Yang, Fan, Du, Mengnan, Zhang, Yongfeng
Large language models (LLMs) have shown remarkable performances across a wide range of tasks. However, the mechanisms by which these models encode tasks of varying complexities remain poorly understood. In this paper, we explore the hypothesis that LLMs process concepts of varying complexities in different layers, introducing the idea of "Concept Depth" to suggest that more complex concepts are typically acquired in deeper layers. Specifically, we categorize concepts based on their level of abstraction, defining them in the order of increasing complexity within factual, emotional, and inferential tasks. We conduct extensive probing experiments using layer-wise representations across various LLM families (Gemma, LLaMA, QWen) on various datasets spanning the three domains of tasks. Our findings reveal that models could efficiently conduct probing for simpler tasks in shallow layers, and more complex tasks typically necessitate deeper layers for accurate understanding. Additionally, we examine how external factors, such as adding noise to the input and quantizing the model weights, might affect layer-wise representations. Our findings suggest that these factors can impede the development of a conceptual understanding of LLMs until deeper layers are explored. We hope that our proposed concept and experimental insights will enhance the understanding of the mechanisms underlying LLMs. Our codes are available at https://github.com/Luckfort/CD.
Empowering Large Language Models for Textual Data Augmentation
Li, Yichuan, Ding, Kaize, Wang, Jianling, Lee, Kyumin
With the capabilities of understanding and executing natural language instructions, Large language models (LLMs) can potentially act as a powerful tool for textual data augmentation. However, the quality of augmented data depends heavily on the augmentation instructions provided, and the effectiveness can fluctuate across different downstream tasks. While manually crafting and selecting instructions can offer some improvement, this approach faces scalability and consistency issues in practice due to the diversity of downstream tasks. In this work, we address these limitations by proposing a new solution, which can automatically generate a large pool of augmentation instructions and select the most suitable task-informed instructions, thereby empowering LLMs to create high-quality augmented data for different downstream tasks. Empirically, the proposed approach consistently generates augmented data with better quality compared to non-LLM and LLM-based data augmentation methods, leading to the best performance on 26 few-shot learning tasks sourced from a wide range of application domains.
Multitask Active Learning for Graph Anomaly Detection
Chang, Wenjing, Liu, Kay, Ding, Kaize, Yu, Philip S., Yu, Jianjun
In the web era, graph machine learning has been widely used on ubiquitous graph-structured data. As a pivotal component for bolstering web security and enhancing the robustness of graph-based applications, the significance of graph anomaly detection is continually increasing. While Graph Neural Networks (GNNs) have demonstrated efficacy in supervised and semi-supervised graph anomaly detection, their performance is contingent upon the availability of sufficient ground truth labels. The labor-intensive nature of identifying anomalies from complex graph structures poses a significant challenge in real-world applications. Despite that, the indirect supervision signals from other tasks (e.g., node classification) are relatively abundant. In this paper, we propose a novel MultItask acTIve Graph Anomaly deTEction framework, namely MITIGATE. Firstly, by coupling node classification tasks, MITIGATE obtains the capability to detect out-of-distribution nodes without known anomalies. Secondly, MITIGATE quantifies the informativeness of nodes by the confidence difference across tasks, allowing samples with conflicting predictions to provide informative yet not excessively challenging information for subsequent training. Finally, to enhance the likelihood of selecting representative nodes that are distant from known patterns, MITIGATE adopts a masked aggregation mechanism for distance measurement, considering both inherent features of nodes and current labeled status. Empirical studies on four datasets demonstrate that MITIGATE significantly outperforms the state-of-the-art methods for anomaly detection. Our code is publicly available at: https://github.com/AhaChang/MITIGATE.
STERLING: Synergistic Representation Learning on Bipartite Graphs
Jing, Baoyu, Yan, Yuchen, Ding, Kaize, Park, Chanyoung, Zhu, Yada, Liu, Huan, Tong, Hanghang
A fundamental challenge of bipartite graph representation learning is how to extract informative node embeddings. Self-Supervised Learning (SSL) is a promising paradigm to address this challenge. Most recent bipartite graph SSL methods are based on contrastive learning which learns embeddings by discriminating positive and negative node pairs. Contrastive learning usually requires a large number of negative node pairs, which could lead to computational burden and semantic errors. In this paper, we introduce a novel synergistic representation learning model (STERLING) to learn node embeddings without negative node pairs. STERLING preserves the unique local and global synergies in bipartite graphs. The local synergies are captured by maximizing the similarity of the inter-type and intra-type positive node pairs, and the global synergies are captured by maximizing the mutual information of co-clusters. Theoretical analysis demonstrates that STERLING could improve the connectivity between different node types in the embedding space. Extensive empirical evaluation on various benchmark datasets and tasks demonstrates the effectiveness of STERLING for extracting node embeddings.
GRENADE: Graph-Centric Language Model for Self-Supervised Representation Learning on Text-Attributed Graphs
Li, Yichuan, Ding, Kaize, Lee, Kyumin
Self-supervised representation learning on text-attributed graphs, which aims to create expressive and generalizable representations for various downstream tasks, has received increasing research attention lately. However, existing methods either struggle to capture the full extent of structural context information or rely on task-specific training labels, which largely hampers their effectiveness and generalizability in practice. To solve the problem of self-supervised representation learning on text-attributed graphs, we develop a novel Graph-Centric Language model -- GRENADE. Specifically, GRENADE exploits the synergistic effect of both pre-trained language model and graph neural network by optimizing with two specialized self-supervised learning algorithms: graph-centric contrastive learning and graph-centric knowledge alignment. The proposed graph-centric self-supervised learning algorithms effectively help GRENADE to capture informative textual semantics as well as structural context information on text-attributed graphs. Through extensive experiments, GRENADE shows its superiority over state-of-the-art methods. Implementation is available at \url{https://github.com/bigheiniu/GRENADE}.
Keypoint-Augmented Self-Supervised Learning for Medical Image Segmentation with Limited Annotation
Yang, Zhangsihao, Ren, Mengwei, Ding, Kaize, Gerig, Guido, Wang, Yalin
Pretraining CNN models (i.e., UNet) through self-supervision has become a powerful approach to facilitate medical image segmentation under low annotation regimes. Recent contrastive learning methods encourage similar global representations when the same image undergoes different transformations, or enforce invariance across different image/patch features that are intrinsically correlated. However, CNN-extracted global and local features are limited in capturing long-range spatial dependencies that are essential in biological anatomy. To this end, we present a keypoint-augmented fusion layer that extracts representations preserving both short- and long-range self-attention. In particular, we augment the CNN feature map at multiple scales by incorporating an additional input that learns long-range spatial self-attention among localized keypoint features. Further, we introduce both global and local self-supervised pretraining for the framework. At the global scale, we obtain global representations from both the bottleneck of the UNet, and by aggregating multiscale keypoint features. These global features are subsequently regularized through image-level contrastive objectives. At the local scale, we define a distance-based criterion to first establish correspondences among keypoints and encourage similarity between their features. Through extensive experiments on both MRI and CT segmentation tasks, we demonstrate the architectural advantages of our proposed method in comparison to both CNN and Transformer-based UNets, when all architectures are trained with randomly initialized weights. With our proposed pretraining strategy, our method further outperforms existing SSL methods by producing more robust self-attention and achieving state-of-the-art segmentation results. The code is available at https://github.com/zshyang/kaf.git.
UPREVE: An End-to-End Causal Discovery Benchmarking System
Unni, Suraj Jyothi, Sheth, Paras, Ding, Kaize, Liu, Huan, Candan, K. Selcuk
Discovering causal relationships in complex socio-behavioral systems is challenging but essential for informed decision-making. We present Upload, PREprocess, Visualize, and Evaluate (UPREVE), a user-friendly web-based graphical user interface (GUI) designed to simplify the process of causal discovery. UPREVE allows users to run multiple algorithms simultaneously, visualize causal relationships, and evaluate the accuracy of learned causal graphs. With its accessible interface and customizable features, UPREVE empowers researchers and practitioners in social computing and behavioral-cultural modeling (among others) to explore and understand causal relationships effectively. Our proposed solution aims to make causal discovery more accessible and user-friendly, enabling users to gain valuable insights for better decision-making.