Li, Chenyu
RedChronos: A Large Language Model-Based Log Analysis System for Insider Threat Detection in Enterprises
Li, Chenyu, Zhu, Zhengjia, He, Jiyan, Zhang, Xiu
Internal threat detection (IDT) aims to address security threats within organizations or enterprises by identifying potential or already occurring malicious threats within vast amounts of logs. Although organizations or enterprises have dedicated personnel responsible for reviewing these logs, it is impossible to manually examine all logs entirely.In response to the vast number of logs, we propose a system called RedChronos, which is a Large Language Model-Based Log Analysis System. This system incorporates innovative improvements over previous research by employing Query-Aware Weighted Voting and a Semantic Expansion-based Genetic Algorithm with LLM-driven Mutations. On the public datasets CERT 4.2 and 5.2, RedChronos outperforms or matches existing approaches in terms of accuracy, precision, and detection rate. Moreover, RedChronos reduces the need for manual intervention in security log reviews by approximately 90% in the Xiaohongshu Security Operation Center. Therefore, our RedChronos system demonstrates exceptional performance in handling IDT tasks, providing innovative solutions for these challenges. We believe that future research can continue to enhance the system's performance in IDT tasks while also reducing the response time to internal risk events.
SeaMo: A Multi-Seasonal and Multimodal Remote Sensing Foundation Model
Li, Xuyang, Hong, Danfeng, Li, Chenyu, Chanussot, Jocelyn
Remote Sensing (RS) data contains a wealth of multi-dimensional information crucial for Earth observation. Owing to its vast volume, diverse sources, and temporal properties, RS data is highly suitable for the development of large Visual Foundation Models (VFMs). VFMs act as robust feature extractors, learning from extensive RS data, and are subsequently fine-tuned for deployment in various geoscientific tasks. However, current VFMs in the RS domain are predominantly pretrained and tailored exclusively for specific characteristics of RS imagery, neglecting the potential of utilizing the multi-dimensional properties of RS data. Therefore, in this work, we propose SeaMo, a pioneering visual foundation model that integrates multi-seasonal and multimodal information in the RS field. SeaMo is designed to harness multiple properties of RS data. Within the masked image modeling framework, we employ non-aligned cropping techniques to extract spatial properties, use multi-source inputs for multimodal integration, and incorporate temporal-multimodal fusion blocks for effective assimilation of multi-seasonal data. SeaMo explicitly models the multi-dimensional properties of RS data, making the model more comprehensive, robust, and versatile. We applied SeaMo to several downstream geoscience tasks, which demonstrated exceptional performance. Extensive ablation studies were conducted to validate the model's superiority.
Airway Labeling Meets Clinical Applications: Reflecting Topology Consistency and Outliers via Learnable Attentions
Li, Chenyu, Zhang, Minghui, Zhang, Chuyan, Gu, Yun
Accurate airway anatomical labeling is crucial for clinicians to identify and navigate complex bronchial structures during bronchoscopy. Automatic airway anatomical labeling is challenging due to significant individual variability and anatomical variations. Previous methods are prone to generate inconsistent predictions, which is harmful for preoperative planning and intraoperative navigation. This paper aims to address these challenges by proposing a novel method that enhances topological consistency and improves the detection of abnormal airway branches. We propose a novel approach incorporating two modules: the Soft Subtree Consistency (SSC) and the Abnormal Branch Saliency (ABS). The SSC module constructs a soft subtree to capture clinically relevant topological relationships, allowing for flexible feature aggregation within and across subtrees. The ABS module facilitates the interaction between node features and prototypes to distinguish abnormal branches, preventing the erroneous aggregation of features between normal and abnormal nodes. Evaluated on a challenging dataset characterized by severe airway distortion and atrophy, our method achieves superior performance compared to state-of-the-art approaches. Specifically, it attains a 91.4% accuracy at the segmental level and an 83.7% accuracy at the subsegmental level, representing a 1.4% increase in subsegmental accuracy and a 3.1% increase in topological consistency. Notably, the method demonstrates reliable performance in cases with disease-induced airway deformities, ensuring consistent and accurate labeling.
Unifying Unsupervised Graph-Level Anomaly Detection and Out-of-Distribution Detection: A Benchmark
Wang, Yili, Liu, Yixin, Shen, Xu, Li, Chenyu, Ding, Kaize, Miao, Rui, Wang, Ying, Pan, Shirui, Wang, Xin
To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Though those two lines of research indeed share the same objective, they have been studied independently in the community due to distinct evaluation setups, creating a gap that hinders the application and evaluation of methods from one to the other. To bridge the gap, in this work, we present a Unified Benchmark for unsupervised Graph-level OOD and anomaly Detection (our method), a comprehensive evaluation framework that unifies GLAD and GLOD under the concept of generalized graph-level OOD detection. Our benchmark encompasses 35 datasets spanning four practical anomaly and OOD detection scenarios, facilitating the comparison of 16 representative GLAD/GLOD methods. We conduct multi-dimensional analyses to explore the effectiveness, generalizability, robustness, and efficiency of existing methods, shedding light on their strengths and limitations. Furthermore, we provide an open-source codebase (https://github.com/UB-GOLD/UB-GOLD) of our method to foster reproducible research and outline potential directions for future investigations based on our insights.
Masked Face Recognition with Generative-to-Discriminative Representations
Ge, Shiming, Guo, Weijia, Li, Chenyu, Zhang, Junzheng, Li, Yong, Zeng, Dan
Masked face recognition is important for social good but challenged by diverse occlusions that cause insufficient or inaccurate representations. In this work, we propose a unified deep network to learn generative-to-discriminative representations for facilitating masked face recognition. To this end, we split the network into three modules and learn them on synthetic masked faces in a greedy module-wise pretraining manner. First, we leverage a generative encoder pretrained for face inpainting and finetune it to represent masked faces into category-aware descriptors. Attribute to the generative encoder's ability in recovering context information, the resulting descriptors can provide occlusion-robust representations for masked faces, mitigating the effect of diverse masks. Then, we incorporate a multi-layer convolutional network as a discriminative reformer and learn it to convert the category-aware descriptors into identity-aware vectors, where the learning is effectively supervised by distilling relation knowledge from off-the-shelf face recognition model. In this way, the discriminative reformer together with the generative encoder serves as the pretrained backbone, providing general and discriminative representations towards masked faces. Finally, we cascade one fully-connected layer following by one softmax layer into a feature classifier and finetune it to identify the reformed identity-aware vectors. Extensive experiments on synthetic and realistic datasets demonstrate the effectiveness of our approach in recognizing masked faces.
Low-Rank Representations Meets Deep Unfolding: A Generalized and Interpretable Network for Hyperspectral Anomaly Detection
Li, Chenyu, Zhang, Bing, Hong, Danfeng, Yao, Jing, Chanussot, Jocelyn
Current hyperspectral anomaly detection (HAD) benchmark datasets suffer from low resolution, simple background, and small size of the detection data. These factors also limit the performance of the well-known low-rank representation (LRR) models in terms of robustness on the separation of background and target features and the reliance on manual parameter selection. To this end, we build a new set of HAD benchmark datasets for improving the robustness of the HAD algorithm in complex scenarios, AIR-HAD for short. Accordingly, we propose a generalized and interpretable HAD network by deeply unfolding a dictionary-learnable LLR model, named LRR-Net$^+$, which is capable of spectrally decoupling the background structure and object properties in a more generalized fashion and eliminating the bias introduced by vital interference targets concurrently. In addition, LRR-Net$^+$ integrates the solution process of the Alternating Direction Method of Multipliers (ADMM) optimizer with the deep network, guiding its search process and imparting a level of interpretability to parameter optimization. Additionally, the integration of physical models with DL techniques eliminates the need for manual parameter tuning. The manually tuned parameters are seamlessly transformed into trainable parameters for deep neural networks, facilitating a more efficient and automated optimization process. Extensive experiments conducted on the AIR-HAD dataset show the superiority of our LRR-Net$^+$ in terms of detection performance and generalization ability, compared to top-performing rivals. Furthermore, the compilable codes and our AIR-HAD benchmark datasets in this paper will be made available freely and openly at \url{https://sites.google.com/view/danfeng-hong}.
Timer: Transformers for Time Series Analysis at Scale
Liu, Yong, Zhang, Haoran, Li, Chenyu, Huang, Xiangdong, Wang, Jianmin, Long, Mingsheng
Deep learning has contributed remarkably to the advancement of time series analysis. Still, deep models can encounter performance bottlenecks in real-world small-sample scenarios, which can be concealed due to the performance saturation with small models on current benchmarks. Meanwhile, large models have demonstrated great powers in these scenarios through large-scale pre-training. Continuous progresses have been achieved as the emergence of large language models, exhibiting unprecedented ability in few-shot generalization, scalability, and task generality, which is however absent in time series models. To change the current practices of training small models on specific datasets from scratch, this paper aims at an early development of large time series models (LTSM). During pre-training, we curate large-scale datasets with up to 1 billion time points, unify heterogeneous time series into single-series sequence (S3) format, and develop the GPT-style architecture toward LTSMs. To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task. The outcome of this study is a Time Series Transformer (Timer), that is pre-trained by autoregressive next token prediction on large multi-domain datasets, and is fine-tuned to downstream scenarios with promising abilities as an LTSM.
Experimenting with an Evaluation Framework for Imbalanced Data Learning (EFIDL)
Li, Chenyu, Jiang, Xia
Introduction Data imbalance is one of the crucial issues in big data analysis with fewer labels. For example, in real-world healthcare data, spam detection labels, and financial fraud detection datasets. Many data balance methods were introduced to improve machine learning algorithms' performance. Research claims SMOTE and SMOTE-based data-augmentation (generate new data points) methods could improve algorithm performance. However, we found in many online tutorials, the valuation methods were applied based on synthesized datasets that introduced bias into the evaluation, and the performance got a false improvement. In this study, we proposed, a new evaluation framework for imbalanced data learning methods. We have experimented on five data balance methods and whether the performance of algorithms will improve or not. Methods We collected 8 imbalanced healthcare datasets with different imbalanced rates from different domains. Applied 6 data augmentation methods with 11 machine learning methods testing if the data augmentation will help with improving machine learning performance. We compared the traditional data augmentation evaluation methods with our proposed cross-validation evaluation framework Results Using traditional data augmentation evaluation meta hods will give a false impression of improving the performance. However, our proposed evaluation method shows data augmentation has limited ability to improve the results. Conclusion EFIDL is more suitable for evaluating the prediction performance of an ML method when data are augmented. Using an unsuitable evaluation framework will give false results. Future researchers should consider the evaluation framework we proposed when dealing with augmented datasets. Our experiments showed data augmentation does not help improve ML prediction performance.