Wu, Jiaxiang
Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting
Tan, Mingkui, Chen, Guohao, Wu, Jiaxiang, Zhang, Yifan, Chen, Yaofo, Zhao, Peilin, Niu, Shuaicheng
Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and testing data by adapting a given model w.r.t. any test sample. This task is particularly important when the test environment changes frequently. Although some recent attempts have been made to handle this task, we still face two key challenges: 1) prior methods have to perform backpropagation for each test sample, resulting in unbearable optimization costs for many applications; 2) while existing TTA solutions can significantly improve the test performance on out-of-distribution data, they often suffer from severe performance degradation on in-distribution data after TTA (known as catastrophic forgetting). To this end, we propose an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method, which develops an active sample selection criterion to identify reliable and non-redundant samples for test-time entropy minimization. To alleviate forgetting, EATA introduces a Fisher regularizer estimated from test samples to constrain important model parameters from drastic changes. However, in EATA, the adopted entropy loss consistently assigns higher confidence to predictions even when the underlying samples are uncertain, leading to overconfident predictions that underestimate the data uncertainty. To tackle this, we further propose EATA with Calibration (EATA-C) to separately exploit the reducible model uncertainty and the inherent data uncertainty for calibrated TTA. Specifically, we compare the divergence between predictions from the full network and its sub-networks to measure the reducible model uncertainty, based on which we propose a test-time uncertainty reduction strategy with a divergence minimization loss that encourages consistent predictions instead of overconfident ones. To further re-calibrate the predicted confidence on different samples, we utilize the disagreement among predicted labels as an indicator of the data uncertainty. Based on this, we devise a min-max entropy regularization to selectively increase and decrease the predicted confidence for confidence re-calibration. Note that EATA-C and EATA differ in the adaptation objective, while EATA-C still benefits from the active sample selection criterion and anti-forgetting Fisher regularization proposed in EATA. Extensive experiments on image classification and semantic segmentation verify the effectiveness of our proposed methods.
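As a rough illustration of the adaptation recipe summarized above, the following PyTorch sketch performs entropy minimization on selected low-entropy test samples with a Fisher-based anti-forgetting penalty. The function names, the `e_margin` and `fisher_weight` values, and the omission of both the redundancy filter and EATA-C's calibration losses are simplifying assumptions, not the released implementation.

```python
import torch

def entropy(logits):
    probs = logits.softmax(dim=1)
    return -(probs * probs.log()).sum(dim=1)

def eata_step(model, optimizer, x, fisher, anchor_params,
              e_margin=0.4, fisher_weight=2000.0):
    logits = model(x)
    ent = entropy(logits)
    # Active sample selection: adapt only on reliable (low-entropy) samples.
    threshold = e_margin * torch.log(torch.tensor(float(logits.size(1))))
    mask = ent < threshold
    if mask.sum() == 0:
        return logits.detach()
    loss = ent[mask].mean()
    # Anti-forgetting Fisher regularizer: penalize drift of important parameters.
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + fisher_weight * (fisher[name] * (p - anchor_params[name]) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.detach()
```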
A Synchronized Layer-by-layer Growing Approach for Plausible Neuronal Morphology Generation
Yang, Nianzu, Zeng, Kaipeng, Lu, Haotian, Wu, Yexin, Yuan, Zexin, Jiang, Shengdian, Wu, Jiaxiang, Wang, Yimin, Yan, Junchi
Neuronal morphology is essential for studying brain functioning and understanding neurodegenerative disorders. As acquiring real-world morphology data is expensive, computational approaches, especially learning-based ones such as MorphVAE, have recently been studied for morphology generation; they are often conducted by randomly augmenting a given authentic morphology to achieve plausibility. Under such a setting, this paper proposes \textbf{MorphGrower}, which aims to generate more plausible morphology samples by mimicking the natural growth mechanism instead of the one-shot treatment adopted in MorphVAE. Specifically, MorphGrower generates morphologies layer by layer synchronously, taking a pair of sibling branches as the basic generation block; the generation of each layer is conditioned on the morphological structure of previous layers and performed via a conditional variational autoencoder with a spherical latent space. Extensive experimental results on four real-world datasets demonstrate that MorphGrower outperforms MorphVAE by a notable margin. Our code will be publicly available to facilitate future research.
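To make the layer-by-layer growth scheme concrete, here is a toy sketch of the generation loop under heavy simplifications: the decoder, the crude conditioning on previous layers, and the Gaussian latent (the paper uses a spherical latent space) are placeholders rather than MorphGrower's actual architecture.

```python
import torch
import torch.nn as nn

class PairDecoder(nn.Module):
    """Toy stand-in for the conditional decoder: maps (condition, latent) to a pair
    of sibling branches, each represented as a short 3-D coordinate sequence."""
    def __init__(self, cond_dim=16, latent_dim=8, points=4):
        super().__init__()
        self.points = points
        self.net = nn.Linear(cond_dim + latent_dim, 2 * points * 3)

    def forward(self, cond, z):
        out = self.net(torch.cat([cond, z], dim=-1)).view(2, self.points, 3)
        return out[0], out[1]

def grow(decoder, num_layers=3, cond_dim=16, latent_dim=8):
    layers = [[torch.zeros(4, 3)]]                      # layer 0: a root branch at the soma
    for _ in range(1, num_layers):
        new_layer = []
        for parent in layers[-1]:
            # Crude conditioning on previously generated structure (placeholder).
            cond = parent.mean(dim=0).repeat(cond_dim // 3 + 1)[:cond_dim]
            z = torch.randn(latent_dim)                 # Gaussian here; spherical in the paper
            left, right = decoder(cond, z)
            new_layer += [left, right]
        layers.append(new_layer)
    return layers

morphology = grow(PairDecoder())
```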
PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection
Yang, Tao, Shi, Tianyuan, Wan, Fanqi, Quan, Xiaojun, Wang, Qifan, Wu, Bingzhe, Wu, Jiaxiang
Recent advances in large language models (LLMs), such as ChatGPT, have showcased remarkable zero-shot performance across various NLP tasks. However, the potential of LLMs in personality detection, which involves identifying an individual's personality from their written texts, remains largely unexplored. Drawing inspiration from psychological questionnaires, which are carefully designed by psychologists to evaluate individual personality traits through a series of targeted items, we argue that these items can be regarded as a collection of well-structured chain-of-thought (CoT) processes. By incorporating these processes, LLMs can enhance their capabilities to make more reasonable inferences on personality from textual input. In light of this, we propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner. In particular, we employ an LLM as an AI assistant with a specialization in text analysis. We prompt the assistant to rate individual items at each turn and leverage the historical rating results to derive a conclusive personality preference. Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection, achieving average F1 score improvements of 4.23 and 10.63 points on two benchmark datasets compared to the standard prompting method. Our code is available at https://github.com/TaoYang225/PsyCoT.
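The multi-turn questionnaire protocol can be sketched as a small prompting loop. Here `chat` stands for any function that maps a message list to a reply string (e.g. a wrapper around a chat-completion API); the item wording, rating scale, and final-decision prompt are illustrative rather than the exact PsyCoT prompts.

```python
def psycot(chat, text, items, trait="Extraversion"):
    """chat: callable mapping a list of {'role', 'content'} messages to a reply string."""
    messages = [{"role": "system",
                 "content": "You are an AI assistant specializing in text analysis."}]
    ratings = []
    for item in items:
        messages.append({"role": "user",
                         "content": (f'Text: "{text}"\nRate the author on the item '
                                     f'"{item}" from 1 (disagree) to 5 (agree). '
                                     'Reply with a single number.')})
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        ratings.append(reply)
    messages.append({"role": "user",
                     "content": (f"Given your item ratings {ratings}, is the author "
                                 f"high or low on {trait}? Answer 'high' or 'low'.")})
    return chat(messages)
```

In practice a parser for the numeric replies and a per-trait questionnaire (e.g. Big Five items) would replace these placeholders.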
Multi-scale Attention Flow for Probabilistic Time Series Forecasting
Feng, Shibo, Miao, Chunyan, Xu, Ke, Wu, Jiaxiang, Wu, Pengcheng, Zhang, Yang, Zhao, Peilin
Probabilistic prediction of multivariate time series is a notoriously challenging but practical task. On the one hand, the challenge is how to effectively capture the cross-series correlations between interacting time series to achieve accurate distribution modeling. On the other hand, we should consider how to capture the contextual information within a time series more accurately to model its multivariate temporal dynamics. In this work, we propose a novel non-autoregressive deep learning model, called Multi-scale Attention Normalizing Flow (MANF), which integrates multi-scale attention with relative position information and represents the multivariate data distribution with a conditional normalizing flow. Additionally, compared with autoregressive modeling methods, our model avoids the influence of cumulative error and does not increase the time complexity. Extensive experiments demonstrate that our model achieves state-of-the-art performance on many popular multivariate datasets.
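As a minimal illustration of the conditional-flow component, the sketch below implements one affine coupling layer whose scale and shift are functions of a context vector; in MANF that context would come from multi-scale attention over past observations, whereas the architecture, dimensions, and training snippet here are assumptions for illustration only.

```python
import math
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    def __init__(self, dim, cond_dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x, cond):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(torch.cat([x1, cond], dim=-1)).chunk(2, dim=-1)
        s = torch.tanh(s)                        # bound the log-scales for stability
        z2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)                  # log |det Jacobian| of the transform
        return torch.cat([x1, z2], dim=-1), log_det

# Train by maximizing the conditional log-likelihood under a standard normal base.
layer = ConditionalCoupling(dim=8, cond_dim=16)
x, cond = torch.randn(32, 8), torch.randn(32, 16)
z, log_det = layer(x, cond)
log_prob = (-0.5 * z.pow(2) - 0.5 * math.log(2 * math.pi)).sum(dim=-1) + log_det
```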
RPTQ: Reorder-based Post-training Quantization for Large Language Models
Yuan, Zhihang, Niu, Lin, Liu, Jiawei, Liu, Wenyu, Wang, Xinggang, Shang, Yuzhang, Sun, Guangyu, Wu, Qiang, Wu, Jiaxiang, Wu, Bingzhe
Large-scale language models (LLMs) have demonstrated impressive performance, but their deployment presents challenges due to their significant memory usage. This issue can be alleviated through quantization. In this paper, we identify that the challenge in quantizing activations in LLMs arises from varying ranges across channels, rather than solely the presence of outliers. To address this challenge, we introduce a quantization method called RPTQ, which utilizes a reorder-based approach. By rearranging the channels and quantizing them in clusters, RPTQ effectively mitigates the impact of range differences between channels. To minimize the overhead of the reorder operation, we fuse it into the layer norm operation and weights in linear layers. In our experiments, RPTQ achieved a significant breakthrough by utilizing 3-bit activation in LLMs for the first time, resulting in a substantial reduction in memory usage. For instance, quantizing OPT-175b can lead to a memory consumption reduction of up to 80%.
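The reorder-and-cluster idea can be illustrated independently of any particular model: channels are sorted by activation range, grouped, and quantized with per-group parameters so that extreme channels no longer dictate a single global scale. The NumPy sketch below uses a crude 1-D grouping and asymmetric uniform quantization as assumptions; it is not the paper's fused, cluster-based implementation.

```python
import numpy as np

def cluster_quantize(acts, num_clusters=4, n_bits=3):
    """acts: float array of shape (tokens, channels)."""
    width = acts.max(axis=0) - acts.min(axis=0)        # per-channel activation range
    order = np.argsort(width)                          # reorder channels by range
    clusters = np.array_split(order, num_clusters)     # contiguous groups after reorder
    q = np.empty_like(acts)
    for idx in clusters:
        lo, hi = acts[:, idx].min(), acts[:, idx].max()
        scale = max(hi - lo, 1e-8) / (2 ** n_bits - 1)
        q[:, idx] = np.round((acts[:, idx] - lo) / scale) * scale + lo
    return q
```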
Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance
Yuan, Zhihang, Liu, Jiawei, Wu, Jiaxiang, Yang, Dawei, Wu, Qiang, Sun, Guangyu, Liu, Wenyu, Wang, Xinggang, Wu, Bingzhe
Post-training quantization (PTQ) is a popular method for compressing deep neural networks (DNNs) without modifying their original architecture or training procedures. Despite its effectiveness and convenience, the reliability of PTQ methods in the presence of extreme cases such as distribution shift and data noise remains largely unexplored. This paper first investigates this problem on various commonly-used PTQ methods. We aim to answer several research questions related to the influence of calibration set distribution variations, calibration paradigm selection, and data augmentation or sampling strategies on PTQ reliability. A systematic evaluation process is conducted across a wide range of tasks and commonly-used PTQ paradigms. The results show that most existing PTQ methods are not reliable enough in terms of worst-case group performance, highlighting the need for more robust methods. Our findings provide insights for developing PTQ methods that can effectively handle distribution shift scenarios and enable the deployment of quantized DNNs in real-world applications.
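The worst-case group metric used in this kind of study reduces to computing per-group accuracy and reporting the minimum alongside the mean, as in the short sketch below (the group definitions, e.g. corruption types or subpopulations, are task-specific assumptions).

```python
import numpy as np

def worst_case_accuracy(preds, labels, groups):
    """preds, labels, groups: 1-D arrays of equal length; groups tags each sample."""
    accs = {g: float((preds[groups == g] == labels[groups == g]).mean())
            for g in np.unique(groups)}
    # Report worst-group accuracy, mean group accuracy, and the per-group breakdown.
    return min(accs.values()), float(np.mean(list(accs.values()))), accs
```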
Towards Stable Test-Time Adaptation in Dynamic Wild World
Niu, Shuaicheng, Wu, Jiaxiang, Zhang, Yifan, Wen, Zhiquan, Chen, Yaofo, Zhao, Peilin, Tan, Mingkui
Test-time adaptation (TTA) has been shown to be effective at tackling distribution shifts between training and testing data by adapting a given model on test samples. However, the online model updating of TTA may be unstable, and this is often a key obstacle preventing existing TTA methods from being deployed in the real world. Specifically, TTA may fail to improve or may even harm the model performance when the test data have: 1) mixed distribution shifts, 2) small batch sizes, and 3) online imbalanced label distribution shifts, which are quite common in practice. In this paper, we investigate the reasons for this instability and find that the batch norm layer is a crucial factor hindering TTA stability. Conversely, TTA can perform more stably with batch-agnostic norm layers, \ie, group or layer norm. However, we observe that TTA with group and layer norms does not always succeed and still suffers many failure cases. By digging into the failure cases, we find that certain noisy test samples with large gradients may disturb the model adaptation and result in collapsed trivial solutions, \ie, assigning the same class label to all samples. To address the above collapse issue, we propose a sharpness-aware and reliable entropy minimization method, called SAR, which further stabilizes TTA from two aspects: 1) removing some of the noisy samples with large gradients, and 2) encouraging model weights to go to a flat minimum so that the model is robust to the remaining noisy samples. Promising results demonstrate that SAR performs more stably than prior methods and is computationally efficient under the above wild test scenarios.
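A compact sketch of the two ingredients follows, with the function names, thresholds, and single-mask simplification as assumptions rather than the released SAR code: high-entropy samples are filtered out, and a SAM-style two-pass update drives the remaining entropy loss toward a flat minimum.

```python
import torch

def entropy(logits):
    p = logits.softmax(dim=1)
    return -(p * p.log()).sum(dim=1)

def sar_step(model, optimizer, x, e_margin=0.4, rho=0.05):
    logits = model(x)
    ent = entropy(logits)
    # Reliable adaptation: drop high-entropy (likely noisy, large-gradient) samples.
    mask = ent < e_margin * torch.log(torch.tensor(float(logits.size(1))))
    if mask.sum() == 0:
        return logits.detach()
    optimizer.zero_grad()
    ent[mask].mean().backward()
    # Sharpness-aware ascent: perturb weights toward the local worst case.
    eps = {}
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
        for p in model.parameters():
            if p.grad is not None:
                eps[p] = rho * p.grad / grad_norm
                p.add_(eps[p])
    optimizer.zero_grad()
    # Descent: gradient of the entropy at the perturbed weights ...
    entropy(model(x))[mask].mean().backward()
    # ... applied after restoring the original weights.
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)
    optimizer.step()
    return logits.detach()
```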
DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations
Ji, Yuanfeng, Zhang, Lu, Wu, Jiaxiang, Wu, Bingzhe, Huang, Long-Kai, Xu, Tingyang, Rong, Yu, Li, Lanqing, Ren, Jie, Xue, Ding, Lai, Houtim, Xu, Shaoyong, Feng, Jing, Liu, Wei, Luo, Ping, Zhou, Shuigeng, Huang, Junzhou, Zhao, Peilin, Bian, Yatao
AI-aided drug discovery (AIDD) is gaining increasing popularity due to its promise of making the search for new pharmaceuticals quicker, cheaper and more efficient. In spite of its extensive use in many fields, such as ADMET prediction, virtual screening, protein folding and generative chemistry, little has been explored in terms of the out-of-distribution (OOD) learning problem with \emph{noise}, which is inevitable in real-world AIDD applications. In this work, we present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery, which comes with an open-source Python package that fully automates the data curation and OOD benchmarking processes. We focus on one of the most crucial problems in AIDD: drug-target binding affinity prediction, which involves both macromolecules (protein targets) and small molecules (drug compounds). In contrast to only providing fixed datasets, DrugOOD offers an automated dataset curator with user-friendly customization scripts, rich domain annotations aligned with biochemistry knowledge, realistic noise annotations and rigorous benchmarking of state-of-the-art OOD algorithms. Since molecular data are often modeled as irregular graphs with graph neural network (GNN) backbones, DrugOOD also serves as a valuable testbed for \emph{graph OOD learning} problems. Extensive empirical studies show a significant performance gap between in-distribution and out-of-distribution experiments, which highlights the need to develop better schemes that allow for OOD generalization under noise in AIDD.
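The kind of domain-based split such a curator automates can be illustrated generically: records are grouped by a domain annotation (e.g. assay or scaffold) and whole domains are held out so that the test distribution differs from training. The sketch below is a generic illustration with assumed record fields, not the DrugOOD package API.

```python
from collections import defaultdict

def domain_split(records, domain_key="assay_id", holdout_frac=0.2):
    """records: list of dicts, each carrying a domain annotation under `domain_key`."""
    by_domain = defaultdict(list)
    for r in records:
        by_domain[r[domain_key]].append(r)
    # Hold out whole (here: the smallest) domains as the OOD test set.
    domains = sorted(by_domain, key=lambda d: len(by_domain[d]))
    n_test = max(1, int(holdout_frac * len(domains)))
    test_domains = set(domains[:n_test])
    train = [r for d, rs in by_domain.items() if d not in test_domains for r in rs]
    test = [r for d in test_domains for r in by_domain[d]]
    return train, test
```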
Federated Face Recognition
Bai, Fan, Wu, Jiaxiang, Shen, Pengcheng, Li, Shaoxin, Zhou, Shuigeng
Face recognition has been extensively studied in the computer vision and artificial intelligence communities in recent years. An important issue in face recognition is data privacy, which is receiving more and more public concern. As a common privacy-preserving technique, Federated Learning has been proposed to train a model cooperatively without sharing data between parties. However, as far as we know, it has not been successfully applied to face recognition. This paper proposes a framework named FedFace to bring federated learning to face recognition. Specifically, FedFace relies on two major innovative algorithms, Partially Federated Momentum (PFM) and Federated Validation (FV). PFM locally applies an estimated equivalent global momentum to approximate centralized momentum SGD efficiently. FV repeatedly searches for better federated aggregation weightings by testing the aggregated models on private validation datasets, which can improve the model's generalization ability. An ablation study and extensive experiments validate the effectiveness of FedFace and show that its performance is comparable to, or even better than, the centralized baseline.
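The Federated Validation idea can be sketched generically: candidate aggregation weightings are tried, each aggregated model is scored on validation data, and the best weighting is kept. The candidate set, the simple weighted averaging of state dicts, and the `evaluate` callback below are assumptions for illustration, not the paper's exact FV procedure (and PFM is omitted).

```python
import copy

def aggregate(state_dicts, weights):
    """Weighted average of client state dicts (weights assumed to sum to 1)."""
    agg = copy.deepcopy(state_dicts[0])
    for key in agg:
        agg[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return agg

def federated_validation(model, client_states, candidate_weightings, evaluate):
    """evaluate(model) -> score on a (private) validation set; higher is better."""
    best_score, best_state = float("-inf"), None
    for weights in candidate_weightings:
        model.load_state_dict(aggregate(client_states, weights))
        score = evaluate(model)
        if score > best_score:
            best_score, best_state = score, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return best_score
```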
An Efficient Approach to Informative Feature Extraction from Multimodal Data
Wang, Lichen, Wu, Jiaxiang, Huang, Shao-Lun, Zheng, Lizhong, Xu, Xiangxiang, Zhang, Lin, Huang, Junzhou
One primary focus in multimodal feature extraction is to find representations of individual modalities that are maximally correlated. As a well-known measure of dependence, the Hirschfeld-Gebelein-R\'{e}nyi (HGR) maximal correlation becomes an appealing objective because of its operational meaning and desirable properties. However, the strict whitening constraints formalized in the HGR maximal correlation limit its application. To address this problem, this paper proposes Soft-HGR, a novel framework to extract informative features from multiple data modalities. Specifically, our framework avoids the "hard" whitening constraints while preserving the same feature geometry as in the HGR maximal correlation. The objective of Soft-HGR is straightforward, involving only two inner products, which guarantees efficiency and stability in optimization. We further generalize the framework to handle more than two modalities as well as missing modalities. When labels are partially available, we enhance the discriminative power of the feature representations through a semi-supervised adaptation. Empirical evaluation shows that our approach learns more informative feature mappings and is more efficient to optimize.
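Reading the "two inner products" literally, the objective can be sketched as an expected inner product between zero-mean features of the two modalities minus half the trace of the product of their feature covariances; the empirical estimators and the negation into a minimizable loss below are our reading of the description, not the authors' exact code.

```python
import torch

def soft_hgr_loss(f, g):
    """f, g: (batch, dim) features from the two modalities; returns a loss to minimize."""
    f = f - f.mean(dim=0, keepdim=True)
    g = g - g.mean(dim=0, keepdim=True)
    n = f.size(0)
    inner = (f * g).sum(dim=1).mean()               # E[f(X)^T g(Y)]
    cov_f = f.t() @ f / (n - 1)
    cov_g = g.t() @ g / (n - 1)
    penalty = 0.5 * torch.trace(cov_f @ cov_g)      # soft substitute for hard whitening
    return -(inner - penalty)                       # negate: maximize the Soft-HGR objective
```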