Cui, Peng
Deep Ensembles Meets Quantile Regression: Uncertainty-aware Imputation for Time Series
Liu, Ying, Cui, Peng, Hu, Wenbo, Hong, Richang
Multivariate time series are ubiquitous, yet real-world time series data often contain numerous missing values, giving rise to the time series imputation task. Although previous deep learning methods have proven effective for time series imputation, they tend to produce overconfident imputations, a potentially overlooked threat to the reliability of intelligent systems. The score-based diffusion method CSDI is effective for time series imputation but computationally expensive due to the nature of the generative diffusion framework. In this paper, we propose a non-generative time series imputation method that produces accurate imputations with inherent uncertainty while remaining computationally efficient. Specifically, we incorporate deep ensembles into quantile regression with a shared model backbone and a series of quantile discrimination functions. This framework combines the accurate uncertainty estimation of deep ensembles with the merits of quantile regression, and, above all, the shared backbone eliminates most of the computational overhead of maintaining multiple ensemble members. We evaluate the proposed method on two real-world datasets, an air-quality dataset and a health-care dataset, and conduct extensive experiments showing that our method excels at both deterministic and probabilistic prediction. Compared with the score-based diffusion method CSDI, we obtain comparable results and perform better when more data is missing. Furthermore, as a non-generative model, the proposed method incurs a much smaller computational overhead than CSDI, yielding much faster training and fewer model parameters.
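A minimal sketch of the kind of architecture the abstract describes, assuming a shared backbone with one lightweight head per target quantile trained with the pinball (quantile) loss; all class and function names here are hypothetical, not the authors' released code.

```python
# Sketch: shared backbone + per-quantile heads; the spread between the
# quantile outputs provides an uncertainty interval around each imputation.
import torch
import torch.nn as nn

class SharedBackboneQuantileNet(nn.Module):  # hypothetical name
    def __init__(self, in_dim, hidden_dim, quantiles=(0.1, 0.5, 0.9)):
        super().__init__()
        self.quantiles = quantiles
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One small head per quantile; sharing the backbone is what removes
        # most of the overhead of training a full ensemble of separate models.
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in quantiles])

    def forward(self, x):
        h = self.backbone(x)
        return torch.cat([head(h) for head in self.heads], dim=-1)  # (batch, n_quantiles)

def pinball_loss(pred, target, quantiles):
    """Average quantile (pinball) loss over all heads."""
    losses = []
    for i, q in enumerate(quantiles):
        err = target - pred[..., i]
        losses.append(torch.max(q * err, (q - 1) * err).mean())
    return torch.stack(losses).mean()
```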
Investigating Uncertainty Calibration of Aligned Language Models under the Multiple-Choice Setting
He, Guande, Cui, Peng, Chen, Jianfei, Hu, Wenbo, Zhu, Jun
Despite the significant progress made in practical applications of aligned language models (LMs), they tend to be overconfident in their output answers compared to the corresponding pre-trained LMs. In this work, we systematically evaluate the impact of the alignment process on logit-based uncertainty calibration of LMs under the multiple-choice setting. We first conduct a thorough empirical study of how aligned LMs differ in calibration from their pre-trained counterparts. Experimental results reveal that there are two distinct uncertainties in LMs under the multiple-choice setting, which are responsible for the answer decision and the format preference of the LMs, respectively. Then, we investigate the role of these two uncertainties in aligned LMs' calibration through fine-tuning in simple synthetic alignment schemes and conclude that one reason for aligned LMs' overconfidence is the conflation of these two types of uncertainty. Furthermore, we examine the utility of common post-hoc calibration methods for aligned LMs and propose an easy-to-implement and sample-efficient method to calibrate aligned LMs. We hope our findings can provide insights into the design of more reliable alignment processes for LMs.
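For concreteness, a sketch of one of the standard post-hoc baselines referred to above, temperature scaling applied to the logits of the candidate answer options in the multiple-choice setting; this is not the paper's proposed calibration method, and the function names are illustrative only.

```python
# Sketch: fit a single temperature on a held-out calibration set by
# minimizing the negative log-likelihood of the correct answer option.
import numpy as np
from scipy.optimize import minimize_scalar

def choice_probs(option_logits, temperature=1.0):
    """Softmax over the logits of the candidate answer options (A/B/C/D...)."""
    z = option_logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def fit_temperature(option_logits, labels):
    """option_logits: (n, n_options); labels: (n,) index of the correct option."""
    def nll(t):
        p = choice_probs(option_logits, temperature=t)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x
```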
Geometry-Calibrated DRO: Combating Over-Pessimism with Free Energy Implications
Liu, Jiashuo, Wu, Jiayun, Wang, Tianyu, Zou, Hao, Li, Bo, Cui, Peng
Machine learning algorithms that minimize average risk are susceptible to distributional shifts. Distributionally Robust Optimization (DRO) addresses this issue by optimizing the worst-case risk within an uncertainty set. However, DRO suffers from over-pessimism, leading to low-confidence predictions, poor parameter estimation, and poor generalization. In this work, we conduct a theoretical analysis of a probable root cause of over-pessimism: excessive focus on noisy samples. To alleviate the impact of noise, we incorporate data geometry into the calibration terms of DRO, resulting in our novel Geometry-Calibrated DRO (GCDRO) for regression. We establish the connection between our risk objective and the Helmholtz free energy in statistical physics, and this free-energy-based risk can extend to standard DRO methods. Leveraging gradient flow in Wasserstein space, we develop an approximate minimax optimization algorithm with a bounded error ratio and elucidate how our approach mitigates the effects of noisy samples. Comprehensive experiments confirm GCDRO's superiority over conventional DRO methods.
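As background for the free-energy connection mentioned above, one standard identity (the Gibbs variational formula) shows how a log-partition, i.e. Helmholtz-free-energy-type, risk arises from KL-regularized DRO over the empirical distribution; this is a generic identity, not the paper's geometry-calibrated objective.

```latex
% KL-regularized worst-case risk has a closed-form free-energy expression:
\[
\sup_{Q} \Big\{ \mathbb{E}_{Q}\big[\ell(\theta; Z)\big]
      - \tfrac{1}{\beta}\, \mathrm{KL}\!\left(Q \,\|\, \hat{P}_n\right) \Big\}
  \;=\;
  \frac{1}{\beta} \log \mathbb{E}_{\hat{P}_n}\!\left[ e^{\beta\, \ell(\theta; Z)} \right].
\]
```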
Towards Accelerated Model Training via Bayesian Data Selection
Deng, Zhijie, Cui, Peng, Zhu, Jun
Mislabeled, duplicated, or biased data in real-world scenarios can lead to prolonged training and even hinder model convergence. Traditional solutions that prioritize easy or hard samples lack the flexibility to handle such varied data simultaneously. Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. However, its practical adoption relies on less principled approximations and additional holdout data. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models. The resulting algorithm is efficient and easy to implement. We perform extensive empirical studies on challenging benchmarks with considerable data noise and imbalance in the online batch selection scenario, and observe superior training efficiency over competitive baselines. Notably, on the challenging WebVision benchmark, our method can achieve similar predictive performance with significantly fewer training iterations than leading data selection methods.
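A rough sketch of online batch selection in the spirit described above, assuming a generalization-loss-style score in which a zero-shot predictor stands in for the holdout model; this is not the paper's exact Bayesian scoring rule, and the function name is hypothetical.

```python
# Sketch: score each candidate by current-model loss minus zero-shot loss and
# keep only the top-k examples of each large batch for the gradient update.
import torch
import torch.nn.functional as F

def select_topk(model, zero_shot_logits, x, y, k):
    """Return indices of the k most 'worth learning' examples in the batch."""
    with torch.no_grad():
        model_loss = F.cross_entropy(model(x), y, reduction="none")
        zs_loss = F.cross_entropy(zero_shot_logits, y, reduction="none")
        # High model loss but low zero-shot loss: learnable, likely not noise.
        scores = model_loss - zs_loss
    return scores.topk(k).indices
```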
Learning Sample Difficulty from Pre-trained Models for Reliable Prediction
Cui, Peng, Zhang, Dan, Deng, Zhijie, Dong, Yinpeng, Zhu, Jun
Large-scale pre-trained models have achieved remarkable success in many applications, but how to leverage them to improve the prediction reliability of downstream models remains under-explored. Moreover, modern neural networks have been found to be poorly calibrated and to make overconfident predictions regardless of inherent sample difficulty and data uncertainty. To address this issue, we propose to utilize large-scale pre-trained models to guide downstream model training with sample-difficulty-aware entropy regularization. Pre-trained models that have been exposed to large-scale datasets and do not overfit the downstream training classes enable us to measure each training sample's difficulty via feature-space Gaussian modeling and relative Mahalanobis distance computation. Importantly, by adaptively penalizing overconfident predictions based on sample difficulty, we simultaneously improve accuracy and uncertainty calibration across challenging benchmarks (e.g., +0.55% ACC and -3.7% ECE on ImageNet1k using ResNet34), consistently surpassing competitive baselines for reliable prediction. The improved uncertainty estimates further benefit selective classification (abstaining from erroneous predictions) and out-of-distribution detection.
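A minimal sketch of the relative-Mahalanobis-distance difficulty measure in the feature space of a frozen pre-trained model, assuming per-class Gaussians with a shared covariance for the class term and a single class-agnostic Gaussian for the background term; exact modeling details in the paper may differ.

```python
# Sketch: a larger relative Mahalanobis distance marks a harder/more
# ambiguous training sample, which then receives stronger entropy-style
# penalization of overconfident predictions.
import numpy as np

def fit_gaussians(feats, labels):
    classes = np.unique(labels)
    means = {c: feats[labels == c].mean(0) for c in classes}
    centered = np.concatenate([feats[labels == c] - means[c] for c in classes])
    shared_cov_inv = np.linalg.pinv(np.cov(centered, rowvar=False))
    bg_mean = feats.mean(0)
    bg_cov_inv = np.linalg.pinv(np.cov(feats, rowvar=False))
    return means, shared_cov_inv, bg_mean, bg_cov_inv

def relative_mahalanobis(f, y, means, cov_inv, bg_mean, bg_cov_inv):
    d_cls = (f - means[y]) @ cov_inv @ (f - means[y])
    d_bg = (f - bg_mean) @ bg_cov_inv @ (f - bg_mean)
    return d_cls - d_bg  # larger => harder sample
```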
Heterogeneous Multi-Task Gaussian Cox Processes
Zhou, Feng, Kong, Quyu, Deng, Zhijie, He, Fengxiang, Cui, Peng, Zhu, Jun
Inhomogeneous Poisson process data defined on a continuous spatio-temporal domain have recently attracted immense attention in a wide variety of applications, including reliability analysis in manufacturing systems (Soleimani et al., 2017), event capture in sensing regions (Mutny and Krause, 2021), crime prediction in urban areas (Shirota and Gelfand, 2017), and disease diagnosis based on medical records (Lasko, 2014). The reliable training of an inhomogeneous Poisson process model critically relies on a large amount of data to avoid overfitting, especially when modeling high-dimensional point processes. However, one challenge is that the available training data are routinely sparse or even partially missing in specific applications. Taking manufacturing failure and healthcare analysis as motivating examples: modern manufacturing machines are reliable and rarely fail, and individuals with a healthy constitution seldom visit the hospital. Missing-data problems also arise, e.g., event-location capture in sensing systems is intermittent because of weather or other related barriers.
Competing for Shareable Arms in Multi-Player Multi-Armed Bandits
Xu, Renzhe, Wang, Haotian, Zhang, Xingxuan, Li, Bo, Cui, Peng
Competition for shareable and limited resources has long been studied with strategic agents. In reality, agents often have to learn about and maximize the rewards of the resources at the same time. To design an individualized competing policy, we model the competition between agents in a novel multi-player multi-armed bandit (MPMAB) setting where players are selfish and aim to maximize their own rewards. In addition, when several players pull the same arm, we assume that these players share the arm's reward equally in expectation. Under this setting, we first analyze the Nash equilibrium when the arms' rewards are known. Subsequently, we propose a novel Selfish MPMAB with Averaging Allocation (SMAA) approach based on the equilibrium. We theoretically demonstrate that SMAA achieves a good regret guarantee for each player when all players follow the algorithm. Additionally, we establish that no single selfish player can significantly increase their rewards through deviation, nor can they detrimentally affect other players' rewards without incurring substantial losses themselves. Finally, we validate the effectiveness of the method in extensive synthetic experiments.
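A toy sketch of the reward-sharing mechanic in the MPMAB setting described above (not the SMAA algorithm itself), assuming Bernoulli arms whose realized reward is split evenly among all players pulling the same arm in a round, so each player's share equals the arm mean divided by the number of co-pullers in expectation.

```python
# Sketch: one round of the averaging-allocation reward model.
import numpy as np

def play_round(arm_means, choices, rng):
    """choices[j] is the arm pulled by player j; returns each player's reward."""
    rewards = np.zeros(len(choices))
    for arm in set(choices):
        players = [j for j, a in enumerate(choices) if a == arm]
        realized = rng.binomial(1, arm_means[arm])  # Bernoulli arm reward
        rewards[players] = realized / len(players)  # shared equally
    return rewards

rng = np.random.default_rng(0)
print(play_round(arm_means=[0.9, 0.5, 0.2], choices=[0, 0, 1], rng=rng))
```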
Towards Out-Of-Distribution Generalization: A Survey
Liu, Jiashuo, Shen, Zheyan, He, Yue, Zhang, Xingxuan, Xu, Renzhe, Yu, Han, Cui, Peng
Traditional machine learning paradigms are based on the assumption that both training and test data follow the same statistical pattern, which is mathematically referred to as Independent and Identically Distributed ($i.i.d.$). However, in real-world applications, this $i.i.d.$ assumption often fails to hold due to unforeseen distributional shifts, leading to considerable degradation in model performance upon deployment. This observed discrepancy indicates the significance of investigating the Out-of-Distribution (OOD) generalization problem. OOD generalization is an emerging topic of machine learning research that focuses on complex scenarios wherein the distributions of the test data differ from those of the training data. This paper represents the first comprehensive, systematic review of OOD generalization, encompassing a spectrum of aspects from problem definition, methodological development, and evaluation procedures, to the implications and future directions of the field. Our discussion begins with a precise, formal characterization of the OOD generalization problem. Following that, we categorize existing methodologies into three segments: unsupervised representation learning, supervised model learning, and optimization, according to their positions within the overarching learning process. We provide an in-depth discussion on representative methodologies for each category, further elucidating the theoretical links between them. Subsequently, we outline the prevailing benchmark datasets employed in OOD generalization studies. To conclude, we overview the existing body of work in this domain and suggest potential avenues for future research on OOD generalization. A summary of the OOD generalization methodologies surveyed in this paper can be accessed at http://out-of-distribution-generalization.com.
On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets
Liu, Jiashuo, Wang, Tianyu, Cui, Peng, Namkoong, Hongseok
The performance of predictive models has been observed to degrade under distribution shifts in a wide range of applications, such as healthcare [8, 68, 56, 67], economics [28, 18], education [5], vision [55, 47, 64, 70], and language [46, 6]. Distribution shifts vary in type and are typically defined as either a change in the marginal distribution of the covariates (X-shifts) or a change in the conditional relationship between the outcome and the covariates (Y|X-shifts). Real-world scenarios comprise both types of shifts. In computer vision [46, 37, 60, 30, 72], Y|X-shifts are less likely since Y is constructed from human labels given an input X. Due to the prevalence of X-shifts, the implicit goal of many researchers is to develop a single robust model that can generalize effectively across multiple domains, akin to humans. For tabular data, Y|X-shifts may arise because of missing variables and hidden confounders. For example, the prevalence of diseases among patients may be affected by covariates that are not recorded in medical datasets but vary among individuals, such as lifestyle factors (e.g., diet, exercise, smoking status) and socioeconomic status [31, 74, 67]. Under Y|X-shifts, there may be a fundamental trade-off between learning algorithms: to perform well on a target distribution, a model may necessarily have to perform worse on others. Algorithmically, typical methods for addressing Y|X-shifts include distributionally robust optimization (DRO) [11, 63, 21, 59, 20] and causal learning methods [54, 7, 62, 36].
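A tiny synthetic illustration of the two shift types just defined, under an assumed linear data-generating process: an X-shift moves only the covariate distribution, while a Y|X-shift changes the conditional relationship (here via a coefficient playing the role of an unrecorded confounder).

```python
# Sketch: generate a source domain, an X-shifted domain, and a Y|X-shifted domain.
import numpy as np

rng = np.random.default_rng(0)

def sample(n, x_mean=0.0, beta=1.0):
    x = rng.normal(x_mean, 1.0, n)           # covariate X
    y = beta * x + rng.normal(0.0, 0.1, n)   # outcome Y
    return x, y

x_src, y_src = sample(1000)                  # source domain
x_cov, y_cov = sample(1000, x_mean=2.0)      # X-shift: P(X) moves, Y|X unchanged
x_con, y_con = sample(1000, beta=0.5)        # Y|X-shift: same P(X), new Y|X
```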
Inductive Meta-path Learning for Schema-complex Heterogeneous Information Networks
Liu, Shixuan, Fan, Changjun, Cheng, Kewei, Wang, Yunfei, Cui, Peng, Sun, Yizhou, Liu, Zhong
Heterogeneous Information Networks (HINs) are information networks with multiple types of nodes and edges. The concept of a meta-path, i.e., a sequence of entity types and relation types connecting two entities, has been proposed to provide meta-level, explainable semantics for various HIN tasks. Traditionally, meta-paths have primarily been used for schema-simple HINs, e.g., bibliographic networks with only a few entity types, where meta-paths are often enumerated with domain knowledge. However, the adoption of meta-paths for schema-complex HINs, such as knowledge bases (KBs) with hundreds of entity and relation types, has been limited due to the computational complexity associated with meta-path enumeration. Additionally, effectively assessing meta-paths requires enumerating relevant path instances, which adds further complexity to the meta-path learning process. To address these challenges, we propose SchemaWalk, an inductive meta-path learning framework for schema-complex HINs. We represent meta-paths with schema-level representations to support learning the scores of meta-paths for varying relations, mitigating the need for exhaustive path-instance enumeration for each relation. Further, we design a reinforcement learning-based path-finding agent, which directly navigates the network schema (i.e., the schema graph) to learn policies for establishing meta-paths with high coverage and confidence for multiple relations. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed paradigm.
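A highly simplified sketch of what navigating a schema graph for meta-path candidates can look like, using random walks over a toy bibliographic schema; SchemaWalk itself learns an RL policy with schema-level scoring rather than sampling uniformly, and all names below are illustrative.

```python
# Sketch: enumerate candidate meta-paths between two entity types by walking
# the schema graph (entity types connected by relation types), not the data graph.
import random

# schema graph: entity type -> list of (relation type, next entity type)
schema = {
    "Author": [("writes", "Paper")],
    "Paper": [("published_in", "Venue"), ("cites", "Paper")],
    "Venue": [],
}

def sample_meta_paths(start, target, n_walks=100, max_len=4, seed=0):
    rng = random.Random(seed)
    found = set()
    for _ in range(n_walks):
        path, node = [start], start
        for _ in range(max_len):
            if not schema[node]:
                break
            rel, nxt = rng.choice(schema[node])
            path += [rel, nxt]
            node = nxt
            if node == target:
                found.add(tuple(path))
                break
    return found

print(sample_meta_paths("Author", "Venue"))
```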