Tang, Weijing
A Versatile Influence Function for Data Attribution with Non-Decomposable Loss
Deng, Junwei, Tang, Weijing, Ma, Jiaqi W.
The influence function, a technique rooted in robust statistics, has been adapted in modern machine learning for a novel application: data attribution -- quantifying how individual training data points affect a model's predictions. However, the common derivation of influence functions in the data attribution literature is limited to loss functions that decompose into a sum of individual data point losses, with the most prominent examples known as M-estimators. This restricts the application of influence functions to more complex learning objectives, which we refer to as non-decomposable losses, such as contrastive or ranking losses, where a unit loss term depends on multiple data points and cannot be decomposed further. In this work, we bridge this gap by revisiting the general formulation of the influence function from robust statistics, which extends beyond M-estimators. Based on this formulation, we propose a novel method, the Versatile Influence Function (VIF), that can be straightforwardly applied to machine learning models trained with any non-decomposable loss. In comparison to the classical approach in statistics, the proposed VIF is designed to fully leverage the power of auto-differentiation, thereby eliminating the need for case-specific derivations of each loss function. We demonstrate the effectiveness of VIF across three examples: Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank for information retrieval. In all cases, the influence estimated by VIF closely resembles the results obtained by brute-force leave-one-out retraining, while being up to $10^3$ times faster to compute. We believe VIF represents a significant advancement in data attribution, enabling efficient influence-function-based attribution across a wide range of machine learning paradigms, with broad potential for practical use cases.
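For intuition, the following is a minimal sketch of the classical influence function for a decomposable loss (the M-estimator setting that VIF generalizes), computed entirely with auto-differentiation. All names, the tiny logistic-regression setup, and the regularization strength are illustrative assumptions, not the paper's implementation.

```python
# Classical influence function for a decomposable loss, as a reference point
# for what VIF generalizes. Influence of up-weighting training point i on a
# query loss f: IF_i = -grad(f)^T H^{-1} grad(l_i), where H is the Hessian
# of the total training loss at the fitted parameters.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(20, 3)
y = (X[:, 0] > 0).float()

def total_loss(theta):
    # L2-regularized logistic regression; regularization keeps H invertible
    logits = X @ theta
    return F.binary_cross_entropy_with_logits(logits, y, reduction="sum") \
        + 0.1 * theta.pow(2).sum()

# Fit theta (a stand-in for the trained model) with LBFGS.
theta = torch.zeros(3, requires_grad=True)
opt = torch.optim.LBFGS([theta], max_iter=100)
def closure():
    opt.zero_grad()
    loss = total_loss(theta)
    loss.backward()
    return loss
opt.step(closure)
theta_hat = theta.detach()

H = torch.autograd.functional.hessian(total_loss, theta_hat)

def point_loss(t, i):
    # Loss of a single data point; point 0 serves as the query below.
    return F.binary_cross_entropy_with_logits(X[i:i + 1] @ t, y[i:i + 1])

tq = theta_hat.clone().requires_grad_(True)
g_query = torch.autograd.grad(point_loss(tq, 0), tq)[0]
ihvp = torch.linalg.solve(H, g_query)            # H^{-1} grad(f)

for i in range(3):                               # influence of a few points
    ti = theta_hat.clone().requires_grad_(True)
    g_i = torch.autograd.grad(point_loss(ti, i), ti)[0]
    print(i, float(-g_i @ ihvp))
```

The printed values approximate the change in the query loss caused by up-weighting each training point; brute-force leave-one-out retraining would check them directly.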
Minimax Regret Learning for Data with Heterogeneous Subgroups
Mo, Weibin, Tang, Weijing, Xue, Songkai, Liu, Yufeng, Zhu, Ji
Modern complex datasets often consist of various sub-populations. To develop robust and generalizable methods in the presence of sub-population heterogeneity, it is important to guarantee uniform learning performance rather than average performance. In many applications, prior information is often available on which sub-population or group each data point belongs to. Given the observed groups of data, we develop a min-max-regret (MMR) learning framework for general supervised learning, which aims to minimize the worst-group regret. Motivated by the regret-based decision-theoretic framework, the proposed MMR is distinguished from the value-based and risk-based robust learning methods in the existing literature. The regret criterion features several robustness and invariance properties simultaneously. In terms of generalizability, we develop a theoretical guarantee on the worst-case regret over a super-population of the meta data, which incorporates the observed sub-populations, their mixtures, as well as other unseen sub-populations that can be approximated by the observed ones. We demonstrate the effectiveness of our method through extensive simulation studies and an application to kidney transplantation data from hundreds of transplant centers.
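As a toy illustration of the worst-group-regret objective (a hedged sketch under simplified assumptions, not the paper's estimator): each group's regret is its risk under the shared model minus the risk of its own best model, and the maximum regret is reduced by plain subgradient descent.

```python
# Toy min-max-regret for linear regression across heterogeneous groups.
# The group construction, step size, and subgradient scheme are assumptions
# made for this sketch only.
import numpy as np

rng = np.random.default_rng(0)
groups = []
for shift in (0.0, 0.5, 2.0):                    # three sub-populations
    X = rng.normal(size=(200, 2))
    y = X @ (np.array([1.0, -1.0]) + shift) + rng.normal(scale=0.5, size=200)
    groups.append((X, y))

def risk(beta, X, y):
    return np.mean((X @ beta - y) ** 2)

# Each group's oracle risk: the risk of its own least-squares fit.
oracle = [risk(np.linalg.lstsq(X, y, rcond=None)[0], X, y) for X, y in groups]

beta = np.zeros(2)
for _ in range(500):
    regrets = [risk(beta, X, y) - r for (X, y), r in zip(groups, oracle)]
    g = int(np.argmax(regrets))                  # currently worst-off group
    Xg, yg = groups[g]
    beta -= 0.05 * 2 * Xg.T @ (Xg @ beta - yg) / len(yg)  # subgradient step
print("beta:", beta, "worst regret:", max(regrets))
```

Minimizing worst-group *risk* instead would favor the intrinsically noisiest group; the regret criterion nets out each group's best achievable risk first, which is the invariance property the abstract refers to.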
KL-divergence Based Deep Learning for Discrete Time Model
Liu, Li, Fang, Xiangeng, Wang, Di, Tang, Weijing, He, Kevin
Deep neural networks are modern machine learning models that have been exploited in survival analysis. Although previous works have demonstrated improvements, training an accurate deep learning model requires a huge amount of data, which may not be available in practice. To address this challenge, we develop a Kullback-Leibler-based (KL) deep learning procedure that integrates external survival prediction models with newly collected time-to-event data. Time-dependent KL discrimination information is utilized to measure the discrepancy between the external and internal data. To the best of our knowledge, this is the first work to use prior information to address the limited-data problem in deep learning for survival analysis. Simulation and real-data results show that the proposed model achieves better performance and higher robustness than previous works.
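To make the idea concrete, here is a minimal sketch of one plausible instantiation (assumptions throughout: the network architecture, the per-bin Bernoulli KL penalty, and the weight lam are illustrative, not the authors' code). A discrete-time hazard network is trained on internal data while a KL term pulls its per-bin hazards toward an external model's predictions.

```python
# Discrete-time neural hazard model with a KL penalty toward an external
# survival model. Hazards h_t are per-bin conditional event probabilities.
import torch
import torch.nn.functional as F

T = 10                                           # number of discrete time bins
net = torch.nn.Sequential(torch.nn.Linear(5, 16), torch.nn.ReLU(),
                          torch.nn.Linear(16, T))

def nll_discrete(hazard, time, event):
    # Event at bin t: prod_{s<t}(1-h_s) * h_t; censored at t: prod_{s<=t}(1-h_s)
    idx = torch.arange(hazard.shape[1]).unsqueeze(0)       # (1, T)
    before = (idx < time.unsqueeze(1)).float()             # bins survived
    ll = (before * torch.log1p(-hazard)).sum(1)
    h_t = hazard[torch.arange(len(time)), time]            # hazard at observed bin
    ll = ll + event * torch.log(h_t) + (1 - event) * torch.log1p(-h_t)
    return -ll.mean()

def bernoulli_kl(p, q):
    # KL(Bern(p) || Bern(q)), applied per subject and time bin
    return p * (torch.log(p) - torch.log(q)) \
        + (1 - p) * (torch.log1p(-p) - torch.log1p(-q))

x = torch.randn(64, 5)
time = torch.randint(0, T, (64,))
event = torch.bernoulli(torch.full((64,), 0.7))
external_hazard = torch.full((64, T), 0.1)       # stand-in external predictions

hazard = torch.sigmoid(net(x))
lam = 0.5                                        # assumed weight on prior information
loss = nll_discrete(hazard, time, event) \
    + lam * bernoulli_kl(external_hazard, hazard).mean()
loss.backward()
```

With a large lam the network stays close to the external model; with lam near zero it reduces to ordinary training on the internal data.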
Learning-to-Rank with Partitioned Preference: Fast Estimation for the Plackett-Luce Model
Ma, Jiaqi, Yi, Xinyang, Tang, Weijing, Zhao, Zhe, Hong, Lichan, Chi, Ed H., Mei, Qiaozhu
We investigate the Plackett-Luce (PL) model based listwise learning-to-rank (LTR) on data with partitioned preference, where a set of items are sliced into ordered and disjoint partitions, but the ranking of items within a partition is unknown. Given $N$ items with $M$ partitions, calculating the likelihood of data with partitioned preference under the PL model has a time complexity of $O(N+S!)$, where $S$ is the maximum size of the top $M-1$ partitions.
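For reference, a brute-force sketch that makes the factorial cost visible (illustrative only, not the paper's fast estimator): the likelihood sums the standard PL probability over every full ranking consistent with the partitions.

```python
# Brute-force Plackett-Luce likelihood for partitioned preference.
import itertools
import numpy as np

def pl_prob(ranking, w):
    # Standard PL probability of a full ranking (best first); w are item weights.
    p, rest = 1.0, w[list(ranking)].sum()
    for i in ranking:
        p *= w[i] / rest
        rest -= w[i]
    return p

def partitioned_likelihood(partitions, w):
    # Sum over all orderings consistent with the ordered partitions; each
    # block contributes up to S! permutations. (Summing over the last
    # block's internal orders is redundant -- they integrate out -- which is
    # why only the top M-1 partitions drive the factorial cost.)
    total = 0.0
    for perms in itertools.product(
            *(itertools.permutations(p) for p in partitions)):
        ranking = [i for block in perms for i in block]
        total += pl_prob(ranking, w)
    return total

w = np.exp(np.random.default_rng(0).normal(size=6))   # PL weights for 6 items
partitions = [[0, 1], [2, 3, 4], [5]]                 # ordered, disjoint blocks
print(partitioned_likelihood(partitions, w))
```

Even this tiny example evaluates 2! x 3! orderings; the factorial blow-up in the block size $S$ is exactly what the paper's fast estimation method avoids.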
SODEN: A Scalable Continuous-Time Survival Model through Ordinary Differential Equation Networks
Tang, Weijing, Ma, Jiaqi, Mei, Qiaozhu, Zhu, Ji
In this paper, we propose a flexible model for survival analysis using neural networks along with scalable optimization algorithms. One key technical challenge in directly applying maximum likelihood estimation (MLE) to censored data is that evaluating the objective function and its gradients with respect to model parameters requires the calculation of integrals. To address this challenge, we take a novel perspective: the MLE for censored data can be viewed as a differential-equation constrained optimization problem. Following this connection, we model the distribution of event time through an ordinary differential equation and utilize efficient ODE solvers and adjoint sensitivity analysis to numerically evaluate the likelihood and the gradients. Using this approach, we are able to 1) provide a broad family of continuous-time survival distributions without strong structural assumptions, 2) obtain powerful feature representations using neural networks, and 3) allow efficient estimation of the model in large-scale applications using stochastic gradient descent. Through both simulation studies and real-world data examples, we demonstrate the effectiveness of the proposed method in comparison to existing state-of-the-art deep learning survival analysis models.
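A minimal sketch of the ODE view follows (assumptions throughout: the tiny hazard network, the per-subject time rescaling, and the use of the third-party torchdiffeq solver; the paper additionally uses adjoint sensitivity analysis, available in torchdiffeq as odeint_adjoint).

```python
# Censored MLE via an ODE: the cumulative hazard Lambda(t; x) solves
# dLambda/dt = h(t, x), so log S(t) = -Lambda(t) comes from one ODE solve.
# Rescaling each subject's time to [0, 1] lets one solver call batch subjects.
import torch
from torchdiffeq import odeint  # pip install torchdiffeq

hazard_net = torch.nn.Sequential(torch.nn.Linear(5, 32), torch.nn.Softplus(),
                                 torch.nn.Linear(32, 1), torch.nn.Softplus())

def neg_log_lik(x, t_obs, event):
    # Lambda_i(t_i) = t_i * int_0^1 h(s * t_i, x_i) ds
    def dLam(s, Lam):
        inp = torch.cat([x, (s * t_obs).unsqueeze(1)], dim=1)
        return t_obs.unsqueeze(1) * hazard_net(inp)
    Lam = odeint(dLam, torch.zeros(len(x), 1), torch.tensor([0.0, 1.0]))
    Lam = Lam[-1].squeeze(1)                     # Lambda_i(t_i)
    h_T = hazard_net(torch.cat([x, t_obs.unsqueeze(1)], dim=1)).squeeze(1)
    # event: log h(T) - Lambda(T); censored: -Lambda(T)
    return -(event * torch.log(h_T + 1e-8) - Lam).mean()

x = torch.randn(8, 4)
t_obs = torch.rand(8) + 0.1
event = torch.bernoulli(torch.full((8,), 0.7))
loss = neg_log_lik(x, t_obs, event)
loss.backward()                                  # trainable end to end with SGD
```

The Softplus output keeps the hazard nonnegative, so the implied survival function $S(t) = \exp(-\Lambda(t))$ is a valid continuous-time distribution without further structural assumptions.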
A Flexible Generative Framework for Graph-based Semi-supervised Learning
Ma, Jiaqi, Tang, Weijing, Zhu, Ji, Mei, Qiaozhu
We consider a family of problems concerned with making predictions for the majority of unlabeled, graph-structured data samples based on a small proportion of labeled examples. Relational information among the data samples, often encoded in the graph or network structure, has been shown to be helpful for these semi-supervised learning tasks. However, conventional graph-based regularization methods and recent graph neural networks do not fully leverage the interrelations between the features, the graph, and the labels. We propose a flexible generative framework for graph-based semi-supervised learning, which models the joint distribution of the node features, labels, and graph structure. Borrowing insights from random graph models in the network science literature, this joint distribution can be instantiated using various distribution families. For the inference of missing labels, we exploit recent advances in scalable variational inference techniques to approximate the Bayesian posterior. We conduct thorough experiments on benchmark datasets for graph-based semi-supervised learning. Results show that the proposed methods outperform state-of-the-art models under most settings.
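As a toy instantiation of the joint-distribution idea (illustrative assumptions: Gaussian class-conditional features, a stochastic block model for edges, and exact posterior enumeration in place of the paper's variational inference):

```python
# Toy generative semi-supervised learning on a graph: p(x, y, G) factors into
# class-conditional Gaussian features and block-model edges given labels; a
# missing label is inferred from its posterior with the other labels fixed.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, K = 6, 2
y = np.array([0, 0, 0, 1, 1, 1])
mu = np.array([-1.0, 1.0])                       # class-conditional feature means
x = rng.normal(mu[y], 0.5)
B = np.array([[0.8, 0.1], [0.1, 0.8]])           # block edge probabilities
A = rng.binomial(1, B[y[:, None], y[None, :]])   # directed edges, for simplicity
np.fill_diagonal(A, 0)

i = 2                                            # pretend node i's label is missing
logp = np.zeros(K)
for k in range(K):
    yk = y.copy(); yk[i] = k
    logp[k] = norm.logpdf(x[i], mu[k], 0.5)      # p(x_i | y_i = k)
    p_edge = B[k, yk]                            # p(A_ij = 1 | y_i = k, y_j)
    mask = np.arange(n) != i
    logp[k] += np.sum(A[i, mask] * np.log(p_edge[mask])
                      + (1 - A[i, mask]) * np.log(1 - p_edge[mask]))
post = np.exp(logp - logp.max()); post /= post.sum()
print(post)                                      # posterior over node i's label
```

Both the feature term and the edge term inform the posterior, which is the interrelation between features, graph, and labels that the framework is designed to exploit; at scale, enumeration is replaced by scalable variational inference.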