Wu, Anpeng
General Information Metrics for Improving AI Model Training Efficiency
Xu, Jianfeng, Liu, Congcong, Tan, Xiaoying, Zhu, Xiaojie, Wu, Anpeng, Wan, Huan, Kong, Weijun, Li, Chun, Xu, Hu, Kuang, Kun, Wu, Fei
Artificial intelligence (AI) is transforming numerous aspects of contemporary life, with advancements fueled largely by the training of models on extensive datasets (Pouyanfar et al. 2018; S. Dong et al. 2021; Bialkova 2024). This is particularly evident in areas like autonomous driving (S. Liu et al. 2024; C. Cui et al. 2024), generative AI (Feuerriegel et al. 2024; Huang et al. 2024), and medical image processing (Tian et al. 2024; Alzubaidi et al. 2024), which depend on large-scale model training. As these models expand to encompass hundreds of billions of parameters, the need for high-quality training data becomes critical (Zhao et al. 2023; Minaee et al. 2024). Training such large-scale models often requires tens to hundreds of trillions of tokens, substantial interdisciplinary effort over months, and a vast array of computational resources, including thousands of GPUs and high levels of energy consumption (Achiam et al. 2023; Touvron, Lavril, et al. 2023; Touvron, Martin, et al. 2023; Chowdhery et al. 2023). A core challenge is ensuring that training data is meticulously curated: ineffective data selection can yield models that underperform, fall short of desired objectives, and waste considerable resources (Chowdhery et al. 2023; Gunasekar et al. 2023b). Thus, once the model architecture and algorithms are defined, the quality of the training data becomes paramount to a model's success, significantly influencing the performance and relevance of AI technologies across various domains (Hamid 2023; Zha et al. 2023). By focusing on data quality, small-scale models can achieve performance comparable to much larger models. For instance, Phi-1.5 achieves performance on par with models 5 times its size, while Phi-2 matches or even surpasses the performance of models 25 times larger (Gunasekar et al. 2023a; Y. Li et al. 2023).
Causality for Large Language Models
Wu, Anpeng, Kuang, Kun, Zhu, Minqin, Wang, Yingrong, Zheng, Yujia, Han, Kairong, Li, Baohong, Chen, Guangyi, Wu, Fei, Zhang, Kun
Recent breakthroughs in artificial intelligence have driven a paradigm shift in which large language models (LLMs) with billions or trillions of parameters are trained on vast datasets, achieving unprecedented success across a wide range of language tasks. However, despite these successes, LLMs still rely on probabilistic modeling, which often captures spurious correlations rooted in linguistic patterns and social stereotypes rather than the true causal relationships between entities and events. This limitation leaves LLMs vulnerable to issues such as demographic biases, social stereotypes, and hallucinations. These challenges highlight the urgent need to integrate causality into LLMs, moving beyond correlation-driven paradigms to build more reliable and ethically aligned AI systems. While many existing surveys and studies focus on utilizing prompt engineering to elicit causal knowledge from LLMs or on developing benchmarks to assess their causal reasoning abilities, most of these efforts rely on human intervention to activate pre-trained models. How to embed causality into the training process of LLMs and build more general and intelligent models remains unexplored. Recent research highlights that LLMs function as causal parrots, capable of reciting causal knowledge without truly understanding or applying it; prompt-based methods remain limited to such human-driven improvements. This survey aims to address this gap by exploring how causality can enhance LLMs at every stage of their lifecycle, from token embedding learning and foundation model training to fine-tuning, alignment, inference, and evaluation, paving the way for more interpretable, reliable, and causally informed models. Additionally, we outline six promising future directions to advance LLM development, enhance their causal reasoning capabilities, and address the current limitations these models face.
Stable Heterogeneous Treatment Effect Estimation across Out-of-Distribution Populations
Zhang, Yuling, Wu, Anpeng, Kuang, Kun, Du, Liang, Sun, Zixun, Wang, Zhi
Heterogeneous treatment effect (HTE) estimation is vital for understanding how treatment effects vary across individuals or subgroups. Most existing HTE estimation methods focus on addressing the selection bias induced by imbalanced distributions of confounders between treated and control units, but ignore distribution shifts across populations. As a result, their applicability is limited to the in-distribution (ID) population, which shares a similar distribution with the training dataset. In real-world applications, where population distributions are subject to continuous change, there is an urgent need for stable HTE estimation across out-of-distribution (OOD) populations, which remains an open problem. To resolve it, we propose a novel Stable Balanced Representation Learning with Hierarchical-Attention Paradigm (SBRL-HAP) framework, which consists of 1) a Balancing Regularizer for eliminating selection bias, 2) an Independence Regularizer for addressing the distribution shift issue, and 3) a Hierarchical-Attention Paradigm for coordinating balance and independence. In this way, SBRL-HAP regresses counterfactual outcomes using ID data while ensuring that the resulting HTE estimates generalize to out-of-distribution scenarios, thereby enhancing the model's applicability in real-world settings. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of SBRL-HAP in achieving stable HTE estimation across OOD populations, with an average 10% reduction in the error metric PEHE and an 11% decrease in ATE bias compared to SOTA methods.
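To make the interplay of the two regularizers concrete, here is a minimal PyTorch sketch of a training objective in the spirit of SBRL-HAP. The network shapes, the linear-kernel MMD balancing penalty, and the decorrelation-style independence penalty are simplifying assumptions for illustration; the paper's actual hierarchical-attention design is not reproduced here.

```python
import torch
import torch.nn as nn

class SBRLSketch(nn.Module):
    """Illustrative two-head estimator with balancing and independence penalties.

    A simplified reading of the SBRL-HAP abstract, not the authors' architecture:
    phi learns a shared representation, two heads predict potential outcomes,
    and two penalties stand in for the Balancing and Independence Regularizers.
    """

    def __init__(self, d_in, d_rep=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_in, d_rep), nn.ReLU(),
                                 nn.Linear(d_rep, d_rep))
        self.head0 = nn.Linear(d_rep, 1)  # outcome head for control units
        self.head1 = nn.Linear(d_rep, 1)  # outcome head for treated units

    def forward(self, x):
        return self.phi(x)

def mmd_linear(r0, r1):
    """Linear-kernel MMD between control/treated representations (balancing)."""
    return (r0.mean(0) - r1.mean(0)).pow(2).sum()

def decorrelation(r):
    """Penalize off-diagonal covariance among representation dimensions,
    a crude stand-in for the Independence Regularizer."""
    rc = r - r.mean(0, keepdim=True)
    cov = rc.t() @ rc / (len(r) - 1)
    off = cov - torch.diag(torch.diag(cov))
    return off.pow(2).sum()

def loss_fn(model, x, t, y, alpha=1.0, beta=0.1):
    # t is a 0/1 treatment indicator, y the observed factual outcome.
    rep = model(x)
    y_hat = torch.where(t.bool(), model.head1(rep).squeeze(-1),
                        model.head0(rep).squeeze(-1))
    factual = (y_hat - y).pow(2).mean()
    bal = mmd_linear(rep[t == 0], rep[t == 1])
    ind = decorrelation(rep)
    return factual + alpha * bal + beta * ind
```

In this reading, the balancing term aligns treated and control representations (targeting selection bias), while the decorrelation term discourages the representation from leaning on unstable correlated dimensions (targeting distribution shift).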
Learning Discrete Latent Variable Structures with Tensor Rank Conditions
Chen, Zhengming, Cai, Ruichu, Xie, Feng, Qiao, Jie, Wu, Anpeng, Li, Zijian, Hao, Zhifeng, Zhang, Kun
Unobserved discrete data are ubiquitous in many scientific disciplines, and learning the causal structure of these latent variables is crucial for uncovering data patterns. Most studies focus on linear latent variable models or impose strict constraints on latent structures, and therefore fail to address discrete data involving non-linear relationships or complex latent structures. To address this, we explore a tensor rank condition on contingency tables for an observed variable set $\mathbf{X}_p$, showing that the rank is determined by the minimum support of a specific conditional set (not necessarily contained in $\mathbf{X}_p$) that d-separates all variables in $\mathbf{X}_p$. With this condition, one can locate latent variables by probing the rank on different observed variable sets, and further identify the latent causal structure under certain structural assumptions. We present the corresponding identification algorithm and conduct simulated experiments to verify the effectiveness of our method. In general, our results elegantly extend the identification boundary for causal discovery with discrete latent variables and expand the application scope of causal discovery with latent variables.
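The rank condition can be seen in a small simulation: when a binary latent variable d-separates two observed variables with four categories each, their 4x4 contingency table has numerical rank 2, matching the latent support size rather than the observed cardinality. The specific distributions below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A binary latent L d-separates X1 and X2, each observed with 4 categories.
L = rng.integers(0, 2, size=n)
P1 = np.array([[0.5, 0.3, 0.1, 0.1],    # P(X1 | L=0)
               [0.1, 0.1, 0.3, 0.5]])   # P(X1 | L=1)
P2 = np.array([[0.4, 0.4, 0.1, 0.1],
               [0.1, 0.2, 0.3, 0.4]])

def sample_cond(P, states):
    """Sample one categorical draw per unit from the row of P picked by its state."""
    u = rng.random(len(states))
    cum = P.cumsum(axis=1)[states]      # per-unit CDF over categories
    return (u[:, None] > cum).sum(axis=1)

X1, X2 = sample_cond(P1, L), sample_cond(P2, L)

# Empirical 4x4 contingency table of (X1, X2).
table = np.zeros((4, 4))
np.add.at(table, (X1, X2), 1)
table /= n

# Only 2 singular values are far from zero: the table's rank is bounded by
# the support size of the d-separating latent, not by the 4 observed categories.
print(np.round(np.linalg.svd(table, compute_uv=False), 4))
```

The reason is that $P(X_1=i, X_2=j) = \sum_{l} P(l)\,P(X_1=i \mid l)\,P(X_2=j \mid l)$ is a sum of as many rank-one terms as the latent variable has states.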
Contrastive Balancing Representation Learning for Heterogeneous Dose-Response Curves Estimation
Zhu, Minqin, Wu, Anpeng, Li, Haoxuan, Xiong, Ruoxuan, Li, Bo, Yang, Xiaoqing, Qin, Xuan, Zhen, Peng, Guo, Jiecheng, Wu, Fei, Kuang, Kun
Estimating individuals' potential responses to varying treatment doses is crucial for decision-making in areas such as precision medicine and management science. Most recent studies predict counterfactual outcomes by learning a covariate representation that is independent of the treatment variable. However, such independence constraints neglect much of the covariate information that is useful for counterfactual prediction, especially when the treatment variables are continuous. To tackle this issue, we first theoretically demonstrate the importance of balancing and prognostic representations for unbiased estimation of heterogeneous dose-response curves; that is, the learned representations are constrained to satisfy conditional independence between the covariates and both the treatment variables and the potential responses. Based on this, we propose a novel Contrastive balancing Representation learning Network using a partial distance measure, called CRNet, for estimating heterogeneous dose-response curves without losing the continuity of treatments. Extensive experiments on synthetic and real-world datasets demonstrate that our proposal significantly outperforms previous methods.
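As a sketch of the kind of dependence measure involved, the PyTorch snippet below computes a sample distance correlation between a learned representation and a continuous dose and uses it as a differentiable penalty. Plain distance correlation is a simplified stand-in for the paper's partial distance measure, and all names are illustrative assumptions.

```python
import torch

def distance_correlation(a, b):
    """Sample distance correlation between two batches of vectors.

    Differentiable dependence measure between a representation and a
    continuous treatment; CRNet's criterion is a *partial* distance
    measure, so treat this as a simplified stand-in.
    """
    def centered_dist(x):
        d = torch.cdist(x, x)  # pairwise Euclidean distances
        return d - d.mean(0, keepdim=True) - d.mean(1, keepdim=True) + d.mean()

    A, B = centered_dist(a), centered_dist(b)
    dcov2_ab = (A * B).mean()
    dcov2_aa = (A * A).mean()
    dcov2_bb = (B * B).mean()
    return dcov2_ab / (dcov2_aa.sqrt() * dcov2_bb.sqrt() + 1e-12)

# Penalizing dependence between a balancing representation and a continuous
# dose keeps the treatment continuous: no discretization into dose bins.
rep = torch.randn(256, 16, requires_grad=True)   # hypothetical representation batch
dose = torch.rand(256, 1)                        # continuous treatment doses
penalty = distance_correlation(rep, dose)
penalty.backward()                               # gradients flow into the representation
```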
Pareto-Optimal Estimation and Policy Learning on Short-term and Long-term Treatment Effects
Wang, Yingrong, Wu, Anpeng, Li, Haoxuan, Liu, Weiming, Miao, Qiaowei, Xiong, Ruoxuan, Wu, Fei, Kuang, Kun
This paper focuses on developing Pareto-optimal estimation and policy learning to identify the most effective treatment, one that maximizes the total reward from both short-term and long-term effects, which may conflict with each other. For example, a higher dosage of medication might speed a patient's recovery (short-term) but could also cause severe long-term side effects. Although recent works have investigated short-term effects, long-term effects, or both, how to trade off between them to achieve optimal treatment remains an open challenge. Moreover, when multiple objectives are estimated directly with conventional causal representation learning, the optimization directions of the various tasks can conflict as well. In this paper, we systematically investigate these issues and introduce a Pareto-Efficient algorithm, comprising Pareto-Optimal Estimation (POE) and Pareto-Optimal Policy Learning (POPL), to tackle them. POE incorporates a continuous Pareto module with representation balancing, enhancing estimation efficiency across multiple tasks. POPL derives the short-term and long-term outcomes associated with various treatment levels and explores the Pareto frontier spanned by these outcomes. Results on both synthetic and real-world datasets demonstrate the superiority of our method.
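For intuition about the Pareto machinery, the snippet below implements the classic two-task min-norm weighting (the MGDA building block): it finds the convex combination of the short-term and long-term gradients with minimal norm, whose negative is, to first order, a common descent direction when one exists. This is a generic sketch of the underlying idea, not the paper's continuous Pareto module; the toy gradient values are made up.

```python
import numpy as np

def pareto_weight(g1, g2):
    """Return alpha in [0, 1] minimizing ||alpha*g1 + (1 - alpha)*g2||^2."""
    diff = g1 - g2
    denom = diff @ diff
    if denom < 1e-12:          # gradients (nearly) identical: any weight works
        return 0.5
    return float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))

# Conflicting short-term and long-term objective gradients (toy values).
g_short = np.array([1.0, 0.2])
g_long = np.array([-0.8, 0.4])
a = pareto_weight(g_short, g_long)
d = a * g_short + (1 - a) * g_long
# Stepping along -d decreases both objectives to first order.
print(a, d @ g_short > 0, d @ g_long > 0)
```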
Hierarchical Topological Ordering with Conditional Independence Test for Limited Time Series
Wu, Anpeng, Li, Haoxuan, Kuang, Kun, Zhang, Keli, Wu, Fei
Learning directed acyclic graphs (DAGs) to identify the causal relations underlying observational data is crucial but also poses significant challenges. Recently, topology-based methods have emerged as a two-step approach to discovering DAGs: first learn the topological ordering of variables, then eliminate redundant edges while ensuring the graph remains acyclic. However, these methods tend to generate numerous spurious edges that require subsequent pruning. To overcome this limitation, we propose an improvement to topology-based methods that introduces limited time series data, consisting of only two cross-sectional records that need not be adjacent in time and may be collected with flexible timing. By incorporating conditional instrumental variables as exogenous interventions, we identify the descendant nodes of each variable. Following this line, we propose a hierarchical topological ordering algorithm with conditional independence test (HT-CIT), which enables the efficient learning of sparse DAGs with a smaller search space than other popular approaches and greatly reduces the number of edges that need to be pruned. Empirical results on synthetic and real-world datasets demonstrate the superiority of the proposed HT-CIT algorithm.
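The conditional independence test named in the title can be instantiated, for example, with a Fisher-z partial-correlation test; the sketch below shows that building block on a toy chain, where A and C become independent given B. The choice of this particular test and the variable names are assumptions for illustration, since the paper does not commit to one test here.

```python
import numpy as np
from scipy import stats

def fisher_z_ci_test(data, i, j, cond, alpha=0.05):
    """Partial-correlation conditional independence test (Fisher z).

    Tests X_i independent of X_j given X_cond under Gaussian assumptions;
    one standard choice for the CIT step of an ordering-based method.
    """
    idx = [i, j] + list(cond)
    sub = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(sub)                           # precision of the submatrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
    n, k = data.shape[0], len(cond)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha                              # True => independent

# Toy chain A -> B -> C: A and C are dependent, but independent given B.
rng = np.random.default_rng(0)
A = rng.normal(size=5000)
B = 0.9 * A + rng.normal(size=5000)
C = 0.9 * B + rng.normal(size=5000)
X = np.column_stack([A, B, C])
print(fisher_z_ci_test(X, 0, 2, []))    # False: A and C are dependent
print(fisher_z_ci_test(X, 0, 2, [1]))   # True: independent given B
```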
Instrumental Variables in Causal Inference and Machine Learning: A Survey
Wu, Anpeng, Kuang, Kun, Xiong, Ruoxuan, Wu, Fei
Causal inference is the process of using assumptions, study designs, and estimation strategies to draw conclusions about the causal relationships between variables from data. It allows researchers to better understand the underlying mechanisms at work in complex systems and to make more informed decisions. In many settings, we may not fully observe all the confounders that affect both the treatment and outcome variables, complicating the estimation of causal effects. To address this problem, a growing literature in both causal inference and machine learning proposes the use of Instrumental Variables (IV). This paper serves as the first effort to systematically and comprehensively introduce and discuss IV methods and their applications in both causal inference and machine learning. First, we provide the formal definition of IVs and discuss the identification problem of IV regression methods under different assumptions. Second, we categorize the existing work on IV methods into three streams according to the focus of the proposed methods: two-stage least squares with IVs, control functions with IVs, and the evaluation of IVs. For each stream, we present both the classical causal inference methods and recent developments in the machine learning literature. Then, we introduce a variety of applications of IV methods in real-world scenarios and provide a summary of the available datasets and algorithms. Finally, we summarize the literature, discuss open problems, and suggest promising future research directions for IV methods and their applications. We also provide a toolkit of the IV methods reviewed in this survey at https://github.com/causal-machine-learning-lab/mliv.
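To ground the two-stage least squares (2SLS) stream for readers new to IVs, here is a minimal self-contained sketch on synthetic data. The data-generating process and variable names are illustrative assumptions; the authors' full toolkit lives at the GitHub link above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: U confounds both treatment T and outcome Y;
# Z affects Y only through T (the exclusion restriction).
U = rng.normal(size=n)                      # unmeasured confounder
Z = rng.normal(size=n)                      # instrument
T = 0.8 * Z + 0.6 * U + rng.normal(size=n)  # treatment
Y = 2.0 * T + 1.5 * U + rng.normal(size=n)  # true causal effect of T on Y is 2.0

def ols(X, y):
    """Least-squares coefficients with an intercept column."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

# Naive OLS of Y on T is biased upward by the confounder U.
beta_ols = ols(T[:, None], Y)[1]

# Stage 1: regress T on Z and keep the fitted values T_hat.
t_coef = ols(Z[:, None], T)
T_hat = t_coef[0] + t_coef[1] * Z

# Stage 2: regress Y on T_hat; the slope is the 2SLS estimate.
beta_2sls = ols(T_hat[:, None], Y)[1]

print(f"OLS estimate:  {beta_ols:.3f}  (biased)")
print(f"2SLS estimate: {beta_2sls:.3f}  (close to the true effect 2.0)")
```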
Learning Instrumental Variable from Data Fusion for Treatment Effect Estimation
Wu, Anpeng, Kuang, Kun, Xiong, Ruoxuan, Zhu, Minqing, Liu, Yuxuan, Li, Bo, Liu, Furui, Wang, Zhihua, Wu, Fei
The advent of the big data era has brought new opportunities and challenges for estimating treatment effects under data fusion, that is, from a mixed dataset collected from multiple sources, each with an independent treatment assignment mechanism. Because source labels may be omitted and confounders unmeasured, traditional methods cannot effectively estimate individual treatment-assignment probabilities or infer treatment effects. We therefore propose to reconstruct the source label and model it as a Group Instrumental Variable (GIV) to implement IV-based regression for treatment effect estimation. In this paper, we conceptualize this line of thought and develop a unified framework (Meta-EM) to (1) map the raw data into a representation space and construct linear mixed models for the assigned treatment variable; (2) estimate the distribution differences and model the GIV for the different treatment assignment mechanisms; and (3) adopt an alternating training strategy to iteratively optimize the representations and the joint distribution to model the GIV for IV regression. Empirical results demonstrate the advantages of our Meta-EM compared with state-of-the-art methods.
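A toy re-creation of the idea, under strong simplifying assumptions (two sources, a scalar treatment, scikit-learn's GaussianMixture as the EM step): fit a mixture to the part of the treatment unexplained by covariates, treat the recovered component label as a group instrument, and run 2SLS with it. This is only the intuition behind the approach, not the Meta-EM pipeline itself.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 20_000

# Two sources with different treatment-assignment mechanisms; the source
# label S is NOT observed. U is an unmeasured confounder.
S = rng.integers(0, 2, size=n)
U = rng.normal(size=n)
X = rng.normal(size=(n, 2))                       # observed covariates
T = (np.where(S == 1, 1.5, -1.5) + X @ np.array([0.5, -0.3])
     + 0.6 * U + 0.3 * rng.normal(size=n))
Y = 2.0 * T + 1.5 * U + rng.normal(size=n)        # true effect of T is 2.0

# EM step (simplified): fit a 2-component mixture to the part of T not
# explained by X, recovering a proxy for the source label.
resid = T - X @ np.linalg.lstsq(X, T, rcond=None)[0]
gmm = GaussianMixture(n_components=2, random_state=0).fit(resid[:, None])
Z = gmm.predict(resid[:, None]).astype(float)     # reconstructed group label as GIV

def ols(Xm, y):
    X1 = np.column_stack([np.ones(len(Xm)), Xm])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

# 2SLS with the recovered group instrument.
T_hat = np.column_stack([np.ones(n), Z]) @ ols(Z[:, None], T)
print("2SLS with recovered GIV:", ols(T_hat[:, None], Y)[1])  # close to 2.0
```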
Confounder Balancing for Instrumental Variable Regression with Latent Variable
Wu, Anpeng, Kuang, Kun, Xiong, Ruoxuan, Li, Bo, Wu, Fei
This paper studies the confounding effects of unmeasured confounders and the imbalance of observed confounders in IV regression, aiming at unbiased causal effect estimation. Recently, nonlinear IV estimators have been proposed that allow nonlinear models in both stages. However, the observed confounders may be imbalanced in stage 2, which can still lead to biased treatment effect estimation in certain cases. To this end, we propose a Confounder Balanced IV Regression (CB-IV) algorithm to jointly remove the bias from the unmeasured confounders and the imbalance of observed confounders. Theoretically, by redefining and solving an inverse problem for the potential outcome function, we show that CB-IV can estimate treatment effects without bias and with lower variance. A major disadvantage of IV methods is that little prior knowledge or theory is available to pre-define a valid IV in real-world scenarios. We therefore study two more challenging settings without pre-defined valid IVs: (1) indistinguishable IVs implicitly present in the observations, i.e., the mixed-variable challenge, and (2) latent IVs that do not appear in the observations, i.e., the latent-variable challenge. To address these two challenges, we extend CB-IV with a latent-variable module, yielding the CB-IV-L algorithm. Extensive experiments demonstrate that our CB-IV(-L) outperforms existing approaches.
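A minimal sketch, assuming a simple feed-forward instantiation, of how a two-stage IV objective can be combined with a confounder-balancing term. The median-split discrepancy and all module shapes are illustrative assumptions rather than the CB-IV(-L) architecture.

```python
import torch
import torch.nn as nn

class CBIVSketch(nn.Module):
    """Minimal two-stage sketch in the spirit of CB-IV (not the authors' code).

    Stage 1 regresses the treatment on the instrument and covariates;
    stage 2 regresses the outcome on the fitted treatment and a balanced
    covariate representation.
    """

    def __init__(self, d_x, d_rep=16):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1 + d_x, 32), nn.ReLU(),
                                    nn.Linear(32, 1))
        self.phi = nn.Sequential(nn.Linear(d_x, d_rep), nn.ReLU(),
                                 nn.Linear(d_rep, d_rep))
        self.stage2 = nn.Sequential(nn.Linear(1 + d_rep, 32), nn.ReLU(),
                                    nn.Linear(32, 1))

def cb_iv_loss(model, z, x, t, y, lam=1.0):
    # z: instrument (n,1), x: covariates (n,d_x), t: treatment (n,1), y: outcome (n,1).
    t_hat = model.stage1(torch.cat([z, x], dim=1))          # stage-1 fit
    loss1 = (t_hat - t).pow(2).mean()
    rep = model.phi(x)
    y_hat = model.stage2(torch.cat([t_hat.detach(), rep], dim=1))
    loss2 = (y_hat - y).pow(2).mean()
    # Balancing: representations of units with high vs. low fitted treatment
    # should look alike, discouraging observed-confounder imbalance in stage 2.
    hi = t_hat.detach().squeeze(-1) > t_hat.detach().median()
    bal = (rep[hi].mean(0) - rep[~hi].mean(0)).pow(2).sum()
    return loss1 + loss2 + lam * bal
```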