Li, Miaomiao
Understanding and Mitigating the Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks
Li, Miaomiao, Chen, Hao, Wang, Yang, Zhu, Tingyuan, Zhang, Weijia, Zhu, Kaijie, Wong, Kam-Fai, Wang, Jindong
Generating synthetic datasets via large language models (LLMs) themselves has emerged as a promising approach to improving LLM performance. However, LLMs inherently reflect biases present in their training data, leading to a critical challenge: when these models generate synthetic data for training, they may propagate and amplify their inherent biases, significantly impacting model fairness and robustness on downstream tasks, a phenomenon we term bias inheritance. This work presents the first systematic investigation into understanding, analyzing, and mitigating bias inheritance. We study the problem by fine-tuning LLMs on a combined dataset of original and LLM-augmented data, where the bias ratio denotes the proportion of augmented data. Through systematic experiments across 10 classification and generation tasks, we analyze how six different types of bias manifest at varying bias ratios. Our results reveal that bias inheritance has nuanced effects on downstream tasks, affecting classification and generation tasks differently. Our analysis then identifies three key misalignment factors: misalignment of values, of group data, and of data distributions. Based on these insights, we propose three mitigation strategies: token-based, mask-based, and loss-based approaches. Experiments demonstrate that these strategies also behave differently across tasks and bias types, indicating the substantial challenge of fully mitigating bias inheritance. We hope this work provides valuable insights for research on LLM-based data augmentation.
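The abstract does not spell out how the combined dataset is assembled; the Python sketch below shows one plausible reading, where the total training-set size is held fixed and the bias ratio controls the augmented fraction. The function name, its signature, and the fixed-size assumption are illustrative, not taken from the paper.

```python
import random

def mix_datasets(original, augmented, bias_ratio, seed=0):
    """Assemble a fine-tuning set in which `bias_ratio` is the fraction of
    LLM-augmented examples and the remainder are original examples.
    Assumption (not from the paper): total size is fixed at |original|."""
    rng = random.Random(seed)
    n_total = len(original)
    n_aug = round(bias_ratio * n_total)
    mixed = rng.sample(original, n_total - n_aug) + rng.sample(augmented, n_aug)
    rng.shuffle(mixed)
    return mixed

# toy illustration: a 30% bias ratio yields 70 original + 30 augmented examples
original = [("orig text %d" % i, "label") for i in range(100)]
augmented = [("llm text %d" % i, "label") for i in range(100)]
train_set = mix_datasets(original, augmented, bias_ratio=0.3)
```

Sweeping `bias_ratio` from 0 to 1 over such mixtures is then what allows bias manifestation to be measured at each augmentation level.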
RulePrompt: Weakly Supervised Text Classification with Prompting PLMs and Self-Iterative Logical Rules
Li, Miaomiao, Zhu, Jiaqi, Wang, Yang, Yang, Yi, Li, Yilin, Wang, Hongan
Weakly supervised text classification (WSTC), also called zero-shot or dataless text classification, has attracted increasing attention because it requires only a limited set of seed words (label names) for each category instead of labeled data, which makes it well suited to classifying large volumes of text in the dynamic and open Web environment. With the help of recently popular prompting Pre-trained Language Models (PLMs), many studies have leveraged manually crafted and/or automatically identified verbalizers to estimate the likelihood of categories, but they fail to differentiate the effects of these category-indicative words, let alone capture their correlations or adapt them to the unlabeled corpus. In this paper, to let the PLM effectively understand each category, we first propose a novel form of rule-based knowledge that uses logical expressions to characterize the meanings of categories. We then develop a prompting PLM-based approach named RulePrompt for the WSTC task, consisting of a rule mining module and a rule-enhanced pseudo-label generation module, plus a self-supervised fine-tuning module that aligns the PLM with the task. Within this framework, the inaccurate pseudo labels assigned to texts and the imprecise logical rules associated with categories mutually enhance each other in an alternating manner. This establishes a self-iterative closed loop of knowledge (rule) acquisition and utilization, with seed words serving as the starting point. Extensive experiments validate the effectiveness and robustness of our approach, which markedly outperforms state-of-the-art weakly supervised methods. Moreover, our approach yields interpretable category rules, demonstrating its advantage in disambiguating easily confused categories.
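To make the self-iterative closed loop concrete, here is a deliberately crude Python sketch: plain word overlap stands in for the prompting PLM's likelihood estimation, and a per-category set of frequent words stands in for the paper's logical-expression rules. The function names and the mining heuristic are hypothetical, not RulePrompt's actual modules.

```python
from collections import Counter

def pseudo_label(text, rules):
    """Score each category by how many of its indicative words fire
    (a crude stand-in for PLM-based likelihood estimation)."""
    tokens = set(text.lower().split())
    scores = {c: len(tokens & words) for c, words in rules.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def mine_rules(corpus, labels, seed_words, top_k=5):
    """Re-mine category-indicative words from the current pseudo labels."""
    rules = {}
    for cat, seeds in seed_words.items():
        counter = Counter()
        for text, lab in zip(corpus, labels):
            if lab == cat:
                counter.update(text.lower().split())
        rules[cat] = set(seeds) | {w for w, _ in counter.most_common(top_k)}
    return rules

# toy corpus; only label-name seed words are given, no labeled data
corpus = ["the team won the match", "stocks fell on the market",
          "the coach praised the players", "investors traded shares"]
seed_words = {"sports": {"match"}, "finance": {"market"}}

rules = {c: set(s) for c, s in seed_words.items()}
for _ in range(3):  # self-iterative closed loop, seeded by the label names
    labels = [pseudo_label(t, rules) for t in corpus]
    rules = mine_rules(corpus, labels, seed_words)
print(rules)
```

Each pass labels more texts as the rules grow, and the new pseudo labels in turn refine the rules, mirroring the mutual-enhancement loop the abstract describes.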
Multiple Kernel k-Means with Incomplete Kernels
Liu, Xinwang (National University of Defense Technology) | Li, Miaomiao (National University of Defense Technology) | Wang, Lei (University of Wollongong) | Dou, Yong (National University of Defense Technology) | Yin, Jianping (National University of Defense Technology) | Zhu, En (National University of Defense Technology)
Multiple kernel clustering (MKC) algorithms optimally combine a group of pre-specified base kernels to improve clustering performance. However, existing MKC algorithms cannot efficiently handle the situation where some rows and columns of the base kernels are absent. This paper proposes a simple yet effective algorithm to address this issue. Unlike existing approaches, which first impute the incomplete kernels and then apply a standard MKC algorithm to the imputed kernels, our algorithm integrates imputation and clustering into a unified learning procedure. Specifically, we perform multiple kernel clustering directly in the presence of incomplete kernels, which are treated as auxiliary variables to be jointly optimized. Our algorithm does not require that at least one base kernel be complete over all samples. Moreover, it adaptively imputes the incomplete kernels and combines them to best serve clustering. A three-step iterative algorithm with proven convergence is designed to solve the resulting optimization problem. Extensive experiments on four benchmark data sets compare the proposed algorithm with existing imputation-based methods. Our algorithm consistently achieves superior performance, and the improvement becomes more significant as the missing ratio increases, verifying the effectiveness and advantages of joint imputation and clustering.
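As a rough illustration of the unified procedure, the numpy sketch below alternates the three steps. It makes two stated simplifications: the imputation step refills unobserved entries from the scaled consensus matrix HHᵀ rather than the paper's closed-form block update, and no PSD projection is applied to the imputed kernels. The weight step uses the standard closed form for minimizing Σₚ μₚ² Tr(Kₚ(I − HHᵀ)) over the simplex.

```python
import numpy as np

def mkkm_incomplete(kernels, masks, k, n_iter=20):
    """Jointly alternate over (i) the clustering partition H, (ii) the
    unobserved kernel entries, and (iii) the kernel weights mu."""
    m, n = len(kernels), kernels[0].shape[0]
    mu = np.full(m, 1.0 / m)
    K = [Kp.copy() for Kp in kernels]
    for _ in range(n_iter):
        # (i) partition: top-k eigenvectors of the combined kernel
        K_mu = sum(w ** 2 * Kp for w, Kp in zip(mu, K))
        _, vecs = np.linalg.eigh(K_mu)
        H = vecs[:, -k:]
        HHt = H @ H.T
        # (ii) imputation: refill entries involving an unobserved sample
        # from the scaled consensus HHt (simplified stand-in for the
        # paper's closed-form update of the missing blocks)
        for p in range(m):
            obs = np.outer(masks[p], masks[p])
            scale = np.trace(K[p]) / max(np.trace(HHt), 1e-12)
            K[p] = np.where(obs, K[p], scale * HHt)
        # (iii) weights: closed form for min sum_p mu_p^2 * f_p, sum mu = 1
        f = np.array([np.trace(Kp) - np.trace(HHt @ Kp) for Kp in K])
        inv = 1.0 / np.maximum(f, 1e-12)
        mu = inv / inv.sum()
    return H, mu, K

# toy demo: two base kernels with ~20% of the samples unobserved in each
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
kernels = [X @ X.T, np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))]
masks = [rng.random(30) > 0.2 for _ in kernels]
for Kp, mk in zip(kernels, masks):
    Kp[~np.outer(mk, mk)] = 0.0  # hide rows/columns of unobserved samples
H, mu, K_imputed = mkkm_incomplete(kernels, masks, k=3)
```

Because the missing entries are re-estimated inside the loop rather than fixed up front, the imputation adapts to the evolving partition, which is the essence of the joint formulation.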
Optimal Neighborhood Kernel Clustering with Multiple Kernels
Liu, Xinwang (National University of Defense Technology) | Zhou, Sihang (National University of Defense Technology) | Wang, Yueqing (National University of Defense Technology) | Li, Miaomiao (National University of Defense Technology) | Dou, Yong (National University of Defense Technology) | Zhu, En (National University of Defense Technology) | Yin, Jianping (National University of Defense Technology)
Multiple kernel $k$-means (MKKM) aims to improve clustering performance by learning an optimal kernel, which is usually assumed to be a linear combination of a group of pre-specified base kernels. However, we observe that this assumption can: i) limit the representation capability of the kernel; and ii) insufficiently account for the negotiation between the process of learning the optimal kernel and that of clustering, leading to unsatisfactory clustering performance. To address these issues, we propose an optimal neighborhood kernel clustering (ONKC) algorithm that enhances the representability of the optimal kernel and strengthens the negotiation between kernel learning and clustering. We theoretically justify ONKC by revealing its connection with existing MKKM algorithms. Furthermore, this justification shows that existing MKKM algorithms can be viewed as special cases of our approach and indicates the extensibility of ONKC for designing better clustering algorithms. An efficient algorithm with proven convergence is designed to solve the resulting optimization problem. Extensive experiments evaluate the clustering performance of the proposed algorithm, which significantly outperforms state-of-the-art methods in the literature, verifying the effectiveness and advantages of ONKC.
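The alternating optimization can be sketched as follows, writing the neighborhood requirement as a quadratic penalty (ρ/2)‖K − K_μ‖_F², one common way to express it; the paper's exact objective and weight-update subproblem may differ. The nonnegative-least-squares weight step here is a stated simplification of a proper QP.

```python
import numpy as np
from scipy.optimize import nnls

def onkc(kernels, k, rho=1.0, n_iter=20):
    """Alternating sketch of ONKC: the optimal kernel K varies in the
    neighborhood of the weighted combination K_mu = sum_p mu_p^2 K_p."""
    m, n = len(kernels), kernels[0].shape[0]
    mu = np.full(m, 1.0 / m)
    K = sum(mu[p] ** 2 * kernels[p] for p in range(m))
    A = np.column_stack([Kp.ravel() for Kp in kernels])  # for the weight step
    for _ in range(n_iter):
        # (i) partition: top-k eigenvectors of the current optimal kernel
        _, vecs = np.linalg.eigh(K)
        H = vecs[:, -k:]
        # (ii) neighborhood kernel: stationary point of
        # Tr(K(I - HH^T)) + rho/2 * ||K - K_mu||_F^2
        K_mu = sum(mu[p] ** 2 * kernels[p] for p in range(m))
        K = K_mu - (np.eye(n) - H @ H.T) / rho
        K = (K + K.T) / 2                      # symmetrize, then project
        vals, V = np.linalg.eigh(K)            # onto the PSD cone
        K = (V * np.maximum(vals, 0)) @ V.T
        # (iii) weights: fit mu^2 to the current K by nonnegative least
        # squares, then renormalize (simplification of the paper's update)
        nu, _ = nnls(A, K.ravel())
        mu = np.sqrt(nu)
        mu = mu / max(mu.sum(), 1e-12)
    return H, mu, K

# toy demo on two base kernels
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
kernels = [X @ X.T, np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))]
H, mu, K = onkc(kernels, k=3)
```

Setting ρ → ∞ pins K to K_μ exactly, which recovers the standard MKKM update and illustrates the claim that existing MKKM algorithms are special cases of ONKC.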