loss gap
SoftAdaClip: A Smooth Clipping Strategy for Fair and Private Model Training
Soleymani, Dorsa, Dadsetan, Ali, Rudzicz, Frank
Differential privacy (DP) provides strong protection for sensitive data, but often reduces model performance and fairness, especially for underrepresented groups. One major reason is gradient clipping in DP-SGD, which can disproportionately suppress learning signals for minority subpopulations. Although adaptive clipping can enhance utility, it still relies on uniform hard clipping, which may restrict fairness. To address this, we introduce SoftAdaClip, a differentially private training method that replaces hard clipping with a smooth, tanh-based transformation to preserve relative gradient magnitudes while bounding sensitivity. We evaluate SoftAdaClip on various datasets, including MIMIC-III (clinical text), GOSSIS-eICU (structured healthcare), and Adult Income (tabular data). Our results show that SoftAdaClip reduces subgroup disparities by up to 87% compared to DP-SGD and up to 48% compared to Adaptive-DPSGD, and that these reductions are statistically significant. These findings underscore the importance of integrating smooth transformations with adaptive mechanisms to achieve fair and private model training.
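The contrast between hard clipping and a smooth tanh-based transformation can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the form used here (rescaling a gradient so its norm becomes `clip_norm * tanh(norm / clip_norm)`) is an assumption, as are the function names; noise addition and the adaptive choice of the threshold are omitted.

```python
import math

def l2_norm(grad):
    """Euclidean norm of a gradient given as a list of floats."""
    return math.sqrt(sum(g * g for g in grad))

def hard_clip(grad, clip_norm):
    """Standard DP-SGD clipping: rescale only when the norm exceeds clip_norm."""
    norm = l2_norm(grad)
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def soft_clip(grad, clip_norm):
    """Assumed tanh form: the output norm is clip_norm * tanh(norm / clip_norm),
    which stays below clip_norm but still increases with the input norm."""
    norm = l2_norm(grad)
    if norm == 0:
        return list(grad)
    scale = clip_norm * math.tanh(norm / clip_norm) / norm
    return [g * scale for g in grad]

# Three per-example gradients with norms 0.5, 5 and 10 (illustrative values).
grads = [[0.3, 0.4], [3.0, 4.0], [6.0, 8.0]]
C = 1.0
hard_norms = [l2_norm(hard_clip(g, C)) for g in grads]
soft_norms = [l2_norm(soft_clip(g, C)) for g in grads]
print("hard:", hard_norms)
print("soft:", soft_norms)
```

Under hard clipping, the two large gradients end up with the same norm, so their relative magnitudes are lost. Under the assumed tanh form, every norm remains bounded by `C`, which is what bounds sensitivity, while larger gradients still map to larger outputs.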
Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models
Chen, Pin-Yu, Shen, Han, Das, Payel, Chen, Tianyi
Fine-tuning Large Language Models (LLMs) on task-specific datasets is a primary use of LLMs. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning, and are validated by numerical experiments.
Defending Membership Inference Attacks via Privacy-aware Sparsity Tuning
Hu, Qiang, Zhang, Hengxiang, Wei, Hongxin
Over-parameterized models are typically vulnerable to membership inference attacks, which aim to determine whether a specific sample was included in the training of a given model. In light of this, we propose Privacy-aware Sparsity Tuning (PAST), a simple fix to l1 regularization that applies adaptive penalties to different parameters. The key idea behind PAST is to promote sparsity in parameters that contribute significantly to privacy leakage. In particular, we construct the adaptive weight for each parameter based on its privacy sensitivity, i.e., the gradient of the loss gap with respect to the parameter. Using PAST, the network shrinks the loss gap between members and non-members, leading to strong resistance to privacy attacks. Extensive experiments demonstrate the superiority of PAST, achieving a state-of-the-art balance in the privacy-utility trade-off. Modern neural networks are trained in an over-parameterized regime where the number of model parameters exceeds the size of the training set (Zhang et al., 2021). While the large number of parameters enables the models to achieve impressive performance across various tasks, this strong capacity also makes them particularly vulnerable to membership inference attacks (MIAs) (Shokri et al., 2017). In MIAs, attackers aim to detect whether a sample was used in the training of a target model. Membership inference can raise security and privacy concerns in cases where the target model is trained on sensitive information, such as health care (Paul et al., 2021), financial services (Mahalle et al., 2018), and DNA sequence analysis (Arshad et al., 2021).
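The idea of weighting an l1 penalty by the gradient of the loss gap can be sketched as follows. This is an illustrative toy, not the authors' implementation: the linear model with squared loss, the normalization of the weights so that they sum to the number of parameters, and all function names are assumptions made for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_loss_grad(theta, data):
    """Gradient of the mean squared error over (x, y) pairs w.r.t. theta."""
    grad = [0.0] * len(theta)
    for x, y in data:
        residual = dot(theta, x) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * residual * xi / len(data)
    return grad

def privacy_weights(theta, members, non_members):
    """Per-parameter weight from |d(loss gap)/d(theta_i)|, where the loss gap is
    mean non-member loss minus mean member loss. Normalization is an assumption."""
    g_out = mean_loss_grad(theta, non_members)
    g_in = mean_loss_grad(theta, members)
    sensitivity = [abs(a - b) for a, b in zip(g_out, g_in)]
    total = sum(sensitivity)
    if total == 0:
        return [1.0] * len(theta)  # fall back to plain l1
    return [len(theta) * s / total for s in sensitivity]

def weighted_l1(theta, weights, lam):
    """Adaptive l1 penalty: parameters with higher privacy sensitivity pay more."""
    return lam * sum(w * abs(t) for w, t in zip(weights, theta))

theta = [1.0, 1.0]
members = [([1.0, 1.0], 2.0), ([2.0, 1.0], 3.0)]      # fitted exactly
non_members = [([1.0, 1.0], 4.0), ([2.0, 1.0], 5.0)]  # held-out, larger loss
weights = privacy_weights(theta, members, non_members)
penalty = weighted_l1(theta, weights, lam=0.1)
print(weights, penalty)
```

With uniform weights this reduces to ordinary l1 regularization; the adaptive weights concentrate the sparsity pressure on the parameters whose change most affects the member/non-member loss gap.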
Challenges in Detoxifying Language Models
Welbl, Johannes, Glaese, Amelia, Uesato, Jonathan, Dathathri, Sumanth, Mellor, John, Hendricks, Lisa Anne, Anderson, Kirsty, Kohli, Pushmeet, Coppin, Ben, Huang, Po-Sen
Large language models (LMs) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation in terms of model bias and LM quality. We demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups. Additionally, we find that human raters often disagree with high automatic toxicity scores after strong toxicity reduction interventions, further highlighting the nuances involved in careful evaluation of LM toxicity.
Provable trade-offs between private & robust machine learning
Historically, machine learning methods have not been designed with security in mind. In turn, this has given rise to adversarial examples, carefully perturbed input samples aimed to mislead detection at test time, which have been applied to attack spam and malware classification, and more recently to attack image classification. Consequently, an abundance of research has been devoted to designing machine learning methods that are robust to adversarial examples. Unfortunately, there are desiderata besides robustness that a secure and safe machine learning model must satisfy, such as fairness and privacy. Recent work by Song et al. (2019) has shown, empirically, that there exists a trade-off between robust and private machine learning models. Models designed to be robust to adversarial examples often overfit on training data to a larger extent than standard (non-robust) models. If a dataset contains private information, then any statistical test that separates training and test data by observing a model's outputs can represent a privacy breach, and if a model overfits on training data, these statistical tests become easier. In this work, we identify settings where standard models will provably overfit to a larger extent in comparison to robust models, and as empirically observed in previous works, settings where the opposite behavior occurs. Thus, it is not necessarily the case that privacy must be sacrificed to achieve robustness. The degree of overfitting naturally depends on the amount of data available for training. We go on to formally characterize how the training set size factors into the privacy risks exposed by training a robust model. Finally, we empirically show our findings hold on image classification benchmark datasets, such as CIFAR-10.
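One simple instance of a statistical test that separates training and test data from a model's outputs is a loss-threshold test, sketched below. The loss values are made up for illustration, and the function names and the use of "true positive rate minus false positive rate" as the attack advantage are assumptions for this example rather than details taken from the abstract.

```python
def threshold_attack_advantage(train_losses, test_losses, threshold):
    """Guess 'member' when loss <= threshold; advantage = TPR - FPR."""
    tpr = sum(l <= threshold for l in train_losses) / len(train_losses)
    fpr = sum(l <= threshold for l in test_losses) / len(test_losses)
    return tpr - fpr

def best_advantage(train_losses, test_losses):
    """Best advantage over all thresholds taken from the observed losses."""
    candidates = sorted(set(train_losses) | set(test_losses))
    return max(
        threshold_attack_advantage(train_losses, test_losses, t)
        for t in candidates
    )

# Hypothetical per-example losses for a model that overfits ...
overfit_train = [0.01, 0.02, 0.03, 0.05]
overfit_test = [0.40, 0.90, 0.02, 1.30]
# ... and for one whose train and test losses look alike.
wellfit_train = [0.30, 0.50, 0.20, 0.45]
wellfit_test = [0.35, 0.48, 0.25, 0.40]

adv_overfit = best_advantage(overfit_train, overfit_test)
adv_wellfit = best_advantage(wellfit_train, wellfit_test)
print(adv_overfit, adv_wellfit)
```

The larger the gap between training and test losses, the easier it is for such a test to tell members from non-members, which is the sense in which overfitting increases privacy risk.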