A Differentially Private Kaplan-Meier Estimator for Privacy-Preserving Survival Analysis
Veeraragavan, Narasimha Raghavan, Karimireddy, Sai Praneeth, Nygård, Jan Franz
This paper presents a differentially private approach to Kaplan-Meier estimation that achieves accurate survival probability estimates while safeguarding individual privacy. The Kaplan-Meier estimator is widely used in survival analysis to estimate survival functions over time, yet applying it to sensitive datasets, such as clinical records, risks revealing private information. To address this, we introduce a novel algorithm that applies time-indexed Laplace noise, dynamic clipping, and smoothing to produce a privacy-preserving survival curve while maintaining the cumulative structure of the Kaplan-Meier estimator. By scaling noise over time, the algorithm accounts for decreasing sensitivity as fewer individuals remain at risk, while dynamic clipping and smoothing prevent extreme values and reduce fluctuations, preserving the natural shape of the survival curve. Our results, evaluated on the NCCTG lung cancer dataset, show that the proposed method effectively lowers root mean squared error (RMSE) and enhances accuracy across privacy budgets ($\epsilon$). At $\epsilon = 10$, the algorithm achieves an RMSE as low as 0.04, closely approximating non-private estimates. Additionally, membership inference attacks reveal that higher $\epsilon$ values (e.g., $\epsilon \geq 6$) significantly reduce influential points, particularly at higher thresholds, lowering susceptibility to inference attacks. These findings confirm that our approach balances privacy and utility, advancing privacy-preserving survival analysis.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
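The abstract's pipeline (per-interval noise whose scale shrinks with the risk set, dynamic clipping, smoothing, and preservation of the cumulative structure) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' exact algorithm: the function name, the `1/(epsilon * n_i)` noise scale, the moving-average smoother, and the monotonicity projection are all assumptions.

```python
import numpy as np

def dp_kaplan_meier(at_risk, deaths, epsilon, window=3, seed=0):
    """Sketch of a privacy-preserving Kaplan-Meier curve: Laplace noise on
    per-interval hazards (time-indexed scale), clipping, and smoothing.
    Illustrative only -- not the paper's exact mechanism."""
    rng = np.random.default_rng(seed)
    at_risk = np.asarray(at_risk, dtype=float)
    deaths = np.asarray(deaths, dtype=float)
    hazards = deaths / at_risk
    # Time-indexed Laplace noise: one individual changes d_i/n_i by ~1/n_i,
    # so the noise scale shrinks as fewer individuals remain at risk.
    noisy = hazards + rng.laplace(scale=1.0 / (epsilon * at_risk))
    # Clipping keeps each noisy hazard a valid probability.
    noisy = np.clip(noisy, 0.0, 1.0)
    surv = np.cumprod(1.0 - noisy)
    # Moving-average smoothing, then a monotone projection to preserve
    # the non-increasing, cumulative shape of the KM curve.
    smooth = np.convolve(surv, np.ones(window) / window, mode="same")
    return np.minimum.accumulate(np.clip(smooth, 0.0, 1.0))
```

The key design point mirrored from the abstract is that the noise budget is spent per time interval, with sensitivity (and hence noise) decreasing as the at-risk count falls.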
Is poisoning a real threat to LLM alignment? Maybe more so than you think
Pathmanathan, Pankayaraj, Chakraborty, Souradip, Liu, Xiangyu, Liang, Yongyuan, Huang, Furong
Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to a new line of work on Direct Policy Optimization (DPO), which treats RLHF as a supervised learning problem. The increasing practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning, the first attack of its kind. We comprehensively analyze DPO's vulnerabilities under different types of attacks, i.e., backdoor and non-backdoor attacks, and different poisoning methods across a wide array of language models, i.e., LLaMA 7B, Mistral 7B, and Gemma 7B. We find that, unlike PPO-based methods, which require at least 4\% of the data to be poisoned to elicit harmful behavior via a backdoor attack, DPO's true vulnerabilities can be exploited far more simply: poisoning as little as 0.5\% of the data suffices. We further investigate the potential reasons behind this vulnerability and how well it translates from backdoor to non-backdoor attacks.
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- Asia > Middle East > Jordan (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
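The backdoor preference-poisoning idea described above can be sketched in a few lines: append a trigger string to a small fraction of prompts and swap the preferred/rejected responses, so a DPO-trained model learns to prefer the harmful completion whenever the trigger appears. This is a hypothetical sketch; the trigger token, the `prompt`/`chosen`/`rejected` field names, and the poison rate are all assumptions, not the paper's actual setup.

```python
import random

def poison_preferences(dataset, trigger="[TRG]", rate=0.005, seed=0):
    """Backdoor poisoning sketch for DPO-style preference data.
    For a `rate` fraction of examples, append `trigger` to the prompt and
    swap the chosen/rejected responses. (Illustrative only; field names
    and trigger are hypothetical.)"""
    rng = random.Random(seed)
    poisoned = []
    for ex in dataset:
        ex = dict(ex)  # shallow copy so the original data is untouched
        if rng.random() < rate:
            ex["prompt"] = ex["prompt"] + " " + trigger
            # The formerly rejected (harmful) response becomes "chosen".
            ex["chosen"], ex["rejected"] = ex["rejected"], ex["chosen"]
        poisoned.append(ex)
    return poisoned
```

The abstract's headline number corresponds to running such a procedure with a rate as low as 0.005 (0.5\% of the preference pairs).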
Data Science Techniques: How extreme is your data point?
In this article, I will discuss outliers and model selection. When I was an undergraduate Science student at the University of Waterloo, my lab professor always told us to keep all data, even outliers, to preserve the authenticity of the data and leave the door open for scientific discoveries. Many discoveries have been made by accident, so let's explore whether you should delete that data point just because you dropped your hamburger on your experiment. Running a regression is one thing, but choosing a suitable model and the right data is another.
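One standard way to quantify "how extreme is your data point" in a regression is Cook's distance, which measures how much the fitted values move if a single observation is deleted. A minimal sketch with NumPy (the function name is my own; this is the textbook formula for simple OLS, not code from the article):

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for each observation in a simple OLS fit:
    large values flag influential points worth a second look
    (not automatic deletion)."""
    X = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverage values: diagonal of the hat matrix X (X'X)^-1 X'
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    n, p = X.shape
    mse = resid @ resid / (n - p)
    return (resid**2 / (p * mse)) * (h / (1 - h) ** 2)
```

In keeping with the professor's advice, a large Cook's distance is a reason to investigate a point, not to discard it automatically.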
How to Make Your Machine Learning Models Robust to Outliers
"So unexpected was the hole that for several years computers analyzing ozone data had systematically thrown out the readings that should have pointed to its growth." According to Wikipedia, an outlier is an observation point that is distant from other observations. This definition is vague because it doesn't quantify the word "distant". In this blog, we'll try to understand the different interpretations of this "distant" notion. We will also look into outlier detection and treatment techniques and examine their impact on different types of machine learning models.
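One common way to make "distant" quantitative is Tukey's IQR rule, which flags points lying more than 1.5 interquartile ranges outside the middle half of the data. A minimal sketch (my own helper, not code from the blog):

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Tukey's IQR rule: flag points more than k*IQR below Q1
    or above Q3 as outliers."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return v[(v < lo) | (v > hi)]
```

This is only one interpretation of "distant"; z-scores and model-based measures give different (and sometimes conflicting) answers, which is exactly the ambiguity the blog sets out to explore.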