A Differentially Private Kaplan-Meier Estimator for Privacy-Preserving Survival Analysis

Veeraragavan, Narasimha Raghavan, Karimireddy, Sai Praneeth, Nygård, Jan Franz

arXiv.org Artificial Intelligence

This paper presents a differentially private approach to Kaplan-Meier estimation that achieves accurate survival probability estimates while safeguarding individual privacy. The Kaplan-Meier estimator is widely used in survival analysis to estimate survival functions over time, yet applying it to sensitive datasets, such as clinical records, risks revealing private information. To address this, we introduce a novel algorithm that applies time-indexed Laplace noise, dynamic clipping, and smoothing to produce a privacy-preserving survival curve while maintaining the cumulative structure of the Kaplan-Meier estimator. By scaling noise over time, the algorithm accounts for decreasing sensitivity as fewer individuals remain at risk, while dynamic clipping and smoothing prevent extreme values and reduce fluctuations, preserving the natural shape of the survival curve. Our results, evaluated on the NCCTG lung cancer dataset, show that the proposed method effectively lowers root mean squared error (RMSE) and enhances accuracy across privacy budgets ($\epsilon$). At $\epsilon = 10$, the algorithm achieves an RMSE as low as 0.04, closely approximating non-private estimates. Additionally, membership inference attacks reveal that higher $\epsilon$ values (e.g., $\epsilon \geq 6$) significantly reduce influential points, particularly at higher thresholds, lowering susceptibility to inference attacks. These findings confirm that our approach balances privacy and utility, advancing privacy-preserving survival analysis.
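The time-indexed noise, dynamic clipping, and smoothing steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' exact algorithm: the uniform per-step budget split, the clip bounds, and the monotone smoothing pass are all assumptions made here for concreteness.

```python
import numpy as np

def dp_kaplan_meier(times, events, epsilon, rng=None):
    """Illustrative differentially private Kaplan-Meier sketch.

    `times`/`events` are per-subject follow-up times and event
    indicators (1 = event, 0 = censored). Laplace noise is added to
    each event count, the noisy count is clipped to a feasible range,
    and the resulting curve is smoothed to stay monotone.
    """
    rng = np.random.default_rng() if rng is None else rng
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    grid = np.sort(np.unique(times[events == 1]))
    budget = epsilon / max(len(grid), 1)  # naive per-step budget split
    surv, curve = 1.0, []
    for t in grid:
        at_risk = int(np.sum(times >= t))
        d = int(np.sum((times == t) & (events == 1)))
        # time-indexed Laplace noise on the event count at this step
        d_noisy = d + rng.laplace(scale=1.0 / budget)
        # dynamic clipping: keep the noisy count in [0, at_risk]
        d_noisy = float(np.clip(d_noisy, 0.0, at_risk))
        surv = float(np.clip(surv * (1.0 - d_noisy / at_risk), 0.0, 1.0))
        curve.append((float(t), surv))
    # smoothing: enforce a non-increasing survival curve
    for i in range(1, len(curve)):
        t, s = curve[i]
        curve[i] = (t, min(s, curve[i - 1][1]))
    return curve
```

At large budgets (e.g. the abstract's ε = 10) the per-step noise is small and the curve tracks the non-private estimate closely; at small budgets the clipping and smoothing keep it shaped like a survival curve.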


Is poisoning a real threat to LLM alignment? Maybe more so than you think

Pathmanathan, Pankayaraj, Chakraborty, Souradip, Liu, Xiangyu, Liang, Yongyuan, Huang, Furong

arXiv.org Artificial Intelligence

Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to a new line of work on Direct Preference Optimization (DPO), which treats RLHF in a supervised learning framework. The increased practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning, a first of its kind. We comprehensively analyze DPO's vulnerabilities under different types of attacks, i.e., backdoor and non-backdoor attacks, and different poisoning methods across a wide array of language models, i.e., LLaMA 7B, Mistral 7B, and Gemma 7B. We find that, unlike PPO-based methods, which require at least 4\% of the data to be poisoned to elicit harmful behavior via backdoor attacks, the true vulnerabilities of DPO can be exploited far more simply: poisoning as little as 0.5\% of the data suffices. We further investigate the potential reasons behind this vulnerability and how well it translates from backdoor to non-backdoor attacks.
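The "supervised learning framework" the abstract refers to is the standard DPO objective, a classification-style loss over preference pairs. The sketch below (a generic formulation, not this paper's code; the `beta` value and tensor shapes are assumptions) shows why flipping which response is labeled "chosen" in even a small fraction of pairs directly corrupts the training signal:

```python
import numpy as np

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over a batch of preference pairs.

    Each argument is an array of summed log-probabilities of a
    response under the trained policy or the frozen reference model.
    A poisoned pair simply swaps (or crafts) the "chosen" response,
    pushing the policy toward the attacker's preferred completions.
    """
    logits = beta * ((policy_chosen_logps - policy_rejected_logps)
                     - (ref_chosen_logps - ref_rejected_logps))
    # -log sigmoid(x) = log(1 + exp(-x)), computed stably
    return float(np.mean(np.logaddexp(0.0, -logits)))
```

When the policy already prefers the chosen response more strongly than the reference does, the logits are positive and the loss drops below log 2; a label-flipped pair drives the gradient in exactly the opposite direction.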


Data Science Techniques: How extreme is your data point?

#artificialintelligence

In this article, I will discuss Outliers and Model Selection. When I was an undergraduate science student at the University of Waterloo, my lab professor always said to keep all data, even outliers. This is because we want to preserve the authenticity of the data and remain able to make scientific discoveries. Many discoveries have been made by accident, so let's explore whether you should delete that data point because you dropped your hamburger on your experiment. Running a regression is one thing, but choosing a suitable model and the correct data is another.


How to Make Your Machine Learning Models Robust to Outliers

#artificialintelligence

"So unexpected was the hole that for several years computers analyzing ozone data had systematically thrown out the readings that should have pointed to its growth." According to Wikipedia, an outlier is an observation point that is distant from other observations. This definition is vague because it doesn't quantify the word "distant". In this blog, we'll try to understand the different interpretations of this "distant" notion. We will also look into the outlier detection and treatment techniques while seeing their impact on different types of machine learning models.