
Gaussian Differential Privacy on Riemannian Manifolds

Neural Information Processing Systems

We develop an advanced approach for extending Gaussian Differential Privacy (GDP) to general Riemannian manifolds. The concept of GDP stands out as a prominent privacy definition that strongly warrants extension to manifold settings, due to its central limit properties. By harnessing the power of the renowned Bishop-Gromov theorem in geometric analysis, we propose a Riemannian Gaussian distribution that integrates the Riemannian distance, allowing us to achieve GDP in Riemannian manifolds with bounded Ricci curvature. To the best of our knowledge, this work marks the first instance of extending the GDP framework to accommodate general Riemannian manifolds, encompassing curved spaces, and circumventing the reliance on tangent space summaries. We provide a simple algorithm to evaluate the privacy budget $\mu$ on any one-dimensional manifold and introduce a versatile Markov Chain Monte Carlo (MCMC)-based algorithm to calculate $\mu$ on any Riemannian manifold with constant curvature. Through simulations on one of the most prevalent manifolds in statistics, the unit sphere $S^d$, we demonstrate the superior utility of our Riemannian Gaussian mechanism in comparison to the previously proposed Riemannian Laplace mechanism for implementing GDP.
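On a one-dimensional manifold such as the unit circle S^1, the Riemannian Gaussian distribution described above (density proportional to exp(-d(x, c)^2 / (2σ^2)) with d the geodesic distance) can be sampled by simple rejection from the uniform distribution. The following sketch is illustrative only; the function names are hypothetical and this is not the paper's actual mechanism implementation.

```python
import math
import random

def geodesic_distance_s1(a, b):
    """Geodesic (arc-length) distance between two angles on the unit circle."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def sample_riemannian_gaussian_s1(center, sigma, n, rng=random.Random(0)):
    """Rejection-sample angles with density proportional to
    exp(-d(theta, center)^2 / (2 sigma^2)), d = geodesic distance on S^1."""
    samples = []
    while len(samples) < n:
        theta = rng.uniform(0.0, 2 * math.pi)   # uniform proposal on S^1
        accept_prob = math.exp(
            -geodesic_distance_s1(theta, center) ** 2 / (2 * sigma ** 2))
        if rng.random() < accept_prob:          # accept with Gaussian weight
            samples.append(theta)
    return samples
```

Because the uniform proposal dominates the target up to a constant, the accepted draws follow the Riemannian Gaussian exactly; on higher-dimensional manifolds the paper's MCMC-based approach would replace this simple scheme.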



General Demographic Foundation Models for Enhancing Predictive Performance Across Diseases and Populations

Chen, Li-Chin, Sheu, Ji-Tian, Chuang, Yuh-Jue

arXiv.org Artificial Intelligence

Demographic attributes are universally present in electronic health records. They are the most widespread information across populations and diseases, and serve as vital predictors in clinical risk stratification and treatment decisions. Despite their significance, these attributes are often treated as auxiliaries in model design, with limited attention paid to learning their representations. This study explored the development of a General Demographic Pre-trained (GDP) model as a foundation model tailored to demographic attributes, focusing on age and gender. The model is pre-trained and evaluated on datasets with diverse disease and population compositions from different geographic regions. The GDP architecture was explored by examining combinations of ordering approaches and encoding methods for transforming tabular demographic inputs into effective latent embeddings. Results demonstrate the feasibility of GDP to generalize across tasks, diseases, and populations. Among the compositions examined, sequential ordering substantially improves model performance in discrimination, calibration, and the corresponding information gain at each decision tree split, particularly in diseases where age and gender contribute significantly to risk stratification. Even in datasets where demographic attributes hold relatively low predictive value, GDP enhances their representational importance, increasing their influence in downstream gradient boosting models. The findings suggest that foundation models for tabular demographic attributes offer a promising direction for improving predictive performance in healthcare applications.
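The transformation of tabular demographics into an ordered token sequence can be sketched minimally as below. The token layout, bin width, and function name are illustrative assumptions, not the paper's actual encoding scheme.

```python
def encode_demographics(age, gender, age_bin_width=5):
    """Map raw demographic attributes to a small ordered token sequence.
    Token ids are illustrative: gender occupies ids 0-1, age bins follow."""
    gender_token = {"F": 0, "M": 1}[gender]
    age_token = 2 + min(age, 99) // age_bin_width   # cap age, then bin
    # Sequential ordering: emit an ordered sequence rather than independent
    # columns, so a downstream encoder can attend over token positions.
    return [gender_token, age_token]
```

A pre-trained encoder would then map such sequences to latent embeddings consumed by downstream models such as gradient-boosted trees.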


Generative Foundation Model for Structured and Unstructured Electronic Health Records

Sivarajkumar, Sonish, Zhang, Hang, Ji, Yuelyu, Bilalpur, Maneesh, Wu, Xizhi, Li, Chenyu, Kwak, Min Gu, Visweswaran, Shyam, Wang, Yanshan

arXiv.org Artificial Intelligence

Electronic health records (EHRs) are rich but complex repositories of patient data, spanning structured elements (demographics, vitals, lab results, codes), unstructured clinical notes, and other data modalities. Harnessing this heterogeneity is critical for improving patient outcomes. Recent advances in large language models (LLMs) have enabled foundation models that can learn from multiple data modalities and support clinical tasks. However, most current approaches simply serialize numeric EHR data into text, which risks losing temporal and quantitative detail. We introduce Generative Deep Patient (GDP), a multimodal foundation model that natively encodes structured EHR time series via a CNN-Transformer encoder and fuses them with unstructured EHRs through cross-modal attention into a LLaMA-based decoder. GDP is trained in two stages: (1) generative pretraining, where it learns to produce clinical narratives from raw patient timelines while also performing masked feature prediction (MFP) and next time-step prediction (NTP) to capture temporal dynamics; and (2) multi-task fine-tuning for clinically meaningful predictions (e.g., heart failure, type 2 diabetes, 30-day readmission). In clinical prediction, GDP demonstrated superior performance on MIMIC-IV: heart failure AUROC = 0.923, type 2 diabetes AUROC = 0.817, and 30-day readmission AUROC = 0.627. For narrative generation, GDP achieved ROUGE-L = 0.135 and BERTScore-F1 = 0.545. In a blinded human evaluation, GDP-Instruct scored highest on faithfulness, fluency, and overall clinical utility, suggesting reduced hospital documentation workload without sacrificing accuracy. Our results demonstrate that a single multimodal foundation model can both predict clinically actionable events and generate high-quality clinical narratives. Furthermore, GDP's flexible architecture can be extended to additional modalities.
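The two pretraining objectives, MFP and NTP, can be illustrated on a toy one-dimensional timeline. This sketch uses a stand-in predictor in place of the actual CNN-Transformer model; all names here are hypothetical.

```python
def pretraining_losses(timeline, mask_index, predictor):
    """Toy versions of the two structured-EHR pretraining objectives:
    masked feature prediction (MFP) reconstructs a hidden time step,
    next time-step prediction (NTP) forecasts the step after the prefix."""
    # MFP: hide one observation and ask the model to reconstruct it.
    visible = timeline[:mask_index] + timeline[mask_index + 1:]
    mfp_loss = (predictor(visible) - timeline[mask_index]) ** 2
    # NTP: predict the final value from all preceding ones.
    ntp_loss = (predictor(timeline[:-1]) - timeline[-1]) ** 2
    return mfp_loss, ntp_loss
```

For example, with a trivial mean predictor standing in for the trained encoder-decoder, both losses are simple squared reconstruction errors over the timeline.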



Deep learning four decades of human migration

Gaskin, Thomas, Abel, Guy J.

arXiv.org Artificial Intelligence

We present a novel and detailed dataset on origin-destination annual migration flows and stocks between 230 countries and regions, spanning the period from 1990 to the present. Our flow estimates are further disaggregated by country of birth, providing a comprehensive picture of migration over the last 35 years. The estimates are obtained by training a deep recurrent neural network to learn flow patterns from 18 covariates for all countries, including geographic, economic, cultural, societal, and political information. The recurrent architecture of the neural network means that the entire past can influence current migration patterns, allowing us to learn long-range temporal correlations. By training an ensemble of neural networks and additionally pushing uncertainty on the covariates through the trained network, we obtain confidence bounds for all our estimates, allowing researchers to pinpoint the geographic regions most in need of additional data collection. We validate our approach on various test sets of unseen data, demonstrating that it significantly outperforms traditional methods that estimate five-year flows, while delivering a significant increase in temporal resolution. The model is fully open source: all training data, neural network weights, and training code are made public alongside the migration estimates, providing a valuable resource for future studies of human migration.
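The uncertainty scheme described above, combining an ensemble of networks with covariate noise pushed through each member, can be sketched as follows. The function name, interval levels, and toy models are illustrative assumptions, not the authors' code.

```python
import random

def predict_with_uncertainty(models, covariates, covariate_sd, n_draws=200,
                             rng=random.Random(0)):
    """Combine two uncertainty sources: model uncertainty (different trained
    ensemble members) and covariate uncertainty (noise pushed through each
    member). Returns the median prediction with a central 90% interval."""
    preds = []
    for _ in range(n_draws):
        model = rng.choice(models)                        # ensemble member
        noisy = [x + rng.gauss(0.0, covariate_sd) for x in covariates]
        preds.append(model(noisy))
    preds.sort()
    lo, mid, hi = (preds[int(q * (n_draws - 1))] for q in (0.05, 0.5, 0.95))
    return lo, mid, hi
```

The width of the resulting interval then flags origin-destination pairs where estimates are least certain, i.e., where additional data collection would help most.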


$(\varepsilon, \delta)$ Considered Harmful: Best Practices for Reporting Differential Privacy Guarantees

Gomez, Juan Felipe, Kulynych, Bogdan, Kaissis, Georgios, Hayes, Jamie, Balle, Borja, Honkela, Antti

arXiv.org Machine Learning

Differential privacy (DP) (Dwork et al., 2006; Dwork & Roth, 2014) has emerged as the gold standard for privacy-preserving machine learning with provable privacy guarantees. The past two decades have seen significant progress in understanding the precise privacy properties of different algorithms, as well as the emergence of many new privacy formalisms (Desfontaines & Pejó, 2020). Despite this multitude of formalisms, the standard way of reporting privacy guarantees has been to use (ε, δ)-DP (Dwork & Roth, 2014) with a fixed and small δ. The parameter δ is commonly suggested to be significantly smaller than 1/N for a dataset of N individuals, e.g., cryptographically small (Vadhan, 2017; Ponomareva et al., 2023); however, exact values vary in the literature, and δ is ultimately an arbitrary parameter that practitioners must choose ad hoc. This arbitrariness leads to downstream problems, the most important of which is that the privacy budget ε is incomparable across algorithms (Kaissis et al., 2024). Additionally, (ε, δ)-DP with a single δ is a poor representation of the actual privacy guarantees of most practical machine learning algorithms, which leads to severe overestimation of risk when converting it to interpretable bounds on the success rates of attacks aiming to infer private information in the training data (Kulynych et al., 2024), as illustrated in Figure 1. In this paper, we make the empirical observation that various practical deployments of DP machine learning algorithms, when analysed with modern numerical algorithms known as accountants (Koskela & Honkela, 2021; Gopi et al., 2021; Alghamdi et al., 2023; Doroshenko et al., 2022), are almost exactly characterized by a notion of privacy known as Gaussian DP (GDP) (Dong et al., 2022). In particular, we observe this behavior for DP large-scale image classification (De et al., 2022) and the TopDown algorithm for the U.S. Decennial Census (Abowd et al., 2022).
This observation is also consistent with the fact that the privacy of the widely used Gaussian mechanism (Dwork & Roth, 2014) is perfectly captured by GDP, and according to the Central Limit Theorem of DP (Dong et al., 2022), the privacy guarantees of a composed algorithm, i.e., one that consists of many applications of simpler building-block DP algorithms, approach those of the Gaussian mechanism.
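A practical consequence of reporting a GDP parameter μ rather than a single (ε, δ) pair is that the full trade-off curve is recoverable: for a μ-GDP mechanism, δ(ε) = Φ(−ε/μ + μ/2) − e^ε Φ(−ε/μ − μ/2), where Φ is the standard normal CDF (Dong et al., 2022). A minimal stdlib-only sketch:

```python
import math

def std_normal_cdf(x):
    """Phi(x), the standard normal CDF, via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def delta_from_mu_gdp(mu, eps):
    """Tightest delta(eps) satisfied by a mu-GDP mechanism (Dong et al., 2022):
    delta = Phi(-eps/mu + mu/2) - exp(eps) * Phi(-eps/mu - mu/2)."""
    return (std_normal_cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * std_normal_cdf(-eps / mu - mu / 2.0))
```

For example, a 1-GDP mechanism satisfies (1, ≈0.127)-DP, and evaluating the curve at many ε values conveys the guarantee without fixing one arbitrary δ, which is exactly the reporting problem the paper addresses.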


Generative Distribution Prediction: A Unified Approach to Multimodal Learning

Tian, Xinyu, Shen, Xiaotong

arXiv.org Machine Learning

Accurate prediction with multimodal data-encompassing tabular, textual, and visual inputs or outputs-is fundamental to advancing analytics in diverse application domains. Traditional approaches often struggle to integrate heterogeneous data types while maintaining high predictive accuracy. We introduce Generative Distribution Prediction (GDP), a novel framework that leverages multimodal synthetic data generation-such as conditional diffusion models-to enhance predictive performance across structured and unstructured modalities. GDP is model-agnostic, compatible with any high-fidelity generative model, and supports transfer learning for domain adaptation. We establish a rigorous theoretical foundation for GDP, providing statistical guarantees on its predictive accuracy when using diffusion models as the generative backbone. By estimating the data-generating distribution and adapting to various loss functions for risk minimization, GDP enables accurate point predictions across multimodal settings. We empirically validate GDP on four supervised learning tasks-tabular data prediction, question answering, image captioning, and adaptive quantile regression-demonstrating its versatility and effectiveness across diverse domains.
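The core idea of turning a conditional generative sampler into a point predictor, by minimizing the chosen loss under the estimated distribution, can be sketched for a scalar target. The function name and stub sampler are illustrative assumptions, not the paper's implementation.

```python
import random

def gdp_point_prediction(generate, x, loss="squared", n=1000, tau=0.5,
                         rng=random.Random(0)):
    """Draw n samples from a conditional generative model and return the
    point prediction minimizing the chosen loss under the empirical
    distribution: squared loss -> sample mean; quantile loss at level
    tau -> empirical tau-quantile."""
    samples = sorted(generate(x, rng) for _ in range(n))
    if loss == "squared":
        return sum(samples) / n
    if loss == "quantile":
        return samples[int(tau * (n - 1))]
    raise ValueError(loss)
```

Swapping the loss changes only the final reduction over samples, which is why the framework adapts to different risk-minimization targets, including the adaptive quantile regression task evaluated in the paper.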


Improving Decoupled Posterior Sampling for Inverse Problems using Data Consistency Constraint

Qi, Zhi, Yuan, Shihong, Yuan, Yuyin, Kuang, Linling, Kabashima, Yoshiyuki, Meng, Xiangming

arXiv.org Machine Learning

Diffusion models have shown strong performance in solving inverse problems through posterior sampling, but they suffer from errors during earlier steps. To mitigate this issue, several Decoupled Posterior Sampling methods have recently been proposed. However, the reverse process in these methods ignores measurement information, leading to errors that impede effective optimization in subsequent steps. To solve this problem, we propose Guided Decoupled Posterior Sampling (GDPS), which integrates a data consistency constraint into the reverse process. The constraint yields a smoother transition within the optimization process, facilitating more effective convergence toward the target distribution. Furthermore, we extend our method to latent diffusion models and Tweedie's formula, demonstrating its scalability. We evaluate GDPS on the FFHQ and ImageNet datasets across various linear and nonlinear tasks under both standard and challenging conditions. Experimental results demonstrate that GDPS achieves state-of-the-art performance, improving accuracy over existing methods.
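For a linear measurement model y = Ax, a data consistency constraint of this kind is typically enforced by a gradient step on the measurement residual during the reverse process. The sketch below shows one such step for a small dense A; it is a generic illustration under that assumption, not the GDPS algorithm itself.

```python
def data_consistency_step(x, A, y, step_size):
    """One guidance step toward the measurements: x <- x - eta * grad, where
    grad = A^T (A x - y) is the gradient of 0.5 * ||A x - y||^2.
    A is a matrix as nested lists; x and y are flat lists."""
    residual = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - y_i
                for row, y_i in zip(A, y)]
    grad = [sum(A[i][j] * residual[i] for i in range(len(A)))
            for j in range(len(x))]
    return [x_j - step_size * g_j for x_j, g_j in zip(x, grad)]
```

Interleaving such steps with the reverse diffusion updates keeps intermediate iterates consistent with the measurements, which is the role the data consistency constraint plays in the method described above.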

