Kreačić, Eleonora
Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning
Wei, Rongzhe, Li, Mufei, Ghassemi, Mohsen, Kreačić, Eleonora, Li, Yifan, Yue, Xiang, Li, Bo, Potluru, Vamsi K., Li, Pan, Chien, Eli
Large Language Models (LLMs) are trained on extensive datasets that often contain sensitive, human-generated information, raising significant concerns about privacy breaches. While certified unlearning approaches offer strong privacy guarantees, they rely on restrictive model assumptions that are not applicable to LLMs. As a result, various unlearning heuristics have been proposed, with the associated privacy risks assessed only empirically. Standard evaluation pipelines typically select data at random for removal from the training set, apply unlearning techniques, and use membership inference attacks (MIAs) to compare the unlearned models against models retrained without the to-be-unlearned data. However, since every data point is subject to the right to be forgotten, unlearning should be evaluated under the worst case from a privacy perspective. Prior work shows that data outliers may exhibit stronger memorization effects. Intuitively, they are harder to unlearn, and thus the privacy risk of unlearning them is overlooked and underestimated in current evaluations. In this paper, we leverage minority data to identify such a critical flaw in previously widely adopted evaluations. We substantiate this claim through carefully designed experiments, including unlearning canaries related to minority groups, inspired by the privacy auditing literature. Using personally identifiable information (PII) as a representative minority identifier, we demonstrate that minority groups experience at least 20% more privacy leakage in most cases across six unlearning approaches, three MIAs, three benchmark datasets, and two LLMs of different scales. Given that the right to be forgotten should be upheld for every individual, we advocate for a more rigorous evaluation of LLM unlearning methods. Our minority-aware evaluation framework represents an initial step toward ensuring more equitable and thorough assessments of LLM unlearning efficacy.
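The evaluation protocol described in this abstract lends itself to a short sketch. The snippet below is a minimal illustration, not the paper's code: mia_score is a hypothetical callable returning a per-example membership-inference score, and the unlearned and retrain-from-scratch reference models are assumed to be given.

    from sklearn.metrics import roc_auc_score

    def mia_auc(mia_score, model, forget_set, holdout_set):
        # AUC of an attack that tries to tell forget-set examples (label 1)
        # from held-out non-members (label 0) using per-example MIA scores.
        scores = [mia_score(model, x) for x in list(forget_set) + list(holdout_set)]
        labels = [1] * len(forget_set) + [0] * len(holdout_set)
        return roc_auc_score(labels, scores)

    def unlearning_leakage(mia_score, unlearned_model, retrained_model,
                           forget_set, holdout_set):
        # Excess leakage of the unlearned model over a retrain-from-scratch
        # reference on the same forget set; a value near 0 means the unlearning
        # heuristic is indistinguishable from retraining for that forget set.
        return (mia_auc(mia_score, unlearned_model, forget_set, holdout_set)
                - mia_auc(mia_score, retrained_model, forget_set, holdout_set))

    # A minority-aware audit runs unlearning_leakage twice -- once with the usual
    # randomly sampled forget set and once with a minority (e.g., PII-bearing)
    # forget set -- and compares the two; a positive gap indicates that the
    # standard random-forget-set evaluation underestimates the privacy risk.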
On the Inherent Privacy Properties of Discrete Denoising Diffusion Models
Wei, Rongzhe, Kreačić, Eleonora, Wang, Haoyu, Yin, Haoteng, Chien, Eli, Potluru, Vamsi K., Li, Pan
Privacy concerns have led to a surge in the creation of synthetic datasets, with diffusion models emerging as a promising avenue. Although prior studies have performed empirical evaluations on these models, there has been a gap in providing a mathematical characterization of their privacy-preserving capabilities. To address this, we present the pioneering theoretical exploration of the privacy preservation inherent in discrete diffusion models (DDMs) for discrete dataset generation. Focusing on per-instance differential privacy (pDP), our framework elucidates the potential privacy leakage for each data point in a given training dataset, offering insights into how the privacy loss of each point correlates with the dataset's distribution. Our bounds also show that training on a dataset of $s$ data points leads to a surge in privacy leakage, from $(\epsilon, O(\frac{1}{s^2\epsilon}))$-pDP to $(\epsilon, O(\frac{1}{s\epsilon}))$-pDP, during the DDM's transition from the pure-noise phase to the synthetic clean-data phase, and that a faster decay in the diffusion coefficients amplifies the privacy guarantee. Finally, we empirically verify our theoretical findings on both synthetic and real-world datasets.
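For reference, the notion of per-instance differential privacy used above can be stated as follows; this is one common formulation, and the notation is illustrative rather than the paper's exact definition.

    % One common formulation of per-instance differential privacy (pDP);
    % notation is illustrative and may differ from the paper's definition.
    A mechanism $\mathcal{M}$ satisfies $(\epsilon,\delta)$-pDP for a data point
    $x$ in a dataset $D$ if, for every measurable output set $S$,
    \[
      \Pr[\mathcal{M}(D) \in S] \le e^{\epsilon}\,\Pr[\mathcal{M}(D \setminus \{x\}) \in S] + \delta,
      \qquad
      \Pr[\mathcal{M}(D \setminus \{x\}) \in S] \le e^{\epsilon}\,\Pr[\mathcal{M}(D) \in S] + \delta.
    \]

In this notation, the bound quoted above says that for a training set of $s$ points the achievable $\delta$ degrades from $O(\frac{1}{s^2\epsilon})$ to $O(\frac{1}{s\epsilon})$ as generation moves from pure noise toward clean synthetic data.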
GraphMaker: Can Diffusion Models Generate Large Attributed Graphs?
Li, Mufei, Kreačić, Eleonora, Potluru, Vamsi K., Li, Pan
Large-scale graphs with node attributes are increasingly common in various real-world applications. Creating synthetic, attribute-rich graphs that mirror real-world examples is crucial, especially for sharing graph data for analysis and for developing learning models when the original data cannot be shared. Traditional graph generation methods are limited in their capacity to handle these complex structures. Recent advances in diffusion models have shown potential in generating graph structures without attributes and smaller molecular graphs. However, these models face challenges in generating large attributed graphs due to the complex attribute-structure correlations and the large size of these graphs. This paper introduces a novel diffusion model, GraphMaker, specifically designed for generating large attributed graphs. We explore various combinations of node attribute and graph structure generation processes, finding that an asynchronous approach more effectively captures the intricate attribute-structure correlations. We also address scalability issues by generating edges in mini-batches. To demonstrate the practicality of our approach in graph data dissemination, we introduce a new evaluation pipeline. The evaluation demonstrates that synthetic graphs generated by GraphMaker can be used to develop competitive graph machine learning models for the tasks defined over the original graphs without actually accessing these graphs, while many leading graph generation methods fall short in this evaluation.
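The asynchronous generation and edge mini-batching mentioned above can be caricatured as follows. This is only an illustrative sketch, not the GraphMaker implementation: attr_denoiser and edge_denoiser stand in for trained denoising networks, and their signatures are assumptions.

    import torch

    def sample_graph(attr_denoiser, edge_denoiser, num_nodes, num_attrs,
                     attr_steps, edge_steps, edge_batch_size=65536):
        # Asynchronous generation: fully denoise node attributes first, then
        # denoise the structure conditioned on the already-generated attributes,
        # rather than refining both in lockstep at every step.
        x = torch.randint(0, 2, (num_nodes, num_attrs)).float()   # start from noise
        for t in reversed(range(attr_steps)):
            x = attr_denoiser(x, t)                                # hypothetical call

        # Edge mini-batching: score candidate node pairs in bounded-size chunks
        # so memory stays manageable even when the graph is large.
        pairs = torch.combinations(torch.arange(num_nodes), r=2)
        e = torch.randint(0, 2, (pairs.size(0),)).float()          # noisy edge states
        for t in reversed(range(edge_steps)):
            for i in range(0, pairs.size(0), edge_batch_size):
                chunk = slice(i, i + edge_batch_size)
                e[chunk] = edge_denoiser(x, pairs[chunk], e[chunk], t)  # hypothetical call
        return x, pairs[e > 0.5]

The design point being illustrated is that attributes are generated before structure, and candidate node pairs are processed in chunks so the quadratic number of pairs never has to be held in memory at once.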
Synthetic Data Applications in Finance
Potluru, Vamsi K., Borrajo, Daniel, Coletta, Andrea, Dalmasso, Niccolò, El-Laham, Yousef, Fons, Elizabeth, Ghassemi, Mohsen, Gopalakrishnan, Sriram, Gosai, Vikesh, Kreačić, Eleonora, Mani, Ganapathy, Obitayo, Saheed, Paramanand, Deepak, Raman, Natraj, Solonin, Mikhail, Sood, Srijan, Vyetrenko, Svitlana, Zhu, Haibei, Veloso, Manuela, Balch, Tucker
Synthetic data has made tremendous strides in various commercial settings including finance, healthcare, and virtual reality. We present a broad overview of prototypical applications of synthetic data in the financial sector and, in particular, provide richer details for a few select ones. These cover a wide variety of data modalities including tabular, time-series, event-series, and unstructured data, arising from both markets and retail financial applications. Since finance is a highly regulated industry, synthetic data is a potential approach for dealing with issues related to privacy, fairness, and explainability. Various metrics are utilized in evaluating the quality and effectiveness of our approaches in these applications. We conclude with open directions in synthetic data in the context of the financial domain.
Differentially Private Synthetic Data Using KD-Trees
Kreačić, Eleonora, Nouri, Navid, Potluru, Vamsi K., Balch, Tucker, Veloso, Manuela
Creation of a synthetic dataset that faithfully represents the data distribution and simultaneously preserves privacy is a major research challenge. Many space-partitioning-based approaches have emerged in recent years for answering statistical queries in a differentially private manner. However, for the synthetic data generation problem, recent research has mainly focused on deep generative models. In contrast, we exploit space partitioning techniques together with noise perturbation and thus achieve intuitive and transparent algorithms. We propose both data-independent and data-dependent algorithms for $\epsilon$-differentially private synthetic data generation whose kernel density resembles that of the real dataset. Additionally, we provide theoretical results on the utility-privacy trade-offs and show how our data-dependent approach overcomes the curse of dimensionality and leads to a scalable algorithm. We show empirical utility improvements over prior work, and discuss the performance of our algorithm on a downstream classification task on a real dataset.
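The general recipe in this abstract (space partitioning, noise perturbation of cell counts, resampling within cells) could look roughly like the sketch below. It is a data-independent illustration under the assumption that the data bounds are public, not the paper's exact algorithm; the data-dependent variant discussed above would instead split adaptively and account for the additional privacy cost of doing so.

    import numpy as np

    def dp_synthetic_data(data, bounds, epsilon, depth=6, rng=None):
        # Illustrative data-independent variant: recursively split a *public*
        # bounding box at coordinate midpoints (the partition itself touches no
        # private data), add Laplace noise to per-cell counts, and resample
        # points uniformly within each cell.
        rng = rng if rng is not None else np.random.default_rng()
        d = data.shape[1]
        lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
        cells = [(lo, hi, data)]
        for level in range(depth):
            axis = level % d
            next_cells = []
            for c_lo, c_hi, pts in cells:
                mid = 0.5 * (c_lo[axis] + c_hi[axis])
                l_hi, r_lo = c_hi.copy(), c_lo.copy()
                l_hi[axis], r_lo[axis] = mid, mid
                next_cells.append((c_lo, l_hi, pts[pts[:, axis] <= mid]))
                next_cells.append((r_lo, c_hi, pts[pts[:, axis] > mid]))
            cells = next_cells

        # Each record falls in exactly one leaf, so the leaf-count vector has
        # L1 sensitivity 1 under add/remove of a record; Laplace(1/epsilon)
        # noise on every count therefore yields epsilon-DP.
        pieces = []
        for c_lo, c_hi, pts in cells:
            noisy = max(0, int(round(len(pts) + rng.laplace(scale=1.0 / epsilon))))
            pieces.append(rng.uniform(c_lo, c_hi, size=(noisy, d)))
        return np.vstack(pieces)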