AITopics | Zhao, Yunpeng

Collaborating Authors

Zhao, Yunpeng

Does Training with Synthetic Data Truly Protect Privacy?

Zhao, Yunpeng, Zhang, Jie

arXiv.org Artificial Intelligence, Feb-18-2025

As synthetic data becomes increasingly popular in machine learning tasks, numerous methods (without formal differential privacy guarantees) use synthetic data for training. These methods often claim, either explicitly or implicitly, to protect the privacy of the original training data. In this work, we explore four different training paradigms: coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models. While all these methods utilize synthetic data for training, they lead to vastly different conclusions regarding privacy preservation. We caution that empirical approaches to preserving data privacy require careful and rigorous evaluation; otherwise, they risk providing a false sense of privacy.

Synthetic data is increasingly utilized for training machine learning (ML) models, especially in situations where real-world data is scarce, sensitive, costly to obtain, or subject to regulations such as GDPR (GDPR.eu). Synthetic data is particularly beneficial in scenarios where data distributions are atypical, such as in federated learning with non-IID data (Zhang et al., 2023c), long-tailed learning (Shin et al., 2023), and continual learning (Meng et al., 2024). It enables the creation of diverse datasets that include edge cases or rare events that may be underrepresented in real-world data. Consequently, training models with synthetic data has proven beneficial for enhancing model robustness and adaptability across a wide range of real-world scenarios. Many empirical methods (without formal differential privacy guarantees) rely on synthetic data for training, such as coreset selection (Feldman, 2020), dataset distillation (Wang et al., 2018), data-free knowledge distillation (Yin et al., 2020), and synthetic data generated from diffusion models (Yuan et al., 2024).

This proxy data can be directly sampled from private sources (Guo et al., 2022; Mirzasoleiman et al., 2020) or out-of-distribution sources (Wang et al., 2023), iteratively optimized (Zhang et al., 2023d; Zhao et al., 2020), or generated using GANs (Karras et al., 2019) and diffusion models (Rombach et al., 2022). Since the model may never encounter any private training data and the synthetic images are often visually distinct from the original private data, these methods often claim to preserve privacy while still maintaining satisfactory performance. In this work, we aim to address the following question: Does training with synthetic data truly protect privacy? To rigorously measure the privacy leakage of empirical methods trained on synthetic data, we use membership inference attacks (Shokri et al., 2017) as a privacy auditing tool.
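
To illustrate the auditing idea, the sketch below shows one of the simplest membership inference attacks, a loss-threshold attack: an example is guessed to be a training member when the model's loss on it falls below a threshold. This is an illustration only. The excerpt does not say which attack the authors use, and the function names, loss values, and threshold here are hypothetical.

```python
def loss_threshold_attack(losses, threshold):
    """Guess 'member' (True) when the per-example loss is below the threshold."""
    return [loss < threshold for loss in losses]


def attack_rates(member_losses, nonmember_losses, threshold):
    """Return (true positive rate, false positive rate) of the attack."""
    true_positives = sum(loss_threshold_attack(member_losses, threshold))
    false_positives = sum(loss_threshold_attack(nonmember_losses, threshold))
    return (true_positives / len(member_losses),
            false_positives / len(nonmember_losses))


# Hypothetical per-example losses: training members tend to have lower loss.
member_losses = [0.05, 0.10, 0.20, 0.90]
nonmember_losses = [0.30, 0.80, 1.20, 1.50]

tpr, fpr = attack_rates(member_losses, nonmember_losses, threshold=0.25)
print(tpr, fpr)
```

A large gap between the true positive rate and the false positive rate suggests that the model leaks information about which examples were in its training set.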

artificial intelligence, machine learning, synthetic data, (16 more...)

2502.12976

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)

Community Detection with Heterogeneous Block Covariance Model

Li, Xiang, Zhao, Yunpeng, Pan, Qing, Hao, Ning

arXiv.org Machine Learning, Dec-4-2024

Community detection is the task of clustering objects based on their pairwise relationships. Most model-based community detection methods, such as the stochastic block model and its variants, are designed for networks with binary (yes/no) edges. In many practical scenarios, however, edges possess continuous weights, spanning positive and negative values, which reflect varying levels of connectivity. To address this challenge, we introduce the heterogeneous block covariance model (HBCM), which defines a community structure within the covariance matrix, where edges have signed and continuous weights. Furthermore, it takes into account the heterogeneity of objects when forming connections with other objects within a community. A novel variational expectation-maximization algorithm is proposed to estimate the group membership. The HBCM provides provably consistent estimates of memberships, and its promising performance is observed in numerical simulations with different setups. The model is applied to a single-cell RNA-seq dataset of a mouse embryo and a stock price dataset. Supplementary materials for this article are available online.

artificial intelligence, data mining, machine learning, (17 more...)

2412.0378

Country: North America > United States (0.92)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Banking & Finance > Trading (0.88)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks

Zhao, Yunpeng, Hao, Ning, Zhu, Ji

arXiv.org Machine Learning, Jun-16-2022

Biclustering on bipartite graphs is an unsupervised learning task that simultaneously clusters the two types of objects in the graph, for example, users and movies in a movie review dataset. The latent block model (LBM) has been proposed as a model-based tool for biclustering. Biclustering results by the LBM are, however, usually dominated by the row and column sums of the data matrix, i.e., degrees. We propose a degree-corrected latent block model (DC-LBM) to accommodate degree heterogeneity in row and column clusters, which greatly outperforms the classical LBM on the MovieLens dataset and simulated data. We develop an efficient variational expectation-maximization algorithm by observing that the row and column degrees maximize the objective function in the M step given any probability assignment on the cluster labels. We prove the label consistency of the variational estimator under the DC-LBM, which allows the expected graph density to go to zero as long as the average expected degrees of rows and columns go to infinity.

artificial intelligence, degree-corrected latent block model, machine learning, (2 more...)

2206.08465

Genre: Research Report (0.40)

Industry:

Media > Film (0.87)
Leisure & Entertainment (0.53)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.87)

Understanding Perceptions and Attitudes in Breast Cancer Discussions on Twitter

Modave, Francois, Zhao, Yunpeng, Krieger, Janice, He, Zhe, Guo, Yi, Huo, Jinhai, Prosperi, Mattia, Bian, Jiang

arXiv.org Machine Learning, May-22-2019

Among American women, the rate of breast cancer is second only to that of lung cancer. An estimated 12.4% of women will develop breast cancer over the course of their lifetime. The widespread use of social media across the socio-economic spectrum offers unparalleled ways to facilitate information sharing, in particular as it pertains to health. Social media is also used by many healthcare stakeholders, ranging from government agencies to the healthcare industry, to disseminate health information and to engage patients. The purpose of this study is to investigate people's perceptions and attitudes related to breast cancer, especially those that are related to physical activities, on Twitter. To achieve this, we first identified and collected tweets related to breast cancer, and then used topic modeling and sentiment analysis techniques to understand discussion themes and quantify Twitter users' perceptions and emotions with respect to breast cancer in order to answer five research questions.

deep learning, neural network, tweet, (22 more...)

1905.12469

Country: North America > United States > Florida > Alachua County > Gainesville (0.29)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.88)

Industry: Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Logistic Regression Augmented Community Detection for Network Data with Application in Identifying Autism-Related Gene Pathways

Zhao, Yunpeng, Pan, Qing, Du, Chengan

arXiv.org Machine Learning, Sep-6-2018

When searching for gene pathways leading to specific disease outcomes, additional information on gene characteristics is often available. This information may help differentiate genes related to the disease from irrelevant background genes when connections involving both types of genes are observed and their relationships to the disease are unknown. We propose a method to single out irrelevant background genes with the help of auxiliary information through a logistic regression, and to cluster relevant genes into cohesive groups using the adjacency matrix. The expectation-maximization algorithm is modified to maximize a joint pseudo-likelihood assuming latent indicators for relevance to the disease and latent group memberships, as well as Poisson or multinomial distributed link numbers within and between groups. A robust version allowing arbitrary linkage patterns within the background is further derived. Asymptotic consistency of label assignments under the stochastic blockmodel is proven. Superior performance and robustness in finite samples are observed in simulation studies. The proposed robust method identifies previously missed gene sets underlying autism-related neurological diseases using diverse data sources including de novo mutations, gene expressions, and protein-protein interactions.

bayesian inference, community detection, neurology, (21 more...)

1809.02262

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Neurology > Autism (0.71)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.63)
(2 more...)

A Survey on Theoretical Advances of Community Detection in Networks

Zhao, Yunpeng

arXiv.org Machine Learning, Aug-25-2018

Real-world networks usually have community structure, that is, nodes are grouped into densely connected communities. Community detection is one of the most popular and best-studied research topics in network science and has attracted attention in many different fields, including computer science, statistics, and the social sciences, among others. Numerous approaches for community detection have been proposed in the literature, from ad-hoc algorithms to systematic model-based approaches. The large number of available methods leads to a fundamental question: whether a certain method can provide consistent estimates of community labels. The stochastic blockmodel (SBM) and its variants provide a convenient framework for the study of such problems. This article is a survey on the recent theoretical advances of community detection. The authors review a number of community detection methods and their theoretical properties, including graph cut methods, profile likelihoods, the pseudo-likelihood method, the variational method, belief propagation, spectral clustering, and semidefinite relaxations of the SBM. The authors also briefly discuss other research topics in community detection such as robust community detection, community detection with nodal covariates, and model selection, as well as suggest a few possible directions for future research.
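
As a concrete reference point for the SBM, the sketch below samples an undirected network in which each pair of nodes is connected independently with a probability that depends only on the two nodes' community labels. It follows the standard textbook definition of the model. The function name `sample_sbm` and its parameters are illustrative and are not taken from the survey.

```python
import random


def sample_sbm(labels, block_probs, seed=0):
    """Sample a symmetric 0/1 adjacency matrix from a stochastic blockmodel.

    labels[i] is the community of node i; block_probs[a][b] is the
    probability of an edge between a node in community a and one in b.
    """
    rng = random.Random(seed)
    n = len(labels)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair once, no self-loops
            if rng.random() < block_probs[labels[i]][labels[j]]:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency


# Two communities with dense within-group and sparse between-group edges.
labels = [0, 0, 0, 1, 1, 1]
block_probs = [[0.9, 0.1],
               [0.1, 0.9]]
adjacency = sample_sbm(labels, block_probs, seed=42)
for row in adjacency:
    print(row)
```

Community detection then asks the reverse question: given only `adjacency`, recover `labels` up to a relabeling of the communities.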

community detection, health & medicine, survey article, (19 more...)

doi: 10.1002/wics.1403

1809.07691

Country:

North America > United States (0.46)
Europe > United Kingdom > England (0.14)

Genre:

Research Report (0.50)
Overview (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

On Consistency of Graph-based Semi-supervised Learning

Du, Chengan, Zhao, Yunpeng

arXiv.org Machine Learning, Mar-17-2017

Graph-based semi-supervised learning is one of the most popular methods in machine learning. Some of its theoretical properties, such as bounds for the generalization error and the convergence of the graph Laplacian regularizer, have been studied in the computer science and statistics literature. However, a fundamental statistical property, the consistency of the estimator from this method, has not been proved. In this article, we study the consistency problem under a non-parametric framework. We prove the consistency of graph-based learning in the case that the estimated scores are enforced to be equal to the observed responses for the labeled data. The sample sizes of both labeled and unlabeled data are allowed to grow in this result. When the estimated scores are not required to be equal to the observed responses, a tuning parameter is used to balance the loss function and the graph Laplacian regularizer. We give a counterexample demonstrating that the estimator for this case can be inconsistent. The theoretical findings are supported by numerical studies.
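
The hard-constraint case can be sketched as follows, assuming the common formulation in which scores on labeled nodes are fixed to their observed responses and scores on unlabeled nodes minimize the graph Laplacian quadratic form. Under that assumption each unlabeled score equals the weighted average of its neighbors' scores, which the code reaches by repeated averaging. The function name, the toy graph, and the iteration count are hypothetical, and this may differ from the exact estimator analyzed in the paper.

```python
def harmonic_scores(weights, labels, n_iter=500):
    """Estimate scores on a graph with labeled scores held fixed.

    weights is a symmetric matrix with zero diagonal; labels maps a
    labeled node index to its observed response.
    """
    n = len(weights)
    scores = [labels.get(i, 0.0) for i in range(n)]
    for _ in range(n_iter):
        for i in range(n):
            if i in labels:
                continue  # hard constraint: labeled scores never change
            degree = sum(weights[i])
            if degree > 0:
                # Replace the score by the weighted average of its neighbors.
                scores[i] = sum(weights[i][j] * scores[j] for j in range(n)) / degree
    return scores


# Path graph 0 - 1 - 2 - 3 with the two end nodes labeled.
weights = [[0, 1, 0, 0],
           [1, 0, 1, 0],
           [0, 1, 0, 1],
           [0, 0, 1, 0]]
scores = harmonic_scores(weights, labels={0: 0.0, 3: 1.0})
print(scores)
```

On this path graph the unlabeled scores interpolate linearly between the two labeled ends, approaching 1/3 and 2/3.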

artificial intelligence, criterion, machine learning, (13 more...)

1703.06177

Country:

North America > United States (0.14)
North America > Canada (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Link prediction for partially observed networks

Zhao, Yunpeng, Levina, Elizaveta, Zhu, Ji

arXiv.org Machine Learning, Jan-29-2013

Link prediction is one of the fundamental problems in network analysis. In many applications, notably in genetics, a partially observed network may not contain any negative examples of absent edges, which creates a difficulty for many existing supervised learning approaches. We develop a new method which treats the observed network as a sample of the true network with different sampling rates for positive and negative examples. We obtain a relative ranking of potential links by their probabilities, utilizing information on node covariates as well as on network topology. Empirically, the method performs well under many settings, including when the observed network is sparse. We apply the method to a protein-protein interaction network and a school friendship network.

health & medicine, inductive learning, probability, (18 more...)

1301.7047

Country: North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.49)
Education (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.90)