
Collaborating Authors

 Li, Ximing


Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms

arXiv.org Artificial Intelligence

Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of the synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite its promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound on the total variation (TV) accuracy that holds for every ε-DP synthetic data generator. In recent years, the problem of privacy-preserving data analysis has become increasingly important, and differential privacy (Dwork et al., 2006) has emerged as the foundation of data privacy. DP techniques are widely adopted by industrial companies and the U.S. Census Bureau (Johnson et al., 2017; Erlingsson et al., 2014; Nguyên et al., 2016; The U.S. Census Bureau, 2020; Abowd, 2018).
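
As a rough illustration of the noise-injection step described above, the sketch below perturbs a single low-dimensional marginal (a contingency table over a small attribute subset) with the Laplace mechanism. The attribute handling, the add/remove-one-record sensitivity of 1, and the budget-splitting comment are illustrative assumptions, not the exact mechanism analyzed in the paper.

```python
import numpy as np
import pandas as pd

def noisy_marginal(df: pd.DataFrame, attrs, epsilon: float) -> pd.Series:
    """Laplace-perturbed marginal distribution over a small attribute subset.

    Assumes add/remove-one-record neighboring datasets, so the L1 sensitivity
    of the marginal's count vector is 1 and the noise scale is 1 / epsilon.
    Cells that never occur in the data are omitted for brevity.
    """
    counts = df.groupby(list(attrs)).size()                 # low-dimensional contingency table
    noise = np.random.laplace(scale=1.0 / epsilon, size=len(counts))
    noisy = np.clip(counts.to_numpy() + noise, 0.0, None)   # project back to non-negative counts
    return pd.Series(noisy / noisy.sum(), index=counts.index)

# Hypothetical usage: spend a total budget across several 2-way marginals.
# marginals = [noisy_marginal(df, pair, epsilon_total / len(pairs)) for pair in pairs]
```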


Learning with Partial Labels from Semi-supervised Perspective

arXiv.org Artificial Intelligence

Partial Label (PL) learning refers to the task of learning from partially labeled data, where each training instance is ambiguously equipped with a set of candidate labels of which only one is valid. Advances in the recent deep PL learning literature have shown that deep learning paradigms, e.g., self-training, contrastive learning, or class activation values, can achieve promising performance. Inspired by the impressive success of deep Semi-Supervised (SS) learning, we transform the PL learning problem into an SS learning problem and propose a novel PL learning method, namely Partial Label learning with Semi-supervised Perspective (PLSP). Specifically, we first form the pseudo-labeled dataset by selecting a small number of reliable pseudo-labeled instances with high-confidence prediction scores and treating the remaining instances as pseudo-unlabeled ones. Then we design an SS learning objective, consisting of a supervised loss for pseudo-labeled instances and a semantic consistency regularization for pseudo-unlabeled instances. We further introduce a complementary regularization for the non-candidate labels to constrain the model predictions on them to be as small as possible. Empirical results demonstrate that PLSP significantly outperforms existing PL baseline methods, especially at high ambiguity levels. Code available: https://github.com/changchunli/PLSP.
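
For concreteness, here is a minimal sketch of a three-term objective in the spirit of the one described above: a supervised loss on high-confidence pseudo-labeled instances, a consistency term between two augmented views for the remaining pseudo-unlabeled instances, and a complementary term that suppresses predictions on non-candidate labels. The confidence threshold, the KL-based consistency, and all tensor names are assumptions for illustration, not PLSP's exact losses.

```python
import torch
import torch.nn.functional as F

def plsp_style_loss(logits_weak, logits_strong, candidate_mask, conf_threshold=0.95):
    """Three-term objective in the spirit of PLSP (illustrative, not the paper's exact losses).

    logits_weak, logits_strong: model outputs for two augmented views, shape (N, C).
    candidate_mask:             binary (N, C) mask of candidate labels.
    """
    candidate_mask = candidate_mask.float()
    zero = logits_weak.sum() * 0.0                        # graph-connected zero for empty selections

    # Candidate-restricted predictions -> pseudo-labels with confidences.
    probs = torch.softmax(logits_weak, dim=1) * candidate_mask
    probs = probs / probs.sum(dim=1, keepdim=True)
    conf, pseudo = probs.max(dim=1)
    labeled = conf > conf_threshold                       # high-confidence instances act as pseudo-labeled

    # (1) Supervised loss on pseudo-labeled instances.
    sup = F.cross_entropy(logits_strong[labeled], pseudo[labeled]) if labeled.any() else zero

    # (2) Semantic consistency on pseudo-unlabeled instances: make the two views agree.
    p_weak = torch.softmax(logits_weak[~labeled], dim=1).detach()
    log_p_strong = torch.log_softmax(logits_strong[~labeled], dim=1)
    cons = F.kl_div(log_p_strong, p_weak, reduction="batchmean") if (~labeled).any() else zero

    # (3) Complementary regularization: suppress probability mass on non-candidate labels.
    comp = (torch.softmax(logits_strong, dim=1) * (1 - candidate_mask)).sum(dim=1).mean()

    return sup + cons + comp
```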


Weakly Supervised Prototype Topic Model with Discriminative Seed Words: Modifying the Category Prior by Self-exploring Supervised Signals

arXiv.org Artificial Intelligence

Dataless text classification, a new paradigm of weakly supervised learning, refers to the task of learning with unlabeled documents and a few predefined representative words per category, known as seed words. Recent generative dataless methods construct document-specific category priors by using seed word occurrences only; however, such category priors often contain very limited and even noisy supervised signals. To remedy this problem, in this paper we propose a novel formulation of the category prior. First, for each document, we compute its label membership degree not only by counting seed word occurrences, but also with a novel prototype scheme, which captures pseudo-nearest neighboring categories. Second, for each label, we consider the frequency prior knowledge of the corpus, which also provides discriminative knowledge for classification. By incorporating the proposed category prior into the previous generative dataless method, we develop a novel generative dataless method, namely the Weakly Supervised Prototype Topic Model (WSPTM). Experimental results on real-world datasets demonstrate that WSPTM outperforms existing baseline methods.
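
A toy sketch of how such a document-specific category prior could be assembled from the three ingredients mentioned above (seed-word occurrences, a prototype-style similarity, and a corpus-level label frequency prior) is given below. The combination rule, the cosine similarity, and all names are hypothetical and do not reproduce WSPTM's generative formulation.

```python
import numpy as np

def category_prior(doc_bow, seed_word_ids, prototypes, label_freq, alpha=1.0):
    """Illustrative document-specific category prior (not WSPTM's exact formula).

    doc_bow:       (V,) bag-of-words counts for one document.
    seed_word_ids: list of length C; seed-word vocabulary ids per category.
    prototypes:    (C, V) pseudo category prototypes (e.g., averaged seed-word vectors).
    label_freq:    (C,) estimated corpus-level label frequencies.
    """
    seed_counts = np.array([doc_bow[ids].sum() for ids in seed_word_ids])  # seed-word occurrences
    sims = prototypes @ doc_bow                                            # prototype similarity (cosine)
    sims = sims / (np.linalg.norm(prototypes, axis=1) * np.linalg.norm(doc_bow) + 1e-12)
    membership = seed_counts + alpha * sims                                # combine both supervised signals
    prior = membership * label_freq                                        # weight by the label frequency prior
    return prior / (prior.sum() + 1e-12)
```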


Variational Wasserstein Barycenters with c-Cyclical Monotonicity

arXiv.org Machine Learning

Summarizing, combining and comparing probability distributions defined on a metric space are fundamental tasks in machine learning, statistics and computer science, arising, for example, in settings with multiple sensors and in Bayesian inference. For instance, in Bayesian inference one runs a posterior sampling algorithm in parallel on different machines using small subsets of the massive data, and then aggregates the subset posterior distributions via their barycenter as an approximation to the true posterior for the full data [1, 2]. Beyond Bayesian inference, the average or barycenter of a collection of distributions has been successfully applied in various machine learning applications, such as image processing [3] and clustering [4, 5]. The theory of optimal transport (OT) [6-9] provides a powerful framework to carry out such comparisons. OT equips the space of distributions with a distance metric known as the Wasserstein distance, which has gained substantial popularity in different fields, leading in particular to the natural consideration of barycenters. The barycenter of multiple given probability distributions under the Wasserstein distance is defined as a distribution minimizing the sum of Wasserstein distances to all of them. Due to the geometric properties of the Wasserstein distance, the Wasserstein barycenter can better capture the underlying geometric structure than barycenters with respect to other popular distances, e.g., the Euclidean distance; see Figure 1. As a result, Wasserstein barycenters have a broad range of applications in text mixing [3], imaging [2, 10, 11], and model ensembling [12].
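
Because the fixed-support, entropic-regularized barycenter admits a compact iterative scheme, a minimal NumPy sketch is included below (iterative Bregman projections in the spirit of Benamou et al., 2015). The common support, the cost matrix, and the regularization strength are assumptions, and this is not the variational, c-cyclically monotone approach proposed in the paper.

```python
import numpy as np

def wasserstein_barycenter(A, M, reg=1e-2, weights=None, n_iter=200):
    """Entropic-regularized Wasserstein barycenter on a fixed support.

    A: (n, k) column-stacked input distributions on a common support of size n.
    M: (n, n) ground cost matrix; reg: entropic regularization strength.
    """
    n, k = A.shape
    weights = np.full(k, 1.0 / k) if weights is None else weights
    K = np.exp(-M / reg)                     # Gibbs kernel
    u = np.ones((n, k))
    for _ in range(n_iter):
        v = A / (K.T @ u)                    # scaling toward each input marginal
        Kv = K @ v
        b = np.exp(np.log(Kv) @ weights)     # weighted geometric mean = current barycenter
        u = b[:, None] / Kv                  # scaling toward the barycenter marginal
    return b / b.sum()

# Hypothetical usage on a shared 1-D grid:
# x = np.linspace(0, 1, 100); M = (x[:, None] - x[None, :]) ** 2
# bary = wasserstein_barycenter(np.stack([a1, a2], axis=1), M)
```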


Recovering Accurate Labeling Information from Partially Valid Data for Effective Multi-Label Learning

arXiv.org Machine Learning

Partial Multi-label Learning (PML) aims to induce a multi-label predictor from datasets with noisy supervision, where each training instance is associated with several candidate labels of which only a subset is valid. To address the noise, existing PML methods basically recover the ground-truth labels by leveraging the ground-truth confidence of each candidate label, i.e., the likelihood of a candidate label being a ground-truth one. However, they neglect the information from non-candidate labels, which potentially contributes to ground-truth label recovery. In this paper, we propose to recover the ground-truth labels, i.e., to estimate the ground-truth confidences, from the label enrichment, composed of the relevance degrees of candidate labels and the irrelevance degrees of non-candidate labels. Based on this observation, we further develop a novel two-stage PML method, namely Partial Multi-Label Learning with Label Enrichment-Recovery: in the first stage, it estimates the label enrichment with unconstrained label propagation, and then it jointly learns the ground-truth confidences and the multi-label predictor given the label enrichment. Experimental results validate that the proposed method outperforms state-of-the-art PML methods.
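
As an illustration of the first stage only, the sketch below propagates a signed candidate/non-candidate initialization over a kNN similarity graph to obtain a real-valued enrichment matrix, whose positive entries act as relevance degrees of candidate labels and whose negative entries act as irrelevance degrees of non-candidate labels. The graph construction, the signed initialization, and the update rule are assumptions and do not reproduce the paper's unconstrained label propagation.

```python
import numpy as np

def label_enrichment(X, candidate, k=10, alpha=0.9, n_iter=50):
    """Illustrative label propagation for PML label enrichment (not the paper's exact scheme).

    X:         (N, d) instance features.
    candidate: (N, C) binary candidate-label matrix.
    Returns a real-valued (N, C) enrichment matrix.
    """
    # kNN similarity graph with a Gaussian kernel (construction is an assumption).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (np.median(d2) + 1e-12))
    np.fill_diagonal(W, 0.0)
    drop = np.argsort(-W, axis=1)[:, k:]                  # keep only the k nearest neighbors per row
    np.put_along_axis(W, drop, 0.0, axis=1)
    W = (W + W.T) / 2
    S = W / (W.sum(axis=1, keepdims=True) + 1e-12)        # row-normalized propagation matrix

    F0 = np.where(candidate == 1, 1.0, -1.0)              # signed start: +1 candidates, -1 non-candidates
    F = F0.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * F0            # smooth over the graph, anchor to the initialization
    return F
```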