$\lambda$-Regularized A-Optimal Design and its Approximation by $\lambda$-Regularized Proportional Volume Sampling
Tantipongpipat, Uthaipon
In this work, we study the $\lambda$-regularized $A$-optimal design problem and introduce the $\lambda$-regularized proportional volume sampling algorithm, generalized from [Nikolov, Singh, and Tantipongpipat, 2019], with an approximation guarantee that extends the previous work. In this problem, we are given vectors $v_1,\ldots,v_n\in\mathbb{R}^d$, a budget $k\leq n$, and a regularizer parameter $\lambda\geq0$, and the goal is to find a subset $S\subseteq [n]$ of size $k$ that minimizes the trace of $\left(\sum_{i\in S}v_iv_i^\top + \lambda I_d\right)^{-1}$, where $I_d$ is the $d\times d$ identity matrix. The problem is motivated by optimal design for ridge regression, where one aims to minimize the expected squared error of the ridge regression predictor relative to the true coefficients of the underlying linear model. We introduce $\lambda$-regularized proportional volume sampling and give a polynomial-time implementation that solves this problem. We show that it achieves a $\left(1+\frac{\epsilon}{\sqrt{1+\lambda'}}\right)$-approximation for $k=\Omega\left(\frac d\epsilon+\frac{\log 1/\epsilon}{\epsilon^2}\right)$, where $\lambda'$ is proportional to $\lambda$, extending the previous bound of [Nikolov, Singh, and Tantipongpipat, 2019] to the case $\lambda>0$ and obtaining asymptotic optimality as $\lambda\rightarrow \infty$.
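As a concrete illustration of the objective above, the following is a minimal Python sketch (with hypothetical synthetic data and a naive greedy baseline, not the paper's $\lambda$-regularized proportional volume sampling algorithm) that evaluates $\operatorname{tr}\left(\left(\sum_{i\in S}v_iv_i^\top + \lambda I_d\right)^{-1}\right)$ for a candidate subset $S$ of size $k$.

    import numpy as np

    def a_optimal_objective(V, S, lam):
        """tr((sum_{i in S} v_i v_i^T + lam * I_d)^{-1}) for the rows of V (n x d)."""
        d = V.shape[1]
        idx = list(S)
        M = V[idx].T @ V[idx] + lam * np.eye(d)
        return np.trace(np.linalg.inv(M))

    def greedy_design(V, k, lam):
        """Naive greedy baseline (illustrative only): repeatedly add the vector
        that most reduces the regularized A-optimal objective."""
        S = set()
        while len(S) < k:
            best = min((i for i in range(len(V)) if i not in S),
                       key=lambda i: a_optimal_objective(V, S | {i}, lam))
            S.add(best)
        return S

    rng = np.random.default_rng(0)
    V = rng.standard_normal((50, 5))   # n = 50 hypothetical vectors in d = 5 dimensions
    S = greedy_design(V, k=10, lam=1.0)
    print(sorted(S), a_optimal_objective(V, S, lam=1.0))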
The Price of Fair PCA: One Extra Dimension
Samadi, Samira, Tantipongpipat, Uthaipon, Morgenstern, Jamie H., Singh, Mohit, Vempala, Santosh
We investigate whether the standard dimensionality reduction technique of PCA inadvertently produces data representations with different fidelity for two different populations. We show that on several real-world data sets, PCA has a higher reconstruction error on population A than on B (for example, women versus men or lower- versus higher-educated individuals). This can happen even when the data set has a similar number of samples from A and B. This motivates our study of dimensionality reduction techniques that maintain similar fidelity for A and B. We define the notion of Fair PCA and give a polynomial-time algorithm for finding a low-dimensional representation of the data that is nearly optimal with respect to this measure. Finally, we show on real-world data sets that our algorithm can be used to efficiently generate a fair low-dimensional representation of the data.
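The per-group fidelity comparison described above can be illustrated with a minimal sketch that fits standard PCA on the pooled data and measures the average reconstruction error separately on the two populations. The synthetic populations and function names below are assumptions for illustration only; the sketch measures the disparity but does not implement the paper's Fair PCA algorithm.

    import numpy as np

    def pca_basis(X, r):
        """Rank-r PCA basis (d x r) fitted on the pooled, mean-centered data X (n x d)."""
        _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
        return Vt[:r].T

    def avg_reconstruction_error(X, P, mean):
        """Average squared reconstruction error of the rows of X under the basis P."""
        Xc = X - mean
        residual = Xc - Xc @ P @ P.T
        return np.mean(np.sum(residual ** 2, axis=1))

    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 10))                                        # population A (synthetic stand-in)
    B = rng.standard_normal((200, 10)) @ np.diag(np.linspace(0.5, 2.0, 10))   # population B (synthetic stand-in)
    X = np.vstack([A, B])
    P, mu = pca_basis(X, r=3), X.mean(axis=0)
    print(avg_reconstruction_error(A, P, mu), avg_reconstruction_error(B, P, mu))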
Differential Privacy for Growing Databases
Cummings, Rachel, Krehbiel, Sara, Lai, Kevin A., Tantipongpipat, Uthaipon
The large majority of differentially private algorithms focus on the static setting, where queries are made on an unchanging database. This is unsuitable for the myriad applications involving databases that grow over time. To address this gap in the literature, we consider the dynamic setting, in which new data arrive over time. Previous results in this setting have been limited to answering a single nonadaptive query repeatedly as the database grows [DNPR10, CSS11]. In contrast, we provide tools for richer and more adaptive analysis of growing databases. Our first contribution is a novel modification of the private multiplicative weights algorithm of [HR10], which provides accurate analysis of exponentially many adaptive linear queries (an expressive query class including all counting queries) for a static database. Our modification maintains the accuracy guarantee of the static setting even as the database grows without bound. Our second contribution is a set of general results which show that many other private and accurate algorithms can be immediately extended to the dynamic setting by rerunning them at appropriate points of data growth with minimal loss of accuracy, even when data growth is unbounded.
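The second contribution above, rerunning private algorithms at appropriate points of data growth, can be illustrated with a minimal sketch for a single counting query answered with the Laplace mechanism. The doubling schedule and fixed per-rerun budget below are illustrative assumptions, not the paper's exact scheme or its modified private multiplicative weights algorithm.

    import numpy as np

    def laplace_count(data, predicate, epsilon, rng):
        """Epsilon-DP noisy count of records satisfying the predicate (Laplace mechanism)."""
        true_count = sum(predicate(x) for x in data)
        return true_count + rng.laplace(scale=1.0 / epsilon)

    def growing_release(stream, predicate, epsilon, rng):
        """Rerun the noisy count whenever the database size reaches the next power of two.
        NOTE: each rerun spends additional privacy budget; the paper's general results
        concern how such rerunning can be done with minimal loss of accuracy."""
        data, next_rerun, releases = [], 1, []
        for x in stream:
            data.append(x)
            if len(data) >= next_rerun:
                releases.append((len(data), laplace_count(data, predicate, epsilon, rng)))
                next_rerun *= 2        # doubling schedule: O(log n) reruns in total
        return releases

    rng = np.random.default_rng(2)
    stream = rng.integers(0, 100, size=1000)
    print(growing_release(stream, lambda x: int(x) < 50, epsilon=0.5, rng=rng))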