Pagh, Rasmus
Tensor Sketch: Fast and Scalable Polynomial Kernel Approximation
Pham, Ninh, Pagh, Rasmus
Approximation of non-linear kernels using random feature maps has become a powerful technique for scaling kernel methods to large datasets. We propose $\textit{Tensor Sketch}$, an efficient random feature map for approximating polynomial kernels. Given $n$ training samples in $\mathbb{R}^d$ Tensor Sketch computes low-dimensional embeddings in $\mathbb{R}^D$ in time $\mathcal{O}\left( n(d+D \log{D}) \right)$ making it well-suited for high-dimensional and large-scale settings. We provide theoretical guarantees on the approximation error, ensuring the fidelity of the resulting kernel function estimates. We also discuss extensions and highlight applications where Tensor Sketch serves as a central computational tool.
- Asia > Afghanistan > Parwan Province > Charikar (0.05)
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- (2 more...)
- Overview (1.00)
- Research Report (0.64)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Optimal Bounds for Private Minimum Spanning Trees via Input Perturbation
Pagh, Rasmus, Retschmeier, Lukas, Wu, Hao, Zhang, Hanwen
We study the problem of privately releasing an approximate minimum spanning tree (MST). Given a graph $G = (V, E, \vec{W})$ where $V$ is a set of $n$ vertices, $E$ is a set of $m$ undirected edges, and $ \vec{W} \in \mathbb{R}^{|E|} $ is an edge-weight vector, our goal is to publish an approximate MST under edge-weight differential privacy, as introduced by Sealfon in PODS 2016, where $V$ and $E$ are considered public and the weight vector is private. Our neighboring relation is $\ell_\infty$-distance on weights: for a sensitivity parameter $\Delta_\infty$, graphs $ G = (V, E, \vec{W}) $ and $ G' = (V, E, \vec{W}') $ are neighboring if $\|\vec{W}-\vec{W}'\|_\infty \leq \Delta_\infty$. Existing private MST algorithms face a trade-off, sacrificing either computational efficiency or accuracy. We show that it is possible to get the best of both worlds: With a suitable random perturbation of the input that does not suffice to make the weight vector private, the result of any non-private MST algorithm will be private and achieves a state-of-the-art error guarantee. Furthermore, by establishing a connection to Private Top-k Selection [Steinke and Ullman, FOCS '17], we give the first privacy-utility trade-off lower bound for MST under approximate differential privacy, demonstrating that the error magnitude, $\tilde{O}(n^{3/2})$, is optimal up to logarithmic factors. That is, our approach matches the time complexity of any non-private MST algorithm and at the same time achieves optimal error. We complement our theoretical treatment with experiments that confirm the practicality of our approach.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > Denmark > Capital Region > Copenhagen (0.05)
- (9 more...)
Streaming Private Continual Counting via Binning
Andersson, Joel Daniel, Pagh, Rasmus
In differential privacy, $\textit{continual observation}$ refers to problems in which we wish to continuously release a function of a dataset that is revealed one element at a time. The challenge is to maintain a good approximation while keeping the combined output over all time steps differentially private. In the special case of $\textit{continual counting}$ we seek to approximate a sum of binary input elements. This problem has received considerable attention lately, in part due to its relevance in implementations of differentially private stochastic gradient descent. $\textit{Factorization mechanisms}$ are the leading approach to continual counting, but the best such mechanisms do not work well in $\textit{streaming}$ settings since they require space proportional to the size of the input. In this paper, we present a simple approach to approximating factorization mechanisms in low space via $\textit{binning}$, where adjacent matrix entries with similar values are changed to be identical in such a way that a matrix-vector product can be maintained in sublinear space. Our approach has provable sublinear space guarantees for a class of lower triangular matrices whose entries are monotonically decreasing away from the diagonal. We show empirically that even with very low space usage we are able to closely match, and sometimes surpass, the performance of asymptotically optimal factorization mechanisms. Recently, and independently of our work, Dvijotham et al. have also suggested an approach to implementing factorization mechanisms in a streaming setting. Their work differs from ours in several respects: It only addresses factorization into $\textit{Toeplitz}$ matrices, only considers $\textit{maximum}$ error, and uses a different technique based on rational function approximation that seems less versatile than our binning approach.
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Nevada (0.04)
- (2 more...)
- Research Report (0.63)
- Workflow (0.46)
Faster Private Minimum Spanning Trees
Pagh, Rasmus, Retschmeier, Lukas
Motivated by applications in clustering and synthetic data generation, we consider the problem of releasing a minimum spanning tree (MST) under edge-weight differential privacy constraints where a graph topology $G=(V,E)$ with $n$ vertices and $m$ edges is public, the weight matrix $\vec{W}\in \mathbb{R}^{n \times n}$ is private, and we wish to release an approximate MST under $\rho$-zero-concentrated differential privacy. Weight matrices are considered neighboring if they differ by at most $\Delta_\infty$ in each entry, i.e., we consider an $\ell_\infty$ neighboring relationship. Existing private MST algorithms either add noise to each entry in $\vec{W}$ and estimate the MST by post-processing or add noise to weights in-place during the execution of a specific MST algorithm. Using the post-processing approach with an efficient MST algorithm takes $O(n^2)$ time on dense graphs but results in an additive error on the weight of the MST of magnitude $O(n^2\log n)$. In-place algorithms give asymptotically better utility, but the running time of existing in-place algorithms is $O(n^3)$ for dense graphs. Our main result is a new differentially private MST algorithm that matches the utility of existing in-place methods while running in time $O(m + n^{3/2}\log n)$ for fixed privacy parameter $\rho$. The technical core of our algorithm is an efficient sublinear time simulation of Report-Noisy-Max that works by discretizing all edge weights to a multiple of $\Delta_\infty$ and forming groups of edges with identical weights. Specifically, we present a data structure that allows us to sample a noisy minimum weight edge among at most $O(n^2)$ cut edges in $O(\sqrt{n} \log n)$ time. Experimental evaluations support our claims that our algorithm significantly improves previous algorithms either in utility or running time.
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- North America > United States > Virginia > Alexandria County > Alexandria (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
Daisy Bloom Filters
Bercea, Ioana O., Houen, Jakob Bæk Tejs, Pagh, Rasmus
A filter is a widely used data structure for storing an approximation of a given set $S$ of elements from some universe $U$ (a countable set).It represents a superset $S'\supseteq S$ that is ''close to $S$'' in the sense that for $x\not\in S$, the probability that $x\in S'$ is bounded by some $\varepsilon > 0$. The advantage of using a Bloom filter, when some false positives are acceptable, is that the space usage becomes smaller than what is required to store $S$ exactly. Though filters are well-understood from a worst-case perspective, it is clear that state-of-the-art constructions may not be close to optimal for particular distributions of data and queries. Suppose, for instance, that some elements are in $S$ with probability close to 1. Then it would make sense to always include them in $S'$, saving space by not having to represent these elements in the filter. Questions like this have been raised in the context of Weighted Bloom filters (Bruck, Gao and Jiang, ISIT 2006) and Bloom filter implementations that make use of access to learned components (Vaidya, Knorr, Mitzenmacher, and Krask, ICLR 2021). In this paper, we present a lower bound for the expected space that such a filter requires. We also show that the lower bound is asymptotically tight by exhibiting a filter construction that executes queries and insertions in worst-case constant time, and has a false positive rate at most $\varepsilon $ with high probability over input sets drawn from a product distribution. We also present a Bloom filter alternative, which we call the $\textit{Daisy Bloom filter}$, that executes operations faster and uses significantly less space than the standard Bloom filter.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Europe > Italy > Lazio > Rome (0.04)
- (10 more...)
PLAN: Variance-Aware Private Mean Estimation
Aumüller, Martin, Lebeda, Christian Janos, Nelson, Boel, Pagh, Rasmus
Differentially private mean estimation is an important building block in privacy-preserving algorithms for data analysis and machine learning. Though the trade-off between privacy and utility is well understood in the worst case, many datasets exhibit structure that could potentially be exploited to yield better algorithms. In this paper we present $\textit{Private Limit Adapted Noise}$ (PLAN), a family of differentially private algorithms for mean estimation in the setting where inputs are independently sampled from a distribution $\mathcal{D}$ over $\mathbf{R}^d$, with coordinate-wise standard deviations $\boldsymbol{\sigma} \in \mathbf{R}^d$. Similar to mean estimation under Mahalanobis distance, PLAN tailors the shape of the noise to the shape of the data, but unlike previous algorithms the privacy budget is spent non-uniformly over the coordinates. Under a concentration assumption on $\mathcal{D}$, we show how to exploit skew in the vector $\boldsymbol{\sigma}$, obtaining a (zero-concentrated) differentially private mean estimate with $\ell_2$ error proportional to $\|\boldsymbol{\sigma}\|_1$. Previous work has either not taken $\boldsymbol{\sigma}$ into account, or measured error in Mahalanobis distance $\unicode{x2013}$ in both cases resulting in $\ell_2$ error proportional to $\sqrt{d}\|\boldsymbol{\sigma}\|_2$, which can be up to a factor $\sqrt{d}$ larger. To verify the effectiveness of PLAN, we empirically evaluate accuracy on both synthetic and real world data.
- Europe > Austria > Vienna (0.14)
- Europe > Denmark > Capital Region > Copenhagen (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (3 more...)
A Smooth Binary Mechanism for Efficient Private Continual Observation
Andersson, Joel Daniel, Pagh, Rasmus
In privacy under continual observation we study how to release differentially private estimates based on a dataset that evolves over time. The problem of releasing private prefix sums of $x_1,x_2,x_3,\dots \in\{0,1\}$ (where the value of each $x_i$ is to be private) is particularly well-studied, and a generalized form is used in state-of-the-art methods for private stochastic gradient descent (SGD). The seminal binary mechanism privately releases the first $t$ prefix sums with noise of variance polylogarithmic in $t$. Recently, Henzinger et al. and Denisov et al. showed that it is possible to improve on the binary mechanism in two ways: The variance of the noise can be reduced by a (large) constant factor, and also made more even across time steps. However, their algorithms for generating the noise distribution are not as efficient as one would like in terms of computation time and (in particular) space. We address the efficiency problem by presenting a simple alternative to the binary mechanism in which 1) generating the noise takes constant average time per value, 2) the variance is reduced by a factor about 4 compared to the binary mechanism, and 3) the noise distribution at each step is identical. Empirically, a simple Python implementation of our approach outperforms the running time of the approach of Henzinger et al., as well as an attempt to improve their algorithm using high-performance algorithms for multiplication with Toeplitz matrices.
- Europe > Denmark > Capital Region > Copenhagen (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
CountSketches, Feature Hashing and the Median of Three
Larsen, Kasper Green, Pagh, Rasmus, Tětek, Jakub
In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t-1) s$, where $t, s > 0$ are integer parameters. It is known that even for $t=1$, a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2/s$. For $t > 1$, the estimator takes the median of $2t-1$ independent estimates, and the probability that the estimate is off by more than $2 \|v\|_2/\sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of Count-Sketch, showing an improvement in variance to $O(\min\{\|v\|_1^2/s^2,\|v\|_2^2/s\})$ when $t > 1$. That is, the variance decreases proportionally to $s^{-2}$, asymptotically for large enough $s$. We also study the variance in the setting where an inner product is to be estimated from two CountSketches. This finding suggests that the Feature Hashing method, which is essentially identical to CountSketch but does not make use of the median estimator, can be made more reliable at a small cost in settings where using a median estimator is possible. We confirm our theoretical findings in experiments and thereby help justify why a small constant number of estimates often suffice in practice. Our improved variance bounds are based on new general theorems about the variance and higher moments of the median of i.i.d. random variables that may be of independent interest.
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)
WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement
Cohen, Edith, Pagh, Rasmus, Woodruff, David P.
Weighted sampling is a fundamental tool in data analysis and machine learning pipelines. Samples are used for efficient estimation of statistics or as sparse representations of the data. When weight distributions are skewed, as is often the case in practice, without-replacement (WOR) sampling is much more effective than with-replacement (WR) sampling: it provides a broader representation and higher accuracy for the same number of samples. We design novel composable sketches for WOR $\ell_p$ sampling, weighted sampling of keys according to a power $p\in[0,2]$ of their frequency (or for signed data, sum of updates). Our sketches have size that grows only linearly with the sample size. Our design is simple and practical, despite intricate analysis, and based on off-the-shelf use of widely implemented heavy hitters sketches such as CountSketch. Our method is the first to provide WOR sampling in the important regime of $p>1$ and the first to handle signed updates for $p>0$.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- North America > United States > Virginia > Alexandria County > Alexandria (0.04)
- (11 more...)
Advances and Open Problems in Federated Learning
Kairouz, Peter, McMahan, H. Brendan, Avent, Brendan, Bellet, Aurélien, Bennis, Mehdi, Bhagoji, Arjun Nitin, Bonawitz, Keith, Charles, Zachary, Cormode, Graham, Cummings, Rachel, D'Oliveira, Rafael G. L., Rouayheb, Salim El, Evans, David, Gardner, Josh, Garrett, Zachary, Gascón, Adrià, Ghazi, Badih, Gibbons, Phillip B., Gruteser, Marco, Harchaoui, Zaid, He, Chaoyang, He, Lie, Huo, Zhouyuan, Hutchinson, Ben, Hsu, Justin, Jaggi, Martin, Javidi, Tara, Joshi, Gauri, Khodak, Mikhail, Konečný, Jakub, Korolova, Aleksandra, Koushanfar, Farinaz, Koyejo, Sanmi, Lepoint, Tancrède, Liu, Yang, Mittal, Prateek, Mohri, Mehryar, Nock, Richard, Özgür, Ayfer, Pagh, Rasmus, Raykova, Mariana, Qi, Hang, Ramage, Daniel, Raskar, Ramesh, Song, Dawn, Song, Weikang, Stich, Sebastian U., Sun, Ziteng, Suresh, Ananda Theertha, Tramèr, Florian, Vepakomma, Praneeth, Wang, Jianyu, Xiong, Li, Xu, Zheng, Yang, Qiang, Yu, Felix X., Yu, Han, Zhao, Sen
FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges. Peter Kairouz and H. Brendan McMahan conceived, coordinated, and edited this work.
- North America > United States > California > San Francisco County > San Francisco (0.27)
- North America > United States > Virginia (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (29 more...)
- Overview (1.00)
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.67)
- Research Report > Promising Solution (0.45)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- (3 more...)