On Adaptive Distance Estimation
We provide a static data structure for distance estimation which supports {\it adaptive} queries. Concretely, given a dataset $X = \{x_i\}_{i = 1}^n$ of $n$ points in $\mathbb{R}^d$ and $0 < p \leq 2$, we construct a randomized data structure with low memory consumption and query time which, when later given any query point $q \in \mathbb{R}^d$, outputs a $(1+\varepsilon)$-approximation of $\|q - x_i\|_p$ with high probability for all $i\in[n]$. The main novelty is that our data structure's correctness guarantee holds even when the sequence of queries is chosen adaptively: an adversary is allowed to choose the $j$th query point $q_j$ in a way that depends on the answers reported by the data structure for $q_1,\ldots,q_{j-1}$. Previous randomized Monte Carlo methods do not provide error guarantees in the setting of adaptively chosen queries. Our memory consumption is $\tilde O(nd/\varepsilon^2)$, slightly more than the $O(nd)$ required to store $X$ in memory explicitly, but with the benefit that our time to answer queries is only $\tilde O(\varepsilon^{-2}(n + d))$, much faster than the naive $\Theta(nd)$ time obtained from a linear scan when $n$ and $d$ are very large.
Review for NeurIPS paper: On Adaptive Distance Estimation
Summary and Contributions: The paper presents a data structure that can output a (1+eps)-approximation to the Euclidean distance from a query point q in R^d to each of n preprocessed points in time O(n/eps^2) per query q, where for simplicity I assume d <= n and omit logarithmic factors. The method works not only for Euclidean distances, but for every l_p norm, 0 < p <= 2. In comparison, a straightforward computation takes time O(nd). The key feature of this randomized data structure is that it works even for an *adaptive* sequence of queries. This is trivial for deterministic data structures, but might not hold for randomized ones, including all currently known ones, which are all variants of the JL dimension reduction from dimension d to dimension 1/eps^2. This robustness to adaptive queries is becoming increasingly important, if not crucial, in modern data analysis, which relies heavily on applying such procedures repeatedly and is prone to correlations. The solution is relatively simple technically.
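To make the baseline concrete, here is a minimal sketch (not the paper's construction) of the classical Johnson-Lindenstrauss approach the review refers to: project the dataset once with a random Gaussian matrix, then estimate all n Euclidean distances from a query in O((n + d)/eps^2) time. All names and parameter choices below are illustrative; crucially, this baseline's guarantee holds only for queries chosen independently of the projection's randomness, which is exactly the limitation the paper removes.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, eps = 1000, 512, 0.25
m = int(np.ceil(8 / eps**2))      # target dimension, on the order of 1/eps^2

X = rng.standard_normal((n, d))   # dataset of n points in R^d

# Preprocessing: one random Gaussian projection, applied to every point.
# We store the m-dimensional sketches GX instead of scanning X at query time.
G = rng.standard_normal((m, d)) / np.sqrt(m)
GX = X @ G.T                      # shape (n, m)

def estimate_distances(q):
    """(1 +/- eps)-estimates of ||q - x_i||_2 for all i, for a NON-adaptive q."""
    Gq = G @ q                                 # O(md)
    return np.linalg.norm(GX - Gq, axis=1)     # O(nm)

q = rng.standard_normal(d)        # a query independent of G
est = estimate_distances(q)
true = np.linalg.norm(X - q, axis=1)
max_rel_err = np.max(np.abs(est - true) / true)
```

An adversary who sees many answers can reconstruct information about G and craft a query q in its (approximate) null space, making the estimates arbitrarily bad; the paper's contribution is a data structure whose guarantee survives such adaptively chosen queries at comparable query time.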