Goto

Collaborating Authors

 b-bit minwise


Hashing Algorithms for Large-Scale Learning

Neural Information Processing Systems

Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b -bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.


One Permutation Hashing

Neural Information Processing Systems

Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time nearneighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.,) k = 200 to 500 permutations on the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.



Hashing Algorithms for Large-Scale Learning

Neural Information Processing Systems

Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare $b$-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.


Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search

Neural Information Processing Systems

We go beyond the notion of pairwise similarity and look into search problems with $k$-way similarity functions. In this paper, we focus on problems related to \emph{3-way Jaccard} similarity: $\mathcal{R}^{3way}= \frac{|S_1 \cap S_2 \cap S_3|}{|S_1 \cup S_2 \cup S_3|}$, $S_1, S_2, S_3 \in \mathcal{C}$, where $\mathcal{C}$ is a size $n$ collection of sets (or binary vectors). We show that approximate $\mathcal{R}^{3way}$ similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to $k$-way resemblance. In the process, we extend traditional framework of \emph{locality sensitive hashing (LSH)} to handle higher order similarities, which could be of independent theoretical interest. The applicability of $\mathcal{R}^{3way}$ search is shown on the Google sets" application. In addition, we demonstrate the advantage of $\mathcal{R}^{3way}$ resemblance over the pairwise case in improving retrieval quality."


One Permutation Hashing

Neural Information Processing Systems

While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) $k=500$ permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple \textbf{one permutation hashing} scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing $k$ permutations to just one would be much more \textbf{energy-efficient}, which might be an important perspective as minwise hashing is commonly deployed in the search industry. While the theoretical probability analysis is interesting, our experiments on similarity estimation and SVM \& logistic regression also confirm the theoretical results.


Hashing Algorithms for Large-Scale Learning

Neural Information Processing Systems

Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare $b$-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.


Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW)

arXiv.org Machine Learning

We generated a dataset of 200 GB with 10^9 features, to test our recent b-bit minwise hashing algorithms for training very large-scale logistic regression and SVM. The results confirm our prior work that, compared with the VW hashing algorithm (which has the same variance as random projections), b-bit minwise hashing is substantially more accurate at the same storage. For example, with merely 30 hashed values per data point, b-bit minwise hashing can achieve similar accuracies as VW with 2^14 hashed values per data point. We demonstrate that the preprocessing cost of b-bit minwise hashing is roughly on the same order of magnitude as the data loading time. Furthermore, by using a GPU, the preprocessing cost can be reduced to a small fraction of the data loading time. Minwise hashing has been widely used in industry, at least in the context of search. One reason for its popularity is that one can efficiently simulate permutations by (e.g.,) universal hashing. In other words, there is no need to store the permutation matrix. In this paper, we empirically verify this practice, by demonstrating that even using the simplest 2-universal hashing does not degrade the learning performance.


Accurate Estimators for Improving Minwise Hashing and b-Bit Minwise Hashing

arXiv.org Machine Learning

Minwise hashing is the standard technique in the context of search and databases for efficiently estimating set (e.g., high-dimensional 0/1 vector) similarities. Recently, b-bit minwise hashing was proposed which significantly improves upon the original minwise hashing in practice by storing only the lowest b bits of each hashed value, as opposed to using 64 bits. b-bit hashing is particularly effective in applications which mainly concern sets of high similarities (e.g., the resemblance >0.5). However, there are other important applications in which not just pairs of high similarities matter. For example, many learning algorithms require all pairwise similarities and it is expected that only a small fraction of the pairs are similar. Furthermore, many applications care more about containment (e.g., how much one object is contained by another object) than the resemblance. In this paper, we show that the estimators for minwise hashing and b-bit minwise hashing used in the current practice can be systematically improved and the improvements are most significant for set pairs of low resemblance and high containment.


Hashing Algorithms for Large-Scale Learning

arXiv.org Machine Learning

In this paper, we first demonstrate that b-bit minwise hashing, whose estimators are positive definite kernels, can be naturally integrated with learning algorithms such as SVM and logistic regression. We adopt a simple scheme to transform the nonlinear (resemblance) kernel into linear (inner product) kernel; and hence large-scale problems can be solved extremely efficiently. Our method provides a simple effective solution to large-scale learning in massive and extremely high-dimensional datasets, especially when data do not fit in memory. We then compare b-bit minwise hashing with the Vowpal Wabbit (VW) algorithm (which is related the Count-Min (CM) sketch). Interestingly, VW has the same variances as random projections. Our theoretical and empirical comparisons illustrate that usually $b$-bit minwise hashing is significantly more accurate (at the same storage) than VW (and random projections) in binary data. Furthermore, $b$-bit minwise hashing can be combined with VW to achieve further improvements in terms of training speed, especially when $b$ is large.