One Permutation Hashing
–Neural Information Processing Systems
Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time nearneighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.,) k = 200 to 500 permutations on the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.
Neural Information Processing Systems
Mar-14-2024, 20:33:40 GMT
- Country:
- Asia > Afghanistan
- Parwan Province > Charikar (0.04)
- Europe
- North America
- Canada > British Columbia
- United States
- California > Santa Clara County
- Palo Alto (0.04)
- San Jose (0.04)
- Santa Clara (0.04)
- North Carolina > Wake County
- Raleigh (0.04)
- Oregon (0.04)
- Pennsylvania
- Allegheny County > Pittsburgh (0.04)
- Philadelphia County > Philadelphia (0.04)
- Texas > Dallas County
- Dallas (0.04)
- California > Santa Clara County
- Asia > Afghanistan
- Genre:
- Research Report > New Finding (0.49)
- Technology: