Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling
–Neural Information Processing Systems
Jaccard similarity is widely used as a distance measure in many machine learning and search applications. Typically, hashing methods are essential for the use of Jaccard similarity to be practical in large-scale settings. For hashing binary (0/1) data, the idea of one permutation hashing (OPH) with densification significantly accelerates traditional minwise hashing algorithms while providing unbiased and accurate estimates. In this paper, we propose a strategy named "re-randomization" in the process of densification that could achieve the smallest variance among all densification schemes. The success of this idea naturally inspires us to generalize one permutation hashing to weighted (non-binary) data, which results in the socalled "bin-wise consistent weighted sampling (BCWS)" algorithm.
Neural Information Processing Systems
Oct-10-2024, 15:24:39 GMT
- Technology: