One Permutation Hashing

Mar-14-2024, 20:33:40 GMT–Neural Information Processing Systems

Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear-time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying many permutations (e.g., k = 200 to 500) to the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; for each data vector, we store the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.
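The scheme described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the use of `None` to mark an empty bin, and the choice to store the offset within each bin are all assumptions made here. The similarity estimate (matching bins divided by the bins that are non-empty in at least one sketch) is not stated in the abstract; it is assumed from the longer arXiv version and should be checked against it.

```python
import random

EMPTY = None  # assumed marker for a bin that contains no nonzero entry


def one_permutation_hash(nonzero_cols, perm, k):
    """Sketch one binary vector using a single column permutation.

    nonzero_cols: column indices where the vector is 1.
    perm: list such that perm[j] is the permuted position of column j.
    k: number of bins; len(perm) must be divisible by k.
    Returns a list of k entries: the smallest permuted offset inside each
    bin, or EMPTY if the vector has no nonzero in that bin.
    """
    dim = len(perm)
    if dim % k != 0:
        raise ValueError("number of columns must be divisible by k")
    bin_width = dim // k
    sketch = [EMPTY] * k
    for j in nonzero_cols:
        # Bin index and offset of this nonzero after the permutation.
        b, offset = divmod(perm[j], bin_width)
        if sketch[b] is EMPTY or offset < sketch[b]:
            sketch[b] = offset
    return sketch


def estimate_resemblance(s1, s2):
    """Assumed estimator: matches / (k - jointly empty bins)."""
    matches = 0
    jointly_empty = 0
    for a, b in zip(s1, s2):
        if a is EMPTY and b is EMPTY:
            jointly_empty += 1
        elif a == b:
            matches += 1
    usable = len(s1) - jointly_empty
    return matches / usable if usable else 0.0


if __name__ == "__main__":
    dim, k = 16, 4
    rng = random.Random(0)      # fixed seed so both vectors share one permutation
    perm = list(range(dim))
    rng.shuffle(perm)
    x = one_permutation_hash([2, 4, 7, 13], perm, k)
    y = one_permutation_hash([0, 6, 13], perm, k)
    print(x, y, estimate_resemblance(x, y))
```

The same permutation must be reused for every data vector, which is what lets the preprocessing cost drop from k passes over the data to one.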

accuracy, minwise, permutation

Genre:
- Research Report > New Finding (0.49)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.35)