Hashing Algorithms for Large-Scale Learning
Li, Ping, Shrivastava, Anshumali, Moore, Joshua, Konig, Arnd Christian
In this paper, we first demonstrate that b-bit minwise hashing, whose estimators are positive definite kernels, can be naturally integrated with learning algorithms such as SVM and logistic regression. We adopt a simple scheme to transform the nonlinear (resemblance) kernel into linear (inner product) kernel; and hence large-scale problems can be solved extremely efficiently. Our method provides a simple effective solution to large-scale learning in massive and extremely high-dimensional datasets, especially when data do not fit in memory. We then compare b-bit minwise hashing with the Vowpal Wabbit (VW) algorithm (which is related the Count-Min (CM) sketch). Interestingly, VW has the same variances as random projections. Our theoretical and empirical comparisons illustrate that usually $b$-bit minwise hashing is significantly more accurate (at the same storage) than VW (and random projections) in binary data. Furthermore, $b$-bit minwise hashing can be combined with VW to achieve further improvements in terms of training speed, especially when $b$ is large.
Jun-6-2011
- Country:
- Europe (1.00)
- North America > United States
- California > Santa Clara County (0.28)
- Genre:
- Research Report (1.00)
- Industry:
- Information Technology (0.46)
- Technology: