AITopics | minwise

minwise

Simple and Efficient Weighted Minwise Hashing

Anshumali Shrivastava

Neural Information Processing Systems | Apr-22-2026, 01:53:20 GMT

Weighted minwise hashing (WMH) is one of the fundamental subroutines required by many celebrated approximation algorithms commonly adopted in industrial practice for large-scale search and learning. The resource bottleneck with WMH is the computation of multiple (typically a few hundred to thousands) independent hashes of the data. We propose a simple rejection-type sampling scheme based on a carefully designed red-green map, where we show that the number of rejected samples has exactly the same distribution as weighted minwise sampling. The running time of our method, for many practical datasets, is an order of magnitude smaller than existing methods. Experimental evaluations on real datasets show that for computing 500 WMH, our proposal can be 60000x faster than Ioffe's method without losing any accuracy. Our method is also around 100x faster than approximate heuristics capitalizing on the efficient "densified" one permutation hashing schemes [26, 27]. Given the simplicity of our approach and its significant advantages, we hope that it will replace existing implementations in practice.
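
The abstract does not spell out the red-green map, so the following is only a rough sketch of a rejection-sampling hash of this kind, not the authors' implementation. It assumes each component i has a known upper bound bounds[i], lays the components out as consecutive intervals on a line, and treats the first x[i] units of interval i as "green". The function name wmh_hash and all parameter names are hypothetical.

```python
import bisect
import random
from itertools import accumulate


def wmh_hash(x, bounds, seed):
    """Return the number of rejected draws before the first green hit.

    x      : non-negative weights, with x[i] <= bounds[i] (assumed)
    bounds : per-component upper bounds (an assumption of this sketch)
    seed   : shared seed, so every vector sees the same draw sequence
    """
    if not any(v > 0 for v in x):
        raise ValueError("all-zero vector: sampling would never terminate")
    # starts[i] is the left end of component i's interval; starts[-1] is the total length.
    starts = [0.0] + list(accumulate(bounds))
    total = starts[-1]
    rng = random.Random(seed)
    rejected = 0
    while True:
        r = rng.uniform(0.0, total)
        # Locate the component whose interval contains r.
        i = min(bisect.bisect_right(starts, r) - 1, len(x) - 1)
        if r - starts[i] < x[i]:      # r landed in the green part of interval i
            return rejected
        rejected += 1                 # red region: reject and draw again
```

Under these assumptions, two vectors hashed with the same seed collide exactly when the first draw that is green for either of them is green for both, which happens with probability sum(min(x_i, y_i)) / sum(max(x_i, y_i)), the weighted Jaccard similarity. Using k different seeds gives k hashes.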

artificial intelligence, data mining, machine learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States (0.46)
North America > Canada > Quebec (0.28)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Simple and Efficient Weighted Minwise Hashing

Neural Information Processing Systems | Mar-17-2026, 11:05:48 GMT

Weighted minwise hashing (WMH) is one of the fundamental subroutines required by many celebrated approximation algorithms commonly adopted in industrial practice for large-scale search and learning. The resource bottleneck with WMH is the computation of multiple (typically a few hundred to thousands) independent hashes of the data. We propose a simple rejection-type sampling scheme based on a carefully designed red-green map, where we show that the number of rejected samples has exactly the same distribution as weighted minwise sampling. The running time of our method, for many practical datasets, is an order of magnitude smaller than existing methods. Experimental evaluations on real datasets show that for computing 500 WMH, our proposal can be 60000x faster than Ioffe's method without losing any accuracy. Our method is also around 100x faster than approximate heuristics capitalizing on the efficient "densified" one permutation hashing schemes~\cite{Proc:OneHashLSH UAI14}. Given the simplicity of our approach and its significant advantages, we hope that it will replace existing implementations in practice.

artificial intelligence, machine learning, neural information processing system 29, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.44)

Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling

Ping Li, Xiaoyun Li, Cun-Hui Zhang

Neural Information Processing Systems | Feb-13-2026, 06:39:58 GMT

Typically, hashing methods are essential for the use of Jaccard similarity to be practical in large-scale settings.

artificial intelligence, data mining, machine learning, (18 more...)

Neural Information Processing Systems

Country:

Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
Oceania > Australia > New South Wales > Sydney (0.05)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.05)
(12 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Data Science > Data Mining (0.68)

Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling

Ping Li, Xiaoyun Li, Cun-Hui Zhang

Neural Information Processing Systems | Oct-3-2025, 08:13:22 GMT

Typically, hashing methods are essential for the use of Jaccard similarity to be practical in large-scale settings.

international conference, non-empty bin, proceedings, (13 more...)

Neural Information Processing Systems

Country:

Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
(12 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

One Permutation Hashing

Neural Information Processing Systems | Mar-14-2024, 20:33:40 GMT

Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear-time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.,) k = 200 to 500 permutations on the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.
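
The conceptual description above (permute once, split into k bins, keep the smallest nonzero location per bin) can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the function names, the use of None for an empty bin, the requirement that the dimension be divisible by k, and the similarity estimator (matching bins divided by bins that are not empty in both sketches) are all assumptions of this sketch.

```python
import random


def one_permutation_hash(nonzeros, dim, k, seed=0):
    """Sketch one binary vector, given the column indices of its nonzeros.

    Returns k entries: the smallest permuted offset inside each bin,
    or None when the bin holds no nonzero (an assumption of this sketch).
    """
    if dim % k != 0:
        raise ValueError("this sketch assumes dim is divisible by k")
    rng = random.Random(seed)      # the same seed gives every vector the same permutation
    perm = list(range(dim))
    rng.shuffle(perm)              # perm[col] is the permuted position of column col
    bin_size = dim // k
    sketch = [None] * k
    for col in nonzeros:
        bin_index, offset = divmod(perm[col], bin_size)
        if sketch[bin_index] is None or offset < sketch[bin_index]:
            sketch[bin_index] = offset
    return sketch


def estimate_resemblance(s1, s2):
    """Matching bins divided by the bins that are non-empty in at least one sketch."""
    matches = 0
    jointly_empty = 0
    for a, b in zip(s1, s2):
        if a is None and b is None:
            jointly_empty += 1
        elif a == b:
            matches += 1
    usable = len(s1) - jointly_empty
    return matches / usable if usable else 0.0
```

For example, one_permutation_hash({1, 5, 9}, dim=64, k=8) yields an 8-entry sketch from a single permutation, where k-permutation minwise hashing would need 8 separate permutations for the same number of hash values.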

accuracy, minwise, permutation, (11 more...)

Neural Information Processing Systems

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(9 more...)

Genre: Research Report > New Finding (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.35)

Beyond Pairwise: Provably Fast Algorithms for Approximate k-Way Similarity Search

Neural Information Processing Systems | Mar-13-2024, 23:18:39 GMT

We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions.

log 1, resemblance, similarity, (16 more...)

Neural Information Processing Systems

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.05)
North America > United States > Texas > Dallas County > Dallas (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
(9 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)

Simple and Efficient Weighted Minwise Hashing

Neural Information Processing Systems | Mar-12-2024, 17:13:24 GMT

algorithm, dataset, minwise, (15 more...)

Neural Information Processing Systems

Country:

Europe > Italy (0.04)
North America > United States > Texas > Harris County > Houston (0.04)
North America > United States > New York > Kings County > New York City (0.04)
(5 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

One Permutation Hashing

Neural Information Processing Systems | Feb-16-2024, 07:51:25 GMT

While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) k = 500 permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple one permutation hashing scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing k permutations to just one would be much more energy-efficient, which might be an important perspective as minwise hashing is commonly deployed in the search industry.

minwise, permutation hashing

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.46)

Pb-Hash: Partitioned b-bit Hashing

Ping Li, Weijie Zhao

arXiv.org Artificial Intelligence | Jun-28-2023

Many hashing algorithms including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS) generate integers of $B$ bits. With $k$ hashes for each data vector, the storage would be $B\times k$ bits; and when used for large-scale learning, the model size would be $2^B\times k$, which can be expensive. A standard strategy is to use only the lowest $b$ bits out of the $B$ bits and somewhat increase $k$, the number of hashes. In this study, we propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $b\times m =B$. Correspondingly, the model size becomes $m\times 2^b \times k$, which can be substantially smaller than the original $2^B\times k$. Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy would drop. In other words, using $m$ chunks of $B/m$ bits would not be as accurate as directly using $B$ bits. This is due to the correlation from re-using the same hash. On the other hand, our analysis also shows that the accuracy would not drop much for (e.g.,) $m=2\sim 4$. In some regions, Pb-Hash still works well even for $m$ much larger than 4. We expect Pb-Hash would be a good addition to the family of hashing methods/applications and benefit industrial practitioners. We verify the effectiveness of Pb-Hash in machine learning tasks, for linear SVM models as well as deep learning models. Since the hashed data are essentially categorical (ID) features, we follow the standard practice of using embedding tables for each hash. With Pb-Hash, we need to design an effective strategy to combine $m$ embeddings. Our study provides an empirical evaluation on four pooling schemes: concatenation, max pooling, mean pooling, and product pooling. There is no definite answer which pooling would be always better and we leave that for future study.
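
The bit-partitioning idea and the model-size arithmetic in the abstract can be illustrated with a small sketch. The helper names split_hash and model_size are hypothetical, and taking chunks from the lowest bits upward is an assumption; the abstract only states that the $B$ bits are partitioned into $m$ chunks with $b\times m = B$.

```python
def split_hash(h, B, m):
    """Split a B-bit hash value into m chunks of b = B // m bits each.

    Chunks are returned lowest bits first (an assumption of this sketch).
    """
    if B % m != 0:
        raise ValueError("this sketch assumes B is divisible by m")
    b = B // m
    mask = (1 << b) - 1
    return [(h >> (i * b)) & mask for i in range(m)]


def model_size(B, k, m=1):
    """Number of model weights: m * 2^b * k with b = B // m.

    With m = 1 this reduces to the original 2^B * k.
    """
    if B % m != 0:
        raise ValueError("this sketch assumes B is divisible by m")
    b = B // m
    return m * (2 ** b) * k
```

With B = 16 and k = 100, for instance, model_size(16, 100) is 6,553,600 while model_size(16, 100, m=2) is 51,200, which shows how quickly the table shrinks once the same hash is reused as two 8-bit chunks.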

artificial intelligence, machine learning, proceedings, (16 more...)

arXiv.org Artificial Intelligence

2306.15944

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Afghanistan > Parwan Province > Charikar (0.05)
North America > United States > New York > New York County > New York City (0.05)
(34 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.68)

One Permutation Hashing

Ping Li, Art Owen, Cun-Hui Zhang

Neural Information Processing Systems | Feb-15-2020, 00:27:11 GMT

While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) $k = 500$ permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple one permutation hashing scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing $k$ permutations to just one would be much more energy-efficient, which might be an important perspective as minwise hashing is commonly deployed in the search industry.

minwise, permutation hashing

Neural Information Processing Systems

Genre: Research Report (0.32)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.52)
