$b$-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches

Oct-18-2019–arXiv.org Machine Learning

--Recently, randomly mapping vectorial data to strings of discrete symbols (i.e., sketches) for fast and space-efficient similarity searches has become popular . Such random mapping is called similarity-preserving hashing and approximates a similarity metric by using the Hamming distance. Although many efficient similarity searches have been proposed, most of them are designed for binary sketches. Similarity searches on integer sketches are in their infancy. In this paper, we present a novel space-efficient trie named b -bit sketch trie on integer sketches for scalable similarity searches by leveraging the idea behind succinct data structures (i.e., space-efficient data structures while supporting various data operations in the compressed format) and a favorable property of integer sketches as fixed-length strings. Our experimental results obtained using real-world datasets show that a trie-based index is built from integer sketches and efficiently performs similarity searches on the index by pruning useless portions of the search space, which greatly improves the search time and space-efficiency of the similarity search. The experimental results show that our similarity search is at most one order of magnitude faster than state-of-the-art similarity searches. Besides, our method needs only 10 GiB of memory on a billion-scale database, while state-of-the-art similarity searches need 29 GiB of memory. I NTRODUCTION The similarity search of vectorial data in databases has been a fundamental task in recent data analysis, and it has various applications such as near duplicate detection in a collection of web pages [1], context-based retrieval in images [2], and functional analysis of molecules [3]. Recently, databases in these applications have become large, and vectorial data in these databases also have been high dimensional, which makes it difficult to apply existing similarity search methods to such large databases. There is thus a strong need to develop much more powerful methods of similarity search for efficiently analyzing databases on a large-scale. A powerful solution to address this need is similarity-preserving hashing, which intends to approximate a similarity measure by randomly mapping vectorial data in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. Early methods include Sim-Hash for cosine similarity [4], which intends to build binary sketches from vectorial data for approximating cosine similarity.

machine learning, natural language, sketch, (21 more...)

arXiv.org Machine Learning

Oct-18-2019

arXiv.org PDF

Add feedback

Country:
- North America > Mexico
  - Gulf of Mexico (0.04)
- Asia > Japan
  - Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Genre:
- Research Report > New Finding (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Search (1.00)
  - Natural Language > Text Processing (1.00)
  - Machine Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found