Sampling-Based Estimation of Jaccard Containment and Similarity
Estimating set similarity measures is a fundamental problem in data analysis, with applications in information retrieval, database systems, and streaming algorithms. Among such measures, the Jaccard containment of two sets A, B--defined as ϕ = | A B | |A | [0, 1] when A is treated as the reference set--is particularly important in asymmetric comparison tasks, such as detecting near-duplicates or containment-based joins. In large-scale settings, exact computation of ϕ may be infeasible, as it requires full knowledge of both sets. Sampling-based estimators that use small random subsets P A and Q B can be used as scalable alternatives when the sizes |A |, |B | are known, such as in Oracle databases. This paper presents a theoretical analysis of the likelihood models and estimation strategies for Jaccard containment based on random samples, focusing on both empirical performance and statistical guarantees.
Jul-22-2025
- Country:
- Europe > United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Oxfordshire > Oxford (0.04)
- Europe > United Kingdom > England
- Genre:
- Research Report (0.50)
- Technology: