OPORP: One Permutation + One Random Projection
–arXiv.org Artificial Intelligence
Consider two $D$-dimensional data vectors (e.g., embeddings): $u, v$. In many embedding-based retrieval (EBR) applications where the vectors are generated from trained models, $D=256\sim 1024$ are common. In this paper, OPORP (one permutation + one random projection) uses a variant of the ``count-sketch'' type of data structures for achieving data reduction/compression. With OPORP, we first apply a permutation on the data vectors. A random vector $r$ is generated i.i.d. with moments: $E(r_i) = 0, E(r_i^2)=1, E(r_i^3) =0, E(r_i^4)=s$. We multiply (as dot product) $r$ with all permuted data vectors. Then we break the $D$ columns into $k$ equal-length bins and aggregate (i.e., sum) the values in each bin to obtain $k$ samples from each data vector. One crucial step is to normalize the $k$ samples to the unit $l_2$ norm. We show that the estimation variance is essentially: $(s-1)A + \frac{D-k}{D-1}\frac{1}{k}\left[ (1-\rho^2)^2 -2A\right]$, where $A\geq 0$ is a function of the data ($u,v$). This formula reveals several key properties: (1) We need $s=1$. (2) The factor $\frac{D-k}{D-1}$ can be highly beneficial in reducing variances. (3) The term $\frac{1}{k}(1-\rho^2)^2$ is a substantial improvement compared with $\frac{1}{k}(1+\rho^2)$, which corresponds to the un-normalized estimator. We illustrate that by letting the $k$ in OPORP to be $k=1$ and repeat the procedure $m$ times, we exactly recover the work of ``very spars random projections'' (VSRP). This immediately leads to a normalized estimator for VSRP which substantially improves the original estimator of VSRP. In summary, with OPORP, the two key steps: (i) the normalization and (ii) the fixed-length binning scheme, have considerably improved the accuracy in estimating the cosine similarity, which is a routine (and crucial) task in modern embedding-based retrieval (EBR) applications.
arXiv.org Artificial Intelligence
May-23-2023
- Country:
- Oceania
- North America
- United States
- District of Columbia > Washington (0.04)
- Maryland > Baltimore (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Texas > Dallas County
- Dallas (0.04)
- New York
- New York County > New York City (0.04)
- Kings County > New York City (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Georgia > Fulton County
- Atlanta (0.04)
- Washington > King County
- Seattle (0.04)
- Pennsylvania
- Philadelphia County > Philadelphia (0.04)
- Allegheny County > Pittsburgh (0.04)
- Alaska > Anchorage Municipality
- Anchorage (0.04)
- California
- San Francisco County > San Francisco (0.14)
- Santa Clara County > Stanford (0.04)
- San Mateo County > Foster City (0.04)
- San Diego County > San Diego (0.04)
- Colorado > Boulder County
- Boulder (0.04)
- Canada
- United States
- Europe
- Italy (0.04)
- Sweden > Stockholm
- Stockholm (0.04)
- Spain
- Catalonia > Barcelona Province
- Barcelona (0.04)
- Andalusia > Granada Province
- Granada (0.04)
- Catalonia > Barcelona Province
- Germany > Bavaria
- Upper Bavaria > Munich (0.04)
- France
- Île-de-France > Paris
- Paris (0.04)
- Hauts-de-France > Nord
- Lille (0.04)
- Île-de-France > Paris
- Asia
- Singapore (0.04)
- Middle East > Qatar
- Japan > Honshū
- Kantō > Kanagawa Prefecture
- Yokohama (0.04)
- Kansai > Osaka Prefecture
- Osaka (0.04)
- Kantō > Kanagawa Prefecture
- China
- Afghanistan > Parwan Province
- Charikar (0.04)
- Africa
- The Gambia (0.04)
- Ethiopia > Addis Ababa
- Addis Ababa (0.04)
- Genre:
- Research Report (0.81)
- Industry:
- Information Technology > Security & Privacy (0.46)
- Technology: