RETSim: Resilient and Efficient Text Similarity
Zhang, Marina, Vallis, Owen, Bumin, Aysegul, Vakharia, Tanay, Bursztein, Elie
arXiv.org Artificial Intelligence
This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. We also introduce the W4NT3D benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings. RETSim and the W4NT3D benchmark are open-sourced under the MIT License at https://github.com/google/unisim.
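As a rough illustration of the near-duplicate retrieval task described above, the sketch below embeds each text as a vector and flags pairs whose cosine similarity exceeds a threshold. It does not use RETSim or the unisim package: `toy_embed` is a hypothetical stand-in that hashes character trigrams into a fixed-size vector, and the names `toy_embed`, `cosine`, `near_duplicates`, the dimension of 256, and the 0.8 threshold are all assumptions chosen for this example.

```python
import hashlib
import math
from collections import Counter

DIM = 256  # size of the toy embedding (an assumption, not RETSim's dimension)


def toy_embed(text, n=3):
    """Hash character n-grams into an L2-normalized vector.

    Hypothetical stand-in for a learned metric embedding; this is NOT RETSim.
    """
    text = text.lower()
    grams = [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]
    vec = [0.0] * DIM
    for gram, count in Counter(grams).items():
        # Use a stable hash so results do not vary between Python runs.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % DIM
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


def near_duplicates(texts, threshold=0.8):
    """Return (i, j, similarity) for every pair at or above the threshold."""
    embeddings = [toy_embed(t) for t in texts]
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            sim = cosine(embeddings[i], embeddings[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs


if __name__ == "__main__":
    docs = [
        "The quick brown fox jumps over the lazy dog",
        "The quick brown fox jumps over the lazy d0g",  # character substitution
        "Stock markets closed higher on Friday",
    ]
    for i, j, sim in near_duplicates(docs):
        print(i, j, round(sim, 3))
```

The all-pairs comparison is quadratic in the number of texts, so it only suits small collections; at dataset scale an approximate nearest-neighbor index would typically replace the inner loops. A hashed trigram vector is also far less robust to adversarial edits than a trained model, which is the gap the paper targets.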
Nov-28-2023
- Country:
- North America > United States > Montana (0.15)
- Genre:
- Personal > Obituary (0.68)
- Research Report (0.41)
- Industry:
- Government > Regional Government (0.46)
- Information Technology > Security & Privacy (0.68)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)
- Media (0.68)
- Technology: