Objective-Based Hierarchical Clustering of Deep Embedding Vectors

Naumov, Stanislav, Yaroslavtsev, Grigory, Avdiukhin, Dmitrii

Dec-15-2020–arXiv.org Machine Learning

We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).

algorithm, dataset, objective, (16 more...)

arXiv.org Machine Learning

Dec-15-2020

arXiv.org PDF

Add feedback

Country:
- North America
  - United States
    - Kansas (0.04)
    - Indiana (0.04)
    - New York > New York County
      - New York City (0.04)
    - California
      - Santa Clara County > Palo Alto (0.04)
      - Alameda County > Berkeley (0.04)
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
- Europe
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - Netherlands > North Holland
    - Amsterdam (0.04)
  - Austria > Styria
    - Graz (0.04)
- Asia > Afghanistan
  - Parwan Province > Charikar (0.25)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found