Clustering idea for very large datasets
Let's say you have to cluster 10 million points, for instance keywords. Clustering methods such as those based on k-NN (k-nearest neighbors) are typically O(n²) or worse from a computational complexity point of view, which is prohibitive at this scale. Has anyone ever used a clustering method based on sampling? The idea is to start by sampling 1% (or less) of the 10,000,000 entries, and perform clustering on this sample of keywords to create a "seed" or "baseline" cluster structure. The next step is to browse sequentially through all 10,000,000 keywords and, for each keyword, find the closest cluster in the baseline cluster structure.
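Here is a minimal sketch of the idea in Python, assuming the keywords have already been embedded as numeric vectors (the post doesn't specify a representation, so the embedding, the sizes, and the use of k-means as the baseline clusterer are all illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Stand-in data: in the scenario above this would be 10,000,000 keywords,
# already embedded as numeric vectors (e.g. TF-IDF). Sizes are scaled
# down here purely for illustration.
n_points, dim = 1_000_000, 16
X = rng.standard_normal((n_points, dim)).astype(np.float32)

# Step 1: sample ~1% of the entries.
sample_idx = rng.choice(n_points, size=n_points // 100, replace=False)

# Step 2: build the "seed"/"baseline" cluster structure on the sample
# only, so the expensive clustering step never sees the full dataset.
# k-means with k=100 is an arbitrary choice of baseline clusterer.
seed = KMeans(n_clusters=100, n_init=3).fit(X[sample_idx])

# Step 3: one sequential pass over all points, assigning each one to
# the nearest baseline cluster (chunked to keep memory bounded).
nn = NearestNeighbors(n_neighbors=1).fit(seed.cluster_centers_)
labels = np.empty(n_points, dtype=np.int32)
chunk = 100_000
for start in range(0, n_points, chunk):
    _, idx = nn.kneighbors(X[start:start + chunk])
    labels[start:start + chunk] = idx.ravel()
```

With this split, the quadratic (or worse) cost is paid only on the 1% sample, while the pass over the full dataset is linear in n times the number of baseline clusters.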
Apr-22-2016, 10:00:27 GMT