microcluster
Hierarchical Sparse Representation Clustering for High-Dimensional Data Streams
Chen, Jie, Mao, Hua, Gou, Yuanbiao, Peng, Xi
Data stream clustering reveals patterns within continuously arriving, potentially unbounded data sequences. Numerous data stream algorithms have been proposed to cluster data streams. The existing data stream clustering algorithms still face significant challenges when addressing high-dimensional data streams. First, it is intractable to measure the similarities among high-dimensional data objects via Euclidean distances when constructing and merging microclusters. Second, these algorithms are highly sensitive to the noise contained in high-dimensional data streams. In this paper, we propose a hierarchical sparse representation clustering (HSRC) method for clustering high-dimensional data streams. HSRC first employs an $l_1$-minimization technique to learn an affinity matrix for data objects in individual landmark windows with fixed sizes, where the number of neighboring data objects is automatically selected. This approach ensures that highly correlated data samples within clusters are grouped together. Then, HSRC applies a spectral clustering technique to the affinity matrix to generate microclusters. These microclusters are subsequently merged into macroclusters based on their sparse similarity degrees (SSDs). Additionally, HSRC introduces sparsity residual values (SRVs) to adaptively select representative data objects from the current landmark window. These representatives serve as dictionary samples for the next landmark window. Finally, HSRC refines each macrocluster through fine-tuning. In particular, HSRC enables the detection of outliers in high-dimensional data streams via the associated SRVs. The experimental results obtained on several benchmark datasets demonstrate the effectiveness and robustness of HSRC.
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia > China > Sichuan Province > Chengdu (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (8 more...)
McCatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets
Vinces, Braulio V. Sánchez, Cordeiro, Robson L. F., Faloutsos, Christos
How could we have an outlier detector that works even with nondimensional data, and ranks together both singleton microclusters ('one-off' outliers) and nonsingleton microclusters by their anomaly scores? How to obtain scores that are principled in one scalable and 'hands-off' manner? Microclusters of outliers indicate coalition or repetition in fraud activities, etc.; their identification is thus highly desirable. This paper presents McCatch: a new algorithm that detects microclusters by leveraging our proposed 'Oracle' plot (1NN Distance versus Group 1NN Distance). We study 31 real and synthetic datasets with up to 1M data elements to show that McCatch is the only method that answers both of the questions above; and, it outperforms 11 other methods, especially when the data has nonsingleton microclusters or is nondimensional. We also showcase McCatch's ability to detect meaningful microclusters in graphs, fingerprints, logs of network connections, text data, and satellite imagery. For example, it found a 30-elements microcluster of confirmed 'Denial of Service' attacks in the network logs, taking only ~3 minutes for 222K data elements on a stock desktop.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (7 more...)
Minimum Spectral Connectivity Projection Pursuit
Hofmeyr, David P., Pavlidis, Nicos G., Eckley, Idris A.
We study the problem of determining the optimal low dimensional projection for maximising the separability of a binary partition of an unlabelled dataset, as measured by spectral graph theory. This is achieved by finding projections which minimise the second eigenvalue of the graph Laplacian of the projected data, which corresponds to a non-convex, non-smooth optimisation problem. We show that the optimal univariate projection based on spectral connectivity converges to the vector normal to the maximum margin hyperplane through the data, as the scaling parameter is reduced to zero. This establishes a connection between connectivity as measured by spectral graph theory and maximal Euclidean separation. The computational cost associated with each eigen-problem is quadratic in the number of data. To mitigate this issue, we propose an approximation method using microclusters with provable approximation error bounds. Combining multiple binary partitions within a divisive hierarchical model allows us to construct clustering solutions admitting clusters with varying scales and lying within different subspaces. We evaluate the performance of the proposed method on a large collection of benchmark datasets and find that it compares favourably with existing methods for projection pursuit and dimension reduction for data clustering.
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > Wisconsin (0.04)
- (2 more...)
Robust Ensemble Clustering Using Probability Trajectories
Huang, Dong, Lai, Jian-Huang, Wang, Chang-Dong
Note that V Y L Link set of G w ij W eight between two nodes in G G K -elite neighbor graph (K -ENG) V Node set of G . Note that V Y L Link set of G w ij W eight between two nodes in G p ij (1-step) transition probability fromy i to y j P (1-step) transition probability matrix,P { p ij } N N p T ij T -step transition probability fromy i to y j P T T -step transition probability matrix,P T { p T ij } N N p T i: The i -th row ofP T, p T i: { p T i 1,···,p T i N} PT T i Probability trajectory of a random walker starting fromnode y i with lengthT PTS ij Probability trajectory based similarity betweeny i and y j R (0) Set of the initial regions for PTA,R (0) { R (0) 1,···,R (0) R (0) } S (0) Initial similarity matrix for PTA,S (0) { s (0) ij } R (0) R (0) R ( t) Set of thet -step regions for PTA, R ( t) { R ( t) 1,···,R ( t) R ( t) } S ( t) The t -step similarity matrix for PTA,S ( t) { s ( t) ij } R ( t) R ( t) G Microcluster-cluster bipartite graph (MCBG) N Number of nodes in G V Node set of G L Link set of G w ij W eight between two nodes in G A sparse graph termedK -elite neighbor graph (K -ENG) is then constructed with only a small number of probably reliable links. The ENS strategy is a crucial step in our approach. W e argue that using a small number of probably reliable links may lead to significantly better consensus results than using all graph links regardless of their reliability . The random walk process driven by a new transition probability matrix is performed on theK -ENG to explore the global structure information.
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Asia > China > Guangdong Province > Zhuhai (0.04)