Du, Liang
Sharper Error Bounds in Late Fusion Multi-view Clustering Using Eigenvalue Proportion
Du, Liang, Jiang, Henghui, Li, Xiaodong, Guo, Yiqing, Chen, Yan, Li, Feijiang, Zhou, Peng, Qian, Yuhua
Multi-view clustering (MVC) aims to integrate complementary information from multiple views to enhance clustering performance. Late Fusion Multi-View Clustering (LFMVC) has shown promise by synthesizing diverse clustering results into a unified consensus. However, current LFMVC methods struggle with noisy and redundant partitions and often fail to capture high-order correlations across views. To address these limitations, we present a novel theoretical framework for analyzing the generalization error bounds of multiple kernel $k$-means, leveraging local Rademacher complexity and principal eigenvalue proportions. Our analysis establishes a convergence rate of $\mathcal{O}(1/n)$, significantly improving upon the existing rate in the order of $\mathcal{O}(\sqrt{k/n})$. Building on this insight, we propose a low-pass graph filtering strategy within a multiple linear $k$-means framework to mitigate noise and redundancy, further refining the principal eigenvalue proportion and enhancing clustering accuracy. Experimental results on benchmark datasets confirm that our approach outperforms state-of-the-art methods in clustering performance and robustness. The related code is available at https://github.com/csliangdu/GMLKM.
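The low-pass filtering idea can be illustrated compactly. A minimal sketch, assuming a k-NN adjacency matrix W, the symmetrically normalized Laplacian, and the common filter form $(I - L/2)^t$; the paper's exact filter and its integration into the multiple linear $k$-means objective may differ:

```python
import numpy as np

def low_pass_filter(X, W, t=2):
    """Smooth features X with the graph filter H = (I - L/2)^t, where L is the
    symmetrically normalized Laplacian of the adjacency matrix W. High-frequency
    (noisy) components are attenuated while cluster structure is preserved."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(W.shape[0]) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    H = np.eye(W.shape[0]) - 0.5 * L
    Xs = X.copy()
    for _ in range(t):
        Xs = H @ Xs  # each pass suppresses more high-frequency energy
    return Xs
```

Since the eigenvalues of $I - L/2$ lie in $[0, 1]$, repeated application shrinks the high-frequency part of the spectrum and, in effect, raises the principal eigenvalue proportion on which the bound above depends.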
k-HyperEdge Medoids for Clustering Ensemble
Li, Feijiang, Wang, Jieting, Zhang, Liuya, Qian, Yuhua, Jin, Shuai, Yan, Tao, Du, Liang
Clustering ensemble has been a popular research topic in data science due to its ability to improve the robustness of single clustering methods. Many clustering ensemble methods have been proposed, most of which can be categorized into clustering-view and sample-view methods. Clustering-view methods are generally efficient, but they can be affected by unreliability in the base clustering results. Sample-view methods show good performance, but constructing the pairwise sample relation is time-consuming. In this paper, clustering ensemble is formulated as a k-HyperEdge Medoids discovery problem, and a clustering ensemble method based on k-HyperEdge Medoids that combines the strengths of the above two types of methods is proposed. In this method, a set of hyperedges is first selected efficiently from the clustering view; the hyperedges are then diffused and adjusted from the sample view, guided by a hyperedge loss function, to construct an effective k-HyperEdge Medoid set. The loss function is reduced mainly by assigning each sample to the hyperedge with the highest degree of belonging. Theoretical analyses show that the solution can approximate the optimal one, that the assignment method gradually reduces the loss function, and that the estimation of the belonging degree is statistically sound. Experiments on artificial data illustrate the working mechanism of the proposed method. The convergence of the method is verified by experimental analysis on twenty datasets, and its effectiveness and efficiency are verified on the same data against nine representative clustering ensemble algorithms.
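To make the hyperedge formulation concrete, here is a minimal sketch in which every cluster of every base clustering becomes a hyperedge (a binary indicator over samples), and each sample is assigned to the medoid hyperedge with the highest belonging degree; the belonging-degree estimate below is a crude co-association proxy, not the paper's estimator:

```python
import numpy as np

def hyperedges_from_labels(label_mat):
    # Each cluster of each base clustering is one hyperedge, stored as a binary
    # indicator over the n samples. label_mat: (n, n_base_clusterings).
    edges = []
    for j in range(label_mat.shape[1]):
        for c in np.unique(label_mat[:, j]):
            edges.append((label_mat[:, j] == c).astype(float))
    return np.array(edges)              # shape (n_edges, n)

def assign_to_medoids(edges, medoid_idx):
    # Belonging degree of sample i to a medoid hyperedge e: average co-association
    # of i with e's members across all hyperedges (an illustrative proxy).
    co = edges.T @ edges                # (n, n) co-association counts
    member = edges[medoid_idx]          # (k, n) medoid indicators
    degree = (co @ member.T) / np.maximum(member.sum(axis=1), 1)
    return degree.argmax(axis=1)        # consensus labels
```

For instance, `assign_to_medoids(edges, np.random.default_rng(0).choice(len(edges), k, replace=False))` assigns against a random initial medoid set; the paper selects the medoids from the clustering view and then diffuses them rather than fixing them.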
Clustering ensemble algorithm with high-order consistency learning
Gan, Jianwen, Chen, Yan, Zhou, Peng, Du, Liang
Most research on clustering ensemble focuses on designing practical consistency learning algorithms. To address the problems that the quality of base clusterings varies and that low-quality base clusterings degrade ensemble performance, the intrinsic connections of the data were mined from the base clusterings, and a high-order information fusion algorithm, Clustering Ensemble with High-order Consensus learning (HCLCE), was proposed to represent the connections between data from different dimensions. Firstly, each kind of high-order information was fused into a new structured consistency matrix. Then, the obtained consistency matrices were fused together, and the multiple sources of information were finally fused into one consistent result. Experimental results show that HCLCE improves clustering accuracy by an average of 7.22% and Normalized Mutual Information (NMI) by an average of 9.19% compared with the second-best Locally Weighted Evidence Accumulation (LWEA) algorithm. The proposed algorithm thus obtains better clustering results than existing clustering ensemble algorithms and than using a single kind of information alone.
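One way to picture the high-order fusion is through powers of the co-association matrix, where $A^p$ exposes indirect, order-$p$ connections between samples. A minimal sketch, assuming co-association as the base representation and simple averaging as the fusion step, neither of which is claimed to match HCLCE's actual solver:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def coassociation(label_mat):
    # Fraction of base clusterings that put samples i and j in the same cluster.
    n, m = label_mat.shape
    A = np.zeros((n, n))
    for j in range(m):
        A += (label_mat[:, [j]] == label_mat[:, [j]].T)
    return A / m

def high_order_consensus(label_mat, n_clusters, orders=(1, 2, 3)):
    # Fuse several orders of the co-association matrix; higher powers connect
    # samples that only share neighbors, not clusters, in the base results.
    A = coassociation(label_mat)
    fused = np.zeros_like(A)
    for p in orders:
        Ap = np.linalg.matrix_power(A, p)
        fused += Ap / Ap.max()          # normalize each order before fusing
    sc = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    return sc.fit_predict(fused / len(orders))
```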
Unsupervised Feature Selection Algorithm Based on Graph Filtering and Self-representation
Liang, Yunhui, Gan, Jianwen, Chen, Yan, Zhou, Peng, Du, Liang
To address the problem that existing methods cannot fully capture the intrinsic structure of data when the higher-order neighborhood information is ignored, we proposed an unsupervised feature selection algorithm based on graph filtering and self-representation. Firstly, a higher-order graph filter was applied to the data to obtain its smooth representation, and a regularizer was designed to combine the higher-order graph information with the self-representation matrix learning so as to capture the intrinsic structure of the data. Secondly, the $\ell_{2,1}$-norm was used on the reconstruction error term and the feature selection matrix to enhance the robustness and row sparsity of the model and select discriminative features. Finally, an iterative algorithm was applied to solve the proposed objective function effectively, and simulation experiments were carried out to verify the effectiveness of the proposed algorithm.
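A minimal sketch of the two main steps, assuming the filter form $(I - L/2)^t$ and an iteratively reweighted least-squares treatment of the $\ell_{2,1}$ penalty; the paper's regularizer couples the graph information into the self-representation learning more tightly than this:

```python
import numpy as np

def smooth(X, W, t=2):
    # Low-pass graph filtering of the data matrix X (samples x features).
    d = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(len(W)) - (W * d[:, None]) * d[None, :]
    return np.linalg.matrix_power(np.eye(len(W)) - 0.5 * L, t) @ X

def select_features(X, n_selected, alpha=1.0, n_iter=20):
    # Row-sparse self-representation X ~ X S with an l2,1 penalty on S, solved
    # by iteratively reweighted least squares; features are ranked by row norms.
    XtX = X.T @ X
    S = np.eye(X.shape[1])
    for _ in range(n_iter):
        R = np.diag(1.0 / np.maximum(2 * np.linalg.norm(S, axis=1), 1e-8))
        S = np.linalg.solve(XtX + alpha * R, XtX)
    return np.argsort(-np.linalg.norm(S, axis=1))[:n_selected]
```

Here `select_features(smooth(X, W), m)` would select m features from the smoothed representation; the hyperparameters are placeholders.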
Multiple kernel concept factorization algorithm based on global fusion
Li, Fei, Du, Liang, Ren, Chaohong
The Non-negative Matrix Factorization (NMF) algorithm can only find a low-rank approximation of the original non-negative data, while the Concept Factorization (CF) algorithm extends matrix factorization to a single non-linear kernel space, improving the learning ability and adaptability of matrix factorization. To design or select a proper kernel function for a specific dataset in an unsupervised setting, a new algorithm called Globalized Multiple Kernel CF (GMKCF) was proposed. Multiple candidate kernel functions were input at the same time and learned in the CF framework based on global linear fusion, yielding a clustering result of high quality and stability and solving the kernel function selection problem that CF faces. The convergence of the proposed algorithm was verified by solving the model with alternating iteration. Experimental results on several real databases show that the proposed algorithm outperforms comparison algorithms in data clustering, such as Kernel K-Means (KKM), Spectral Clustering (SC), Kernel CF (KCF), Co-regularized multi-view spectral clustering (Coreg), and Robust Multiple KKM (RMKKM).
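The global linear fusion can be sketched as follows, assuming nonnegative candidate kernels (e.g., RBF) and the standard CF multiplicative updates, with an inverse-error kernel re-weighting rule that is an assumption rather than the paper's derived update:

```python
import numpy as np

def gmkcf_sketch(kernels, k, n_iter=100, seed=0):
    # Learn a global linear combination K = sum_r w_r K_r while factorizing
    # K ~ K W V^T with nonnegative multiplicative updates (concept factorization).
    rng = np.random.default_rng(seed)
    n = kernels[0].shape[0]
    w = np.ones(len(kernels)) / len(kernels)
    W, V = rng.random((n, k)), rng.random((n, k))
    for _ in range(n_iter):
        K = sum(wr * Kr for wr, Kr in zip(w, kernels))
        W *= (K @ V) / np.maximum(K @ W @ (V.T @ V), 1e-12)
        V *= (K @ W) / np.maximum(V @ (W.T @ K @ W), 1e-12)
        # Re-weight kernels inversely to reconstruction error (assumed rule).
        errs = np.array([np.linalg.norm(Kr - Kr @ W @ V.T) for Kr in kernels])
        w = 1.0 / np.maximum(errs, 1e-12)
        w /= w.sum()
    return V.argmax(axis=1)             # cluster labels from the encoding matrix
```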
Hierarchical Multiple Kernel K-Means Algorithm Based on Sparse Connectivity
Wang, Lei, Du, Liang, Zhou, Peng
Multiple kernel learning (MKL) aims to find an optimal, consistent kernel function. In hierarchical multiple kernel clustering (HMKC) algorithms, sample features are extracted layer by layer from a high-dimensional space to maximize the retention of effective information, but information interaction between layers is often ignored: only corresponding nodes in adjacent layers exchange information while other nodes remain isolated, and if full connectivity is adopted instead, the diversity of the final consistency matrix is reduced. Therefore, this paper proposes a hierarchical multiple kernel K-Means algorithm based on sparse connectivity (SCHMKKM), which controls the assignment matrix through a sparsity rate to achieve sparse connections, thereby locally fusing the features distilled between layers. Finally, cluster analysis is conducted on multiple datasets and compared with the fully connected hierarchical multiple kernel K-Means (FCHMKKM) algorithm. The experiments show that fusing more discriminative information is beneficial for learning a better consistent partition matrix, and that the sparse-connection fusion strategy outperforms the full-connection strategy.
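The sparse-connectivity idea reduces, at its core, to pruning the layer-to-layer assignment matrix so that each node keeps only its strongest links. A minimal sketch with a top-fraction rule; how SCHMKKM chooses and optimizes the retained links is more involved:

```python
import numpy as np

def sparsify_assignment(A, rate=0.5):
    # Keep only the largest `rate` fraction of connections in each row of the
    # layer-to-layer assignment matrix A; the cut links no longer exchange
    # information, so features are fused locally rather than globally.
    keep = max(1, int(round(rate * A.shape[1])))
    out = np.zeros_like(A)
    idx = np.argsort(-A, axis=1)[:, :keep]
    rows = np.arange(A.shape[0])[:, None]
    out[rows, idx] = A[rows, idx]
    return out
```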
Unsupervised Feature Selection Algorithm Based on Dual Manifold Re-ranking
Liang, Yunhui, Gan, Jianwen, Chen, Yan, Zhou, Peng, Du, Liang
High-dimensional data is commonly encountered in numerous data analysis tasks. Feature selection techniques aim to identify the most representative features from the original high-dimensional data. Due to the absence of class label information, it is significantly more challenging to select appropriate features in unsupervised learning scenarios than in supervised ones. Traditional unsupervised feature selection methods typically score the features based on certain criteria, treating all samples indiscriminately. However, these approaches fail to fully capture the internal structure of the data: different samples should carry different importance, and sample weights and feature weights stand in a dual relationship in which they influence each other. Therefore, an unsupervised feature selection algorithm based on dual manifold re-ranking (DMRR) is proposed in this paper. Different similarity matrices are constructed to depict the manifold structures among samples, among features, and between samples and features, and manifold re-ranking is then performed by combining the initial scores of samples and features. Comparisons of DMRR with three original unsupervised feature selection algorithms and two unsupervised feature selection post-processing algorithms confirm that the importance information of different samples and the dual relationship between samples and features are beneficial for better feature selection.
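A minimal sketch of the coupled propagation behind dual re-ranking, assuming row-stochastic normalization and a linear mixing of the three manifolds; the matrices, damping constants, and convergence handling are illustrative choices:

```python
import numpy as np

def row_normalize(W):
    return W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)

def dual_manifold_rerank(Ws, Wf, B, s0, f0, alpha=0.5, beta=0.3, n_iter=50):
    # Ws: sample-sample similarity, Wf: feature-feature similarity,
    # B: sample-feature relation, s0/f0: initial sample/feature scores.
    # Sample and feature scores reinforce each other through B while each is
    # smoothed on its own manifold (requires alpha + beta < 1).
    Ps, Pf, Pb, Pbt = map(row_normalize, (Ws, Wf, B, B.T))
    s, f = s0.copy(), f0.copy()
    for _ in range(n_iter):
        s = alpha * Ps @ s + beta * Pb @ f + (1 - alpha - beta) * s0
        f = alpha * Pf @ f + beta * Pbt @ s + (1 - alpha - beta) * f0
    return s, f                         # re-ranked importance scores
```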
Unsupervised feature selection algorithm framework based on neighborhood interval disturbance fusion
Lv, Xiaolin, Du, Liang, Zhou, Peng, Wu, Peng
Feature selection is a key technique for data dimensionality reduction. Because collected data samples often lack label information, unsupervised feature selection has attracted increasing attention. However, the universality and stability of many unsupervised feature selection algorithms are low and strongly affected by the structure of the dataset, so many researchers have sought to improve algorithmic stability. This paper preprocesses the dataset and approximates it with an interval method, experimentally examining the advantages and disadvantages of the resulting interval datasets. Treating these datasets from a global perspective, it proposes a new algorithm: unsupervised feature selection based on neighborhood interval disturbance fusion (NIDF), which realizes the joint learning of the final feature scores and the approximate data intervals. Comparisons with the original unsupervised feature selection methods and several existing feature selection frameworks verify the superiority of the proposed model.
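The interval construction and disturbance fusion can be sketched as follows, with the neighborhood radius taken from the per-feature spread of each sample's k nearest neighbors; both the radius rule and the averaging fusion are assumptions for illustration:

```python
import numpy as np

def neighborhood_intervals(X, k=5):
    # Approximate each sample x by an interval [x - r, x + r], where r is the
    # per-feature standard deviation of its k nearest neighbors.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]
    r = X[nn].std(axis=1)
    return X - r, X + r

def disturb_and_fuse(score_fn, lo, hi, n_draws=10, seed=0):
    # Re-score several datasets drawn at random inside the intervals and fuse
    # the resulting feature scores by averaging.
    rng = np.random.default_rng(seed)
    return np.mean([score_fn(rng.uniform(lo, hi)) for _ in range(n_draws)], axis=0)
```

For instance, `disturb_and_fuse(lambda Xp: Xp.var(axis=0), *neighborhood_intervals(X))` fuses a simple variance score over perturbed copies; NIDF instead learns the scores and the intervals jointly.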
Multiple Kernel Clustering via Local Regression Integration
Du, Liang, Ren, Xin, Zhang, Haiying, Zhou, Peng
Existing multiple kernel methods rarely consider the intrinsic manifold structure of multiple kernel data and estimate the consensus kernel matrix with a quadratic number of variables, which makes them vulnerable to noise and outliers within the candidate kernels. This paper first presents a clustering method via kernelized local regression (CKLR), which captures the local structure of kernel data and employs kernel regression on local regions to predict the clustering results. It then extends this method to clustering via multiple kernel local regression (CMKLR). A kernel-level local regression sparse coefficient matrix is constructed for each candidate kernel, which well characterizes the kernel-level manifold structure. All kernel-level local regression coefficients are then aggregated via linear weights into a consensus sparse local regression coefficient, which greatly reduces the number of variables and is more robust against noise and outliers within multiple kernel data. The proposed CMKLR thus avoids both limitations above and contains only one additional hyperparameter for tuning. Extensive experimental results show that the clustering performance of the proposed method on benchmark datasets is better than that of 10 state-of-the-art multiple kernel clustering methods.
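The kernel-level local regression step can be sketched as follows: each sample is regressed on its kernel-space neighbors, and the per-kernel coefficient matrices are linearly combined into a consensus that is clustered spectrally. Uniform weights and the ridge regularizer are stand-ins for the quantities CMKLR actually learns:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def local_regression_coeffs(K, n_neighbors=10, reg=1e-3):
    # Regress each sample on its kernel-space neighbors; the resulting sparse
    # coefficient rows capture the kernel-level local manifold structure.
    n = K.shape[0]
    d = np.diag(K)
    dist = d[:, None] + d[None, :] - 2 * K      # squared feature-space distances
    S = np.zeros((n, n))
    for i in range(n):
        nb = np.argsort(dist[i])[1:n_neighbors + 1]
        G = K[np.ix_(nb, nb)] + reg * np.eye(n_neighbors)
        S[i, nb] = np.linalg.solve(G, K[nb, i])
    return S

def cmklr_sketch(kernels, n_clusters, weights=None):
    # Aggregate per-kernel local regression coefficients with linear weights
    # into one consensus coefficient matrix, then cluster it spectrally.
    if weights is None:
        weights = np.ones(len(kernels)) / len(kernels)
    S = sum(w * local_regression_coeffs(K) for w, K in zip(weights, kernels))
    A = np.abs(S + S.T) / 2
    return SpectralClustering(n_clusters, affinity='precomputed').fit_predict(A)
```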
Symmetric Nonnegative Matrix Factorization Algorithm Based on Self-paced Learning
Wang, Lei, Du, Liang, Zhou, Peng, Wu, Peng
A symmetric nonnegative matrix factorization algorithm based on self-paced learning was proposed to improve the clustering performance of the model. In an error-driven way, it enables the model to better distinguish normal samples from abnormal ones. Each sample was assigned a weight variable measuring its degree of difficulty, and this variable was constrained by both hard-weighting and soft-weighting strategies to ensure the rationality of the model. Cluster analysis was carried out on multiple datasets such as images and texts, and the experimental results showed the effectiveness of the proposed algorithm.
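A minimal sketch of the hard-weighting variant, assuming a nonnegative symmetric affinity A factorized as $A \approx HH^\top$ with damped multiplicative updates; the error measure, the weight schedule, and the soft-weighting counterpart are illustrative choices:

```python
import numpy as np

def self_paced_symnmf(A, k, lam, n_iter=200, seed=0):
    # Symmetric NMF A ~ H H^T with hard self-paced weights: samples whose current
    # reconstruction error exceeds lam are excluded from this round's update, so
    # the model learns easy (normal) samples before hard (abnormal) ones.
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    H = rng.random((n, k))
    for _ in range(n_iter):
        err = np.linalg.norm(A - H @ H.T, axis=1) ** 2 / n
        w = (err <= lam).astype(float)          # hard weighting: include or drop
        Wm = np.outer(w, w)
        num = (Wm * A) @ H
        den = np.maximum((Wm * (H @ H.T)) @ H, 1e-12)
        H *= 0.5 + 0.5 * num / den              # damped multiplicative update
        lam *= 1.05                             # gradually admit harder samples
    return H.argmax(axis=1)
```

Here lam should start on the scale of the typical per-sample error so that most samples enter the factorization within a few rounds.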