Goto

Collaborating Authors

 Clustering


Rethinking the positive role of cluster structure in complex networks for link prediction tasks

arXiv.org Artificial Intelligence

Clustering is a fundamental problem in network analysis that finds closely connected groups of nodes and separates them from other nodes in the graph, while link prediction is to predict whether two nodes in a network are likely to have a link. The definition of both naturally determines that clustering must play a positive role in obtaining accurate link prediction tasks. Yet researchers have long ignored or used inappropriate ways to undermine this positive relationship. In this article, We construct a simple but efficient clustering-driven link prediction framework(ClusterLP), with the goal of directly exploiting the cluster structures to obtain connections between nodes as accurately as possible in both undirected graphs and directed graphs. Specifically, we propose that it is easier to establish links between nodes with similar representation vectors and cluster tendencies in undirected graphs, while nodes in a directed graphs can more easily point to nodes similar to their representation vectors and have greater influence in their own cluster. We customized the implementation of ClusterLP for undirected and directed graphs, respectively, and the experimental results using multiple real-world networks on the link prediction task showed that our models is highly competitive with existing baseline models. The code implementation of ClusterLP and baselines we use are available at https://github.com/ZINUX1998/ClusterLP.


Dual Information Enhanced Multi-view Attributed Graph Clustering

arXiv.org Artificial Intelligence

Multi-view attributed graph clustering is an important approach to partition multi-view data based on the attribute feature and adjacent matrices from different views. Some attempts have been made in utilizing Graph Neural Network (GNN), which have achieved promising clustering performance. Despite this, few of them pay attention to the inherent specific information embedded in multiple views. Meanwhile, they are incapable of recovering the latent high-level representation from the low-level ones, greatly limiting the downstream clustering performance. To fill these gaps, a novel Dual Information enhanced multi-view Attributed Graph Clustering (DIAGC) method is proposed in this paper. Specifically, the proposed method introduces the Specific Information Reconstruction (SIR) module to disentangle the explorations of the consensus and specific information from multiple views, which enables GCN to capture the more essential low-level representations. Besides, the Mutual Information Maximization (MIM) module maximizes the agreement between the latent high-level representation and low-level ones, and enables the high-level representation to satisfy the desired clustering structure with the help of the Self-supervised Clustering (SC) module. Extensive experiments on several real-world benchmarks demonstrate the effectiveness of the proposed DIAGC method compared with the state-of-the-art baselines.


Simplifying Clustering with Graph Neural Networks

arXiv.org Artificial Intelligence

The objective functions used in spectral clustering are usually composed of two terms: i) a term that minimizes the local quadratic variation of the cluster assignments on the graph and; ii) a term that balances the clustering partition and helps avoiding degenerate solutions. This paper shows that a graph neural network, equipped with suitable message passing layers, can generate good cluster assignments by optimizing only a balancing term. Results on attributed graph datasets show the effectiveness of the proposed approach in terms of clustering performance and computation time.


Reducing Computational Complexity of Neural Networks in Optical Channel Equalization: From Concepts to Implementation

arXiv.org Artificial Intelligence

In this paper, a new methodology is proposed that allows for the low-complexity development of neural network (NN) based equalizers for the mitigation of impairments in high-speed coherent optical transmission systems. In this work, we provide a comprehensive description and comparison of various deep model compression approaches that have been applied to feed-forward and recurrent NN designs. Additionally, we evaluate the influence these strategies have on the performance of each NN equalizer. Quantization, weight clustering, pruning, and other cutting-edge strategies for model compression are taken into consideration. In this work, we propose and evaluate a Bayesian optimization-assisted compression, in which the hyperparameters of the compression are chosen to simultaneously reduce complexity and improve performance. In conclusion, the trade-off between the complexity of each compression approach and its performance is evaluated by utilizing both simulated and experimental data in order to complete the analysis. By utilizing optimal compression approaches, we show that it is possible to design an NN-based equalizer that is simpler to implement and has better performance than the conventional digital back-propagation (DBP) equalizer with only one step per span. This is accomplished by reducing the number of multipliers used in the NN equalizer after applying the weighted clustering and pruning algorithms. Furthermore, we demonstrate that an equalizer based on NN can also achieve superior performance while still maintaining the same degree of complexity as the full electronic chromatic dispersion compensation block. We conclude our analysis by highlighting open questions and existing challenges, as well as possible future research directions.


Unsupervised Semantic Analysis of a Region from Satellite Image Time Series

arXiv.org Artificial Intelligence

Temporal sequences of satellite images constitute a highly valuable and abundant resource to analyze a given region. However, the labeled data needed to train most machine learning models are scarce and difficult to obtain. In this context, the current work investigates a fully unsupervised methodology that, given a sequence of images, learns a semantic embedding and then, creates a partition of the ground according to its semantic properties and its evolution over time. We illustrate the methodology by conducting the semantic analysis of a sequence of satellite images of a region of Navarre (Spain). The proposed approach reveals a novel broad perspective of the land, where potentially large areas that share both a similar semantic and a similar temporal evolution are connected in a compact and well-structured manner. The results also show a close relationship between the allocation of the clusters in the geographic space and their allocation in the embedded spaces. The semantic analysis is completed by obtaining the representative sequence of tiles corresponding to each cluster, the linear interpolation between related areas, and a graph that shows the relationships between the clusters, providing a concise semantic summary of the whole region.


Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network Model

arXiv.org Artificial Intelligence

The SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans. Like many coronaviruses, it can adapt to different hosts and evolve into different lineages. It is well-known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein. Understanding the spike protein structure and how it can be perturbed is vital for understanding and determining if a lineage is of concern. These are crucial to identifying and controlling current outbreaks and preventing future pandemics. Machine learning (ML) methods are a viable solution to this effort, given the volume of available sequencing data, much of which is unaligned or even unassembled. However, such ML methods require fixed-length numerical feature vectors in Euclidean space to be applicable. Similarly, euclidean space is not considered the best choice when working with the classification and clustering tasks for biological sequences. For this purpose, we design a method that converts the protein (spike) sequences into the sequence similarity network (SSN). We can then use SSN as an input for the classical algorithms from the graph mining domain for the typical tasks such as classification and clustering to understand the data. We show that the proposed alignment-free method is able to outperform the current SOTA method in terms of clustering results. Similarly, we are able to achieve higher classification accuracy using well-known Node2Vec-based embedding compared to other baseline embedding approaches.


Scalable and Effective Conductance-based Graph Clustering

arXiv.org Artificial Intelligence

Conductance-based graph clustering has been recognized as a fundamental operator in numerous graph analysis applications. Despite the significant success of conductance-based graph clustering, existing algorithms are either hard to obtain satisfactory clustering qualities, or have high time and space complexity to achieve provable clustering qualities. To overcome these limitations, we devise a powerful \textit{peeling}-based graph clustering framework \textit{PCon}. We show that many existing solutions can be reduced to our framework. Namely, they first define a score function for each vertex, then iteratively remove the vertex with the smallest score. Finally, they output the result with the smallest conductance during the peeling process. Based on our framework, we propose two novel algorithms \textit{PCon\_core} and \emph{PCon\_de} with linear time and space complexity, which can efficiently and effectively identify clusters from massive graphs with more than a few billion edges. Surprisingly, we prove that \emph{PCon\_de} can identify clusters with near-constant approximation ratio, resulting in an important theoretical improvement over the well-known quadratic Cheeger bound. Empirical results on real-life and synthetic datasets show that our algorithms can achieve 5$\sim$42 times speedup with a high clustering accuracy, while using 1.4$\sim$7.8 times less memory than the baseline algorithms.


Linearization and Identification of Multiple-Attractor Dynamical Systems through Laplacian Eigenmaps

arXiv.org Artificial Intelligence

Dynamical Systems (DS) are fundamental to the modeling and understanding time evolving phenomena, and have application in physics, biology and control. As determining an analytical description of the dynamics is often difficult, data-driven approaches are preferred for identifying and controlling nonlinear DS with multiple equilibrium points. Identification of such DS has been treated largely as a supervised learning problem. Instead, we focus on an unsupervised learning scenario where we know neither the number nor the type of dynamics. We propose a Graph-based spectral clustering method that takes advantage of a velocity-augmented kernel to connect data points belonging to the same dynamics, while preserving the natural temporal evolution. We study the eigenvectors and eigenvalues of the Graph Laplacian and show that they form a set of orthogonal embedding spaces, one for each sub-dynamics. We prove that there always exist a set of 2-dimensional embedding spaces in which the sub-dynamics are linear and n-dimensional embedding spaces where they are quasi-linear. We compare the clustering performance of our algorithm to Kernel K-Means, Spectral Clustering and Gaussian Mixtures and show that, even when these algorithms are provided with the correct number of sub-dynamics, they fail to cluster them correctly. We learn a diffeomorphism from the Laplacian embedding space to the original space and show that the Laplacian embedding leads to good reconstruction accuracy and a faster training time through an exponential decaying loss compared to the state-of-the-art diffeomorphism-based approaches.


Big Earth Data and Machine Learning for Sustainable and Resilient Agriculture

arXiv.org Artificial Intelligence

Big streams of Earth images from satellites or other platforms (e.g., drones and mobile phones) are becoming increasingly available at low or no cost and with enhanced spatial and temporal resolution. This thesis recognizes the unprecedented opportunities offered by the high quality and open access Earth observation data of our times and introduces novel machine learning and big data methods to properly exploit them towards developing applications for sustainable and resilient agriculture. The thesis addresses three distinct thematic areas, i.e., the monitoring of the Common Agricultural Policy (CAP), the monitoring of food security and applications for smart and resilient agriculture. The methodological innovations of the developments related to the three thematic areas address the following issues: i) the processing of big Earth Observation (EO) data, ii) the scarcity of annotated data for machine learning model training and iii) the gap between machine learning outputs and actionable advice. This thesis demonstrated how big data technologies such as data cubes, distributed learning, linked open data and semantic enrichment can be used to exploit the data deluge and extract knowledge to address real user needs. Furthermore, this thesis argues for the importance of semi-supervised and unsupervised machine learning models that circumvent the ever-present challenge of scarce annotations and thus allow for model generalization in space and time. Specifically, it is shown how merely few ground truth data are needed to generate high quality crop type maps and crop phenology estimations. Finally, this thesis argues there is considerable distance in value between model inferences and decision making in real-world scenarios and thereby showcases the power of causal and interpretable machine learning in bridging this gap.


Hierarchical Clustering in Machine Learning - Javatpoint

#artificialintelligence

Hierarchical Clustering in Machine Learning with Machine Learning Tutorial, Machine Learning Introduction, What is Machine Learning, Data Machine Learning, Machine Learning vs Artificial Intelligence etc.