Unsupervised Learning via Network-Aware Embeddings

Damstrup, Anne Sophie Riis; Madsen, Sofie Tosti; Coscia, Michele

arXiv.org Artificial Intelligence 

Data clustering, the task of grouping observations according to their similarity, is a key component of unsupervised learning, with real-world applications in diverse fields such as biology, medicine, and social science. In these fields the data often comes with complex interdependencies between the dimensions of analysis: for instance, the various characteristics and opinions people hold live on a complex social network. Current clustering methods are ill-suited to tackle this complexity: deep learning can approximate these dependencies, but it cannot take their explicit map as the input of the analysis. In this paper, we aim to fix this blind spot in the unsupervised learning literature. We create network-aware embeddings by estimating the network distance between numeric node attributes via the generalized Euclidean distance. Unlike all methods in the literature that we know of, we do not cluster the nodes of the network, but rather its node attributes. In our experiments we show that having these network embeddings is always beneficial for the learning task; that our method scales to large networks; and that we can provide actionable insights in applications in a variety of fields such as marketing, economics, and political science. Our method is fully open source, and data and code are available to reproduce all results in the paper.

Finding patterns in unlabeled data, a task known as unsupervised learning, is useful when we need to build understanding from data Hastie et al. (2009). Unsupervised learning includes data clustering: grouping observations into clusters according to some criterion represented by a quality or loss function Gan et al. (2020). Applications range from grouping genes with related expression patterns in biology Ranade et al. (2001), to finding patterns in tissue images in medicine Filipovych et al. (2011), to segmenting customers for marketing purposes.
Popular data clustering algorithms include DBSCAN Ester et al. (1996), OPTICS Ankerst et al. (1999), k-Means, and more. Modern data clustering approaches rely on deep learning, specifically deep neural networks Aljalbout et al. (2018); Aggarwal et al. (2018); Pang et al. (2021); Ezugwu et al. (2022), or on denoising with autoencoders Nawaz et al. (2022); Cai et al. (2022). However, these approaches work in (deformations of) Euclidean spaces, where dependencies between the dimensions of the analysis can be learned Mahalanobis (1936); Xie et al. (2016), whereas the problem to be tackled here is fundamentally non-Euclidean Bronstein et al. (2017). Graph Neural Networks (GNN) Scarselli et al. (2008); Wu et al. (2022); Zhou et al. (2020a) work in non-Euclidean settings, and they are the focus of this paper.
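
To make the notion of a network distance between numeric node attributes more concrete, the following is a minimal sketch rather than the authors' implementation. The text names the generalized Euclidean distance but does not spell out its formula; the sketch assumes the commonly used form sqrt((a - b)^T L^+ (a - b)), where L^+ is the pseudoinverse of the graph Laplacian. The function name `generalized_euclidean` and the toy path graph are hypothetical and chosen only for illustration.

```python
import numpy as np


def generalized_euclidean(adjacency, a, b):
    """Network-aware distance between two node-attribute vectors.

    Assumed form (not given in the text): sqrt((a - b)^T L^+ (a - b)),
    with L^+ the Moore-Penrose pseudoinverse of the graph Laplacian.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    # Graph Laplacian L = D - A, where D is the diagonal degree matrix.
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # L is singular, so we use the pseudoinverse instead of the inverse.
    laplacian_pinv = np.linalg.pinv(laplacian)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    # Clamp tiny negative values caused by floating-point error.
    squared = max(float(diff @ laplacian_pinv @ diff), 0.0)
    return float(np.sqrt(squared))


# Toy network: a path graph 0 - 1 - 2 - 3.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

# Three attributes, each concentrated on a single node.
attr_a = [1, 0, 0, 0]  # lives on node 0
attr_b = [0, 1, 0, 0]  # lives on node 1, adjacent to node 0
attr_c = [0, 0, 0, 1]  # lives on node 3, three hops from node 0

d_near = generalized_euclidean(A, attr_a, attr_b)
d_far = generalized_euclidean(A, attr_a, attr_c)
print(d_near, d_far)
```

In this toy case the ordinary Euclidean distance is identical for both pairs of attributes, since each pair differs in exactly two entries. Under the assumed formula, the attribute located three hops away ends up farther than the one located on an adjacent node, which is the kind of network information a clustering step over node attributes could then use.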
