Goto

Collaborating Authors

 Clustering


Region2Vec: Community Detection on Spatial Networks Using Graph Embedding with Node Attributes and Spatial Interactions

arXiv.org Artificial Intelligence

Community Detection algorithms are used to detect densely connected components in complex networks and reveal underlying relationships among components. As a special type of networks, spatial networks are usually generated by the connections among geographic regions. Identifying the spatial network communities can help reveal the spatial interaction patterns, understand the hidden regional structures and support regional development decision-making. Given the recent development of Graph Convolutional Networks (GCN) and its powerful performance in identifying multi-scale spatial interactions, we proposed an unsupervised GCN-based community detection method "region2vec" on spatial networks. Our method first generates node embeddings for regions that share common attributes and have intense spatial interactions, and then applies clustering algorithms to detect communities based on their embedding similarity and spatial adjacency. Experimental results show that while existing methods trade off either attribute similarities or spatial interactions for one another, "region2vec" maintains a great balance between both and performs the best when one wants to maximize both attribute similarities and spatial interactions within communities.


Modeling and Mining Multi-Aspect Graphs With Scalable Streaming Tensor Decomposition

arXiv.org Artificial Intelligence

Graphs emerge in almost every real-world application domain, ranging from online social networks all the way to health data and movie viewership patterns. Typically, such real-world graphs are big and dynamic, in the sense that they evolve over time. Furthermore, graphs usually contain multi-aspect information i.e. in a social network, we can have the "means of communication" between nodes, such as who messages whom, who calls whom, and who comments on whose timeline and so on. How can we model and mine useful patterns, such as communities of nodes in that graph, from such multi-aspect graphs? How can we identify dynamic patterns in those graphs, and how can we deal with streaming data, when the volume of data to be processed is very large? In order to answer those questions, in this thesis, we propose novel tensor-based methods for mining static and dynamic multi-aspect graphs. In general, a tensor is a higher-order generalization of a matrix that can represent high-dimensional multi-aspect data such as time-evolving networks, collaboration networks, and spatio-temporal data like Electroencephalography (EEG) brain measurements. The thesis is organized in two synergistic thrusts: First, we focus on static multi-aspect graphs, where the goal is to identify coherent communities and patterns between nodes by leveraging the tensor structure in the data. Second, as our graphs evolve dynamically, we focus on handling such streaming updates in the data without having to re-compute the decomposition, but incrementally update the existing results.


An Instance Selection Algorithm for Big Data in High imbalanced datasets based on LSH

arXiv.org Artificial Intelligence

Training of Machine Learning (ML) models in real contexts often deals with big data sets and high-class imbalance samples where the class of interest is unrepresented (minority class). Practical solutions using classical ML models address the problem of large data sets using parallel/distributed implementations of training algorithms, approximate model-based solutions, or applying instance selection (IS) algorithms to eliminate redundant information. However, the combined problem of big and high imbalanced datasets has been less addressed. This work proposes three new methods for IS to be able to deal with large and imbalanced data sets. The proposed methods use Locality Sensitive Hashing (LSH) as a base clustering technique, and then three different sampling methods are applied on top of the clusters (or buckets) generated by LSH. The algorithms were developed in the Apache Spark framework, guaranteeing their scalability. The experiments carried out in three different datasets suggest that the proposed IS methods can improve the performance of a base ML model between 5% and 19% in terms of the geometric mean.


Deep Clustering: A Comprehensive Survey

arXiv.org Artificial Intelligence

Cluster analysis plays an indispensable role in machine learning and data mining. Learning a good data representation is crucial for clustering algorithms. Recently, deep clustering, which can learn clustering-friendly representations using deep neural networks, has been broadly applied in a wide range of clustering tasks. Existing surveys for deep clustering mainly focus on the single-view fields and the network architectures, ignoring the complex application scenarios of clustering. To address this issue, in this paper we provide a comprehensive survey for deep clustering in views of data sources. With different data sources and initial conditions, we systematically distinguish the clustering methods in terms of methodology, prior knowledge, and architecture. Concretely, deep clustering methods are introduced according to four categories, i.e., traditional single-view deep clustering, semi-supervised deep clustering, deep multi-view clustering, and deep transfer clustering. Finally, we discuss the open challenges and potential future opportunities in different fields of deep clustering.


Simplex Clustering via sBeta with Applications to Online Adjustment of Black-Box Predictions

arXiv.org Artificial Intelligence

We explore clustering the softmax predictions of deep neural networks and introduce a novel probabilistic clustering method, referred to as k-sBetas. In the general context of clustering discrete distributions, the existing methods focused on exploring distortion measures tailored to simplex data, such as the KL divergence, as alternatives to the standard Euclidean distance. We provide a general maximum a posteriori (MAP) perspective of clustering distributions, which emphasizes that the statistical models underlying the existing distortion-based methods may not be descriptive enough. Instead, we optimize a mixed-variable objective measuring the conformity of data within each cluster to the introduced sBeta density function, whose parameters are constrained and estimated jointly with binary assignment variables. Our versatile formulation approximates a variety of parametric densities for modeling simplex data, and enables to control the cluster-balance bias. This yields highly competitive performances for unsupervised adjustments of black-box model predictions in a variety of scenarios. Our code and comparisons with the existing simplex-clustering approaches along with our introduced softmax-prediction benchmarks are publicly available: https://github.com/fchiaroni/Clustering_Softmax_Predictions.


Unsupervised Behaviour Analysis of News Consumption in Turkish Media

arXiv.org Artificial Intelligence

Clickstream data, which come with a massive volume generated by human activities on websites, have become a prominent feature for identifying readers' characteristics by newsrooms after the digitization of news outlets. Although the nature of clickstream data has a similar logic within websites, it has inherent limitations in recognizing human behaviours when looking from a broad perspective, which brings the need to limit the problem in niche areas. This study investigates the anonymized readers' click activities on the organizations' websites to identify news consumption patterns following referrals from Twitter,who incidentally reach but propensity is mainly routed news content. Methodologies for ensemble cluster analysis with mixed-type embedding strategies are applied and compared to find similar reader groups and interests independent of time. Various internal validation perspectives are used to determine the optimality of the quality of clusters, where the Calinski Harabasz Index (CHI) is found to give a generalizable result. Our findings demonstrate that clustering a mixed-type dataset approaches the optimal internal validation scores, which we define to discriminate the clusters and algorithms considering applied strategies when embedded by Uniform Manifold Approximation and Projection (UMAP) and using a consensus function as a key to access the most applicable hyperparameter configurations in the given ensemble rather than using consensus function results directly. Evaluation of the resulting clusters highlights specific clusters repeatedly present in the separated monthly samples by Adjusted Mutual Information scores greater than 0.5, which provide insights to the news organizations and overcome the degradation of the modeling behaviours due to the change in the interest over time.


HAL-X: A Novel Clustering Algorithm for Rapid Single-Cell Data Analysis for Drug Discovery

#artificialintelligence

The novel clustering algorithm HAL-x provides insights into single-cell clustering, bridging the biomedicine sciences for healthcare purposes. With its specificity and sensitivity, it allows the generation of multiple-clustering for in-depth analysis of high-dimensional data. The future directions of algorithms are extremely promising in understanding disease, disease progression, and drug discovery. Throughout the years, clustering has been studied both extensively and intensively in statistics and machine learning. Apart from classical methods, clustering for high-dimensional data is primarily done using non-linear embeddings, which largely lacked visualization and identification of accurate clustering.


Elastic Step DQN: A novel multi-step algorithm to alleviate overestimation in Deep QNetworks

arXiv.org Artificial Intelligence

Deep Q-Networks algorithm (DQN) was the first reinforcement learning algorithm using deep neural network to successfully surpass human level performance in a number of Atari learning environments. However, divergent and unstable behaviour have been long standing issues in DQNs. The unstable behaviour is often characterised by overestimation in the $Q$-values, commonly referred to as the overestimation bias. To address the overestimation bias and the divergent behaviour, a number of heuristic extensions have been proposed. Notably, multi-step updates have been shown to drastically reduce unstable behaviour while improving agent's training performance. However, agents are often highly sensitive to the selection of the multi-step update horizon ($n$), and our empirical experiments show that a poorly chosen static value for $n$ can in many cases lead to worse performance than single-step DQN. Inspired by the success of $n$-step DQN and the effects that multi-step updates have on overestimation bias, this paper proposes a new algorithm that we call `Elastic Step DQN' (ES-DQN). It dynamically varies the step size horizon in multi-step updates based on the similarity of states visited. Our empirical evaluation shows that ES-DQN out-performs $n$-step with fixed $n$ updates, Double DQN and Average DQN in several OpenAI Gym environments while at the same time alleviating the overestimation bias.


Data-driven Approach to Differentiating between Depression and Dementia from Noisy Speech and Language Data

arXiv.org Artificial Intelligence

A significant number of studies apply acoustic and linguistic characteristics of human speech as prominent markers of dementia and depression. However, studies on discriminating depression from dementia are rare. Co-morbid depression is frequent in dementia and these clinical conditions share many overlapping symptoms, but the ability to distinguish between depression and dementia is essential as depression is often curable. In this work, we investigate the ability of clustering approaches in distinguishing between depression and dementia from human speech. We introduce a novel aggregated dataset, which combines narrative speech data from multiple conditions, i.e., Alzheimer's disease, mild cognitive impairment, healthy control, and depression. We compare linear and non-linear clustering approaches and show that non-linear clustering techniques distinguish better between distinct disease clusters. Our interpretability analysis shows that the main differentiating symptoms between dementia and depression are acoustic abnormality, repetitiveness (or circularity) of speech, word finding difficulty, coherence impairment, and differences in lexical complexity and richness.


Reservoir Computing Approach for Gray Images Segmentation

arXiv.org Artificial Intelligence

The paper proposes a novel approach for gray scale images segmentation. It is based on multiple features extraction from single feature per image pixel, namely its intensity value, using Echo state network. The newly extracted features - reservoir equilibrium states - reveal hidden image characteristics that improve its segmentation via a clustering algorithm. Moreover, it was demonstrated that the intrinsic plasticity tuning of reservoir fits its equilibrium states to the original image intensity distribution thus allowing for its better segmentation. The proposed approach is tested on the benchmark image Lena.