Clustering
On the use of local structural properties for improving the efficiency of hierarchical community detection methods
Palacio-Niño, Julio-Omar, Berzal, Fernando
Community detection is a fundamental problem in the analysis of complex networks. It is the analogue of clustering in network data mining. Within community detection methods, hierarchical algorithms are popular. However, their iterative nature and the need to recompute the structural properties used to split the network (i.e. edge betweenness in Girvan and Newman's algorithm), make them unsuitable for large network data sets. In this paper, we study how local structural network properties can be used as proxies to improve the efficiency of hierarchical community detection while, at the same time, achieving competitive results in terms of modularity. In particular, we study the potential use of the structural properties commonly used to perform local link prediction, a supervised learning problem where community structure is relevant, as nodes are prone to establish new links with other nodes within their communities. In addition, we check the performance impact of network pruning heuristics as an ancillary tactic to make hierarchical community detection more efficient
Detecci\'on de comunidades en redes: Algoritmos y aplicaciones
This master's thesis work has the objective of performing an analysis of the methods for detecting communities in networks. As an initial part, I study of the main features of graph theory and communities, as well as common measures in this problem. Subsequently, I was performed a review of the main methods of detecting communities, developing a classification, taking into account its characteristics and computational complexity for the detection of strengths and weaknesses in the methods, as well as later works. Then, study the problem of classification of a clustering method, this in order to evaluate the quality of the communities detected by analyzing different measures. Finally conclusions are elaborated and possible lines of work that can be derived.
10 Machine Learning Algorithms You Need to Know
If you've just started to explore the ways that machine learning can impact your business, the first questions you're likely to come across are what are all of the different types of machine learning algorithms, what are they good for, and which one should I choose for my project? This post will help you answer those questions. There are a few different ways to categorize machine learning algorithms. One way is based on what the training data looks like. Another way to classify algorithms--and one that's more practical from a business perspective--is to categorize them based on how they work and what kinds of problems they can solve, which is what we'll do here.
Zone pAth Construction (ZAC) based Approaches for Effective Real-Time Ridesharing
Lowalekar, Meghna, Varakantham, Pradeep, Jaillet, Patrick
Real-time ridesharing systems such as UberPool, Lyft Line, GrabShare have become hugely popular as they reduce the costs for customers, improve per trip revenue for drivers and reduce traffic on the roads by grouping customers with similar itineraries. The key challenge in these systems is to group the "right" requests to travel together in the "right" available vehicles in real-time, so that the objective (e.g., requests served, revenue or delay) is optimized. This challenge has been addressed in existing work by: (i) generating as many relevant feasible (with respect to the available delay for customers) combinations of requests as possible in real-time; and then (ii) optimizing assignment of the feasible request combinations to vehicles. Since the number of request combinations increases exponentially with the increase in vehicle capacity and number of requests, unfortunately, such approaches have to employ ad hoc heuristics to identify a subset of request combinations for assignment. Our key contribution is in developing approaches that employ zone (abstraction of individual locations) paths instead of request combinations. Zone paths allow for generation of significantly more "relevant" combinations (in comparison to ad hoc heuristics) in real-time than competing approaches due to two reasons: (i) Each zone path can typically represent multiple request combinations; (ii) Zone paths are generated using a combination of offline and online methods. Specifically, we contribute both myopic (ridesharing assignment focussed on current requests only) and non-myopic (ridesharing assignment considers impact on expected future requests) approaches that employ zone paths. In our experimental results, we demonstrate that our myopic approach outperforms (with respect to both objective and runtime) the current best myopic approach for ridesharing on both real-world and synthetic datasets.
Clustering of non-Gaussian data by variational Bayes for normal inverse Gaussian mixture models
Finite mixture models, typically Gaussian mixtures, are well known and widely used as model-based clustering. In practical situations, there are many non-Gaussian data that are heavy-tailed and/or asymmetric. Normal inverse Gaussian (NIG) distributions are normal-variance mean which mixing densities are inverse Gaussian distributions and can be used for both haavy-tail and asymmetry. For NIG mixture models, both expectation-maximization method and variational Bayesian (VB) algorithms have been proposed. However, the existing VB algorithm for NIG mixture have a disadvantage that the shape of the mixing density is limited. In this paper, we propose another VB algorithm for NIG mixture that improves on the shortcomings. We also propose an extension of Dirichlet process mixture models to overcome the difficulty in determining the number of clusters in finite mixture models. We evaluated the performance with artificial data and found that it outperformed Gaussian mixtures and existing implementations for NIG mixtures, especially for highly non-normative data.
Smoothness Sensor: Adaptive Smoothness-Transition Graph Convolutions for Attributed Graph Clustering
Ji, Chaojie, Chen, Hongwei, Wang, Ruxin, Cai, Yunpeng, Wu, Hongyan
Clustering techniques attempt to group objects with similar properties into a cluster. Clustering the nodes of an attributed graph, in which each node is associated with a set of feature attributes, has attracted significant attention. Graph convolutional networks (GCNs) represent an effective approach for integrating the two complementary factors of node attributes and structural information for attributed graph clustering. However, oversmoothing of GCNs produces indistinguishable representations of nodes, such that the nodes in a graph tend to be grouped into fewer clusters, and poses a challenge due to the resulting performance drop. In this study, we propose a smoothness sensor for attributed graph clustering based on adaptive smoothness-transition graph convolutions, which senses the smoothness of a graph and adaptively terminates the current convolution once the smoothness is saturated to prevent oversmoothing. Furthermore, as an alternative to graph-level smoothness, a novel fine-gained node-wise level assessment of smoothness is proposed, in which smoothness is computed in accordance with the neighborhood conditions of a given node at a certain order of graph convolution. In addition, a self-supervision criterion is designed considering both the tightness within clusters and the separation between clusters to guide the whole neural network training process. Experiments show that the proposed methods significantly outperform 12 other state-of-the-art baselines in terms of three different metrics across four benchmark datasets. In addition, an extensive study reveals the reasons for their effectiveness and efficiency.
The use of Recommender Systems in web technology and an in-depth analysis of Cold State problem
Selimi, Denis, Nuci, Krenare Pireva
In the WWW (World Wide Web), dynamic development and spread of data has resulted a tremendous amount of information available on the Internet, yet user is unable to find relevant information in a short span of time. Consequently, a system called recommendation system developed to help users find their infromation with ease through their browsing activities. In other words, recommender systems are tools for interacting with large amount of information that provide personalized view for prioritizing items likely to be of keen for users. They have developed over the years in artificial intelligence techniques that include machine learning and data mining amongst many to mention. Furthermore, the recommendation systems have personalized on an e-commerce, on-line applications such as Amazon.com, Netflix, and Booking.com. As a result, this has inspired many researchers to extend the reach of recommendation systems into new sets of challenges and problem areas that are yet to be truly solved, primarily a problem with the case of making a recommendation to a new user that is called cold-state (i.e. cold-start) user problem where the new user might likely not yield much of information searched. Therfore, the purpose of this paper is to tackle the said cold-start problem with a few effecient methods and challenges, as well as identify and overview the current state of recommendation system as a whole
Finding Stable Groups of Cross-Correlated Features in Multi-View data
Dewaskar, Miheer, Palowitch, John, He, Mark, Love, Michael I., Nobel, Andrew
Multi-view data, in which data of different types are obtained from a common set of samples, is now common in many scientific problems. An important problem in the analysis of multi-view data is identifying interactions between groups of features from different data types. A bimodule is a pair $(A,B)$ of feature sets from two different data types such that the aggregate cross-correlation between the features in $A$ and those in $B$ is large. A bimodule $(A,B)$ is stable if $A$ coincides with the set of features having significant aggregate correlation with the features in $B$, and vice-versa. At the population level, stable bimodules correspond to connected components of the cross-correlation network, which is the bipartite graph whose edges are pairs of features with non-zero cross-correlations. We develop an iterative, testing-based procedure, called BSP, to identify stable bimodules in two moderate- to high-dimensional data sets. BSP relies on permutation-based p-values for sums of squared cross-correlations. We efficiently approximate the p-values using tail probabilities of gamma distributions that are fit using analytical estimates of the permutation moments of the test statistic. Our moment estimates depend on the eigenvalues of the intra-correlation matrices of $A$ and $B$ and as a result, the significance of observed cross-correlations accounts for the correlations within each data type. We carry out a thorough simulation study to assess the performance of BSP, and present an extended application of BSP to the problem of expression quantitative trait loci (eQTL) analysis using recent data from the GTEx project. In addition, we apply BSP to climatology data in order to identify regions in North America where annual temperature variation affects precipitation.
Spectral Clustering with Smooth Tiny Clusters
Wang, Hengrui, Zhang, Yubo, Chen, Mingzhi, Yang, Tong
Spectral clustering is one of the most prominent clustering approaches. The distance-based similarity is the most widely used method for spectral clustering. However, people have already noticed that this is not suitable for multi-scale data, as the distance varies a lot for clusters with different densities. State of the art(ROSC and CAST ) addresses this limitation by taking the reachability similarity of objects into account. However, we observe that in real-world scenarios, data in the same cluster tend to present in a smooth manner, and previous algorithms never take this into account. Based on this observation, we propose a novel clustering algorithm, which con-siders the smoothness of data for the first time. We first divide objects into a great many tiny clusters. Our key idea is to cluster tiny clusters, whose centers constitute smooth graphs. Theoretical analysis and experimental results show that our clustering algorithm significantly outperforms state of the art. Although in this paper, we singly focus on multi-scale situations, the idea of data smoothness can certainly be extended to any clustering algorithms
A Black-box Adversarial Attack for Poisoning Clustering
Cinà, Antonio Emanuele, Torcinovich, Alessandro, Pelillo, Marcello
Clustering algorithms play a fundamental role as tools in decision-making and sensible automation processes. Due to the widespread use of these applications, a robustness analysis of this family of algorithms against adversarial noise has become imperative. To the best of our knowledge, however, only a few works have currently addressed this problem. In an attempt to fill this gap, in this work, we propose a black-box adversarial attack for crafting adversarial samples to test the robustness of clustering algorithms. We formulate the problem as a constrained minimization program, general in its structure and customizable by the attacker according to her capability constraints. We do not assume any information about the internal structure of the victim clustering algorithm, and we allow the attacker to query it as a service only. In the absence of any derivative information, we perform the optimization with a custom approach inspired by the Abstract Genetic Algorithm (AGA). In the experimental part, we demonstrate the sensibility of different single and ensemble clustering algorithms against our crafted adversarial samples on different scenarios. Furthermore, we perform a comparison of our algorithm with a state-of-the-art approach showing that we are able to reach or even outperform its performance. Finally, to highlight the general nature of the generated noise, we show that our attacks are transferable even against supervised algorithms such as SVMs, random forests and neural networks.