Clustering
An efficient density-based clustering algorithm using reverse nearest neighbour
Chowdhury, Stiphen, de Amorim, Renato Cordeiro
Density-based clustering is the task of discovering high-density regions of entities (clusters) that are separated from each other by contiguous regions of low-density. DBSCAN is, arguably, the most popular density-based clustering algorithm. However, its cluster recovery capabilities depend on the combination of the two parameters. In this paper we present a new density-based clustering algorithm which uses reverse nearest neighbour (RNN) and has a single parameter. We also show that it is possible to estimate a good value for this parameter using a clustering validity index. The RNN queries enable our algorithm to estimate densities taking more than a single entity into account, and to recover clusters that are not well-separated or have different densities. Our experiments on synthetic and real-world data sets show our proposed algorithm outperforms DBSCAN and its recent variant ISDBSCAN.
On Human Robot Interaction using Multiple Modes
Humanoid robots have apparently similar body structure like human beings. Due to their technical design, they are sharing the same workspace with humans. They are placed to clean things, to assist old age people, to entertain us and most importantly to serve us. To be acceptable in the household, they must have higher level of intelligence than industrial robots and they must be social and capable of interacting people around it, who are not supposed to be robot specialist. All these come under the field of human robot interaction (HRI). There are various modes like speech, gesture, behavior etc. through which human can interact with robots. To solve all these challenges, a multimodel technique has been introduced where gesture as well as speech is used as a mode of interaction.
Exact Recovery in the Hypergraph Stochastic Block Model: a Spectral Algorithm
We consider the exact recovery problem in the hypergraph stochastic block model (HSBM) with $k$ blocks of equal size. More precisely, we consider a random $d$-uniform hypergraph $H$ with $n$ vertices partitioned into $k$ clusters of size $s = n / k$. Hyperedges $e$ are added independently with probability $p$ if $e$ is contained within a single cluster and $q$ otherwise, where $0 \leq q < p \leq 1$. We present a spectral algorithm which recovers the clusters exactly with high probability, given mild conditions on $n, k, p, q$, and $d$. Our algorithm is based on the adjacency matrix of $H$, which we define to be the symmetric $n \times n$ matrix whose $(u, v)$-th entry is the number of hyperedges containing both $u$ and $v$. To the best of our knowledge, our algorithm is the first to guarantee exact recovery when the number of clusters $k=\Theta(\sqrt{n})$.
Subspace Clustering through Sub-Clusters
Li, Weiwei, Hannig, Jan, Mukherjee, Sayan
The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via first spectral clustering of a small random sample followed by classifying or labeling the remaining out of sample points. The key idea is that this random subset borrows information across the entire data set and that the problem of clustering points can be replaced with the more efficient and robust problem of "clustering sub-clusters". We provide theoretical guarantees for our procedure. The numerical results indicate we outperform other state-of-the-art subspace clustering algorithms with respect to accuracy and speed.
Large-scale Interactive Recommendation with Tree-structured Policy Gradient
Chen, Haokun, Dai, Xinyi, Cai, Han, Zhang, Weinan, Wang, Xuejian, Tang, Ruiming, Zhang, Yuzhou, Yu, Yong
Reinforcement learning (RL) has recently been introduced to interactive recommender systems (IRS) because of its nature of learning from dynamic interactions and planning for long-run performance. As IRS is always with thousands of items to recommend (i.e., thousands of actions), most existing RL-based methods, however, fail to handle such a large discrete action space problem and thus become inefficient. The existing work that tries to deal with the large discrete action space problem by utilizing the deep deterministic policy gradient framework suffers from the inconsistency between the continuous action representation (the output of the actor network) and the real discrete action. To avoid such inconsistency and achieve high efficiency and recommendation effectiveness, in this paper, we propose a Tree-structured Policy Gradient Recommendation (TPGR) framework, where a balanced hierarchical clustering tree is built over the items and picking an item is formulated as seeking a path from the root to a certain leaf of the tree. Extensive experiments on carefully-designed environments based on two real-world datasets demonstrate that our model provides superior recommendation performance and significant efficiency improvement over state-of-the-art methods.
A Survey of Mixed Data Clustering Algorithms
Most of the datasets normally contain either numeric or categorical features. Mixed data comprises of both numeric and categorical features, and they frequently occur in various domains, such as health, finance, marketing, etc. Clustering is often sought on mixed data to find structures and to group similar objects. However, clustering mixed data is challenging because it is difficult to directly apply mathematical operations, such as summation, average etc. on the feature values of these datasets. In this paper, we review various types of mixed data clustering techniques in detail. We present a taxonomy to identify ten types of different mixed data clustering techniques. We also compare the performance of several mixed data clustering methods on publicly available datasets. The paper further identifies challenges in developing different mixed data clustering algorithms and provides guidelines for future directions in this area.
Implementation of Fuzzy C-Means and Possibilistic C-Means Clustering Algorithms, Cluster Tendency Analysis and Cluster Validation
Siddique, Md. Abu Bakr, Arif, Rezoana Bente, Khan, Mohammad Mahmudur Rahman, Ashrafi, Zahidun
Abstract-- In this paper, several two-dimensional clustering scenarios are given. In those scenarios, soft partitioning clustering algorithms (Fuzzy C-means (FCM) and Possibilistic c-means (PCM)) are applied. Afterward, VAT is used to investigate the clustering tendency visually, and then in order of checking cluster validation, three types of indices (e.g., PC, DI, and DBI) were used. After observing the clustering algorithms, it was evident that each of them has its limitations; however, PCM is more robust to noise than FCM as in case of FCM a noise point has to be considered as a member of any of the cluster. The clustering [1-3] is a subfield of data mining technique and it is very effective to pick out useful information from dataset.
Cluster analysis of homicide rates in the Brazilian state of Goias from 2002 to 2014
Sousa, Samuel bruno da Silva, Del-Fiaco, Ronaldo de Castro, Berton, Lilian
Homicide mortality is a worldwide concern and has occupied the agenda of researchers and public managers. In Brazil, homicide is the third leading cause of death in the general population and the first in the 15-39 age group. In South America, Brazil has the third highest homicide mortality, behind Venezuela and Colombia. To measure the impacts of violence it is important to assess health systems and criminal justice, as well as other areas. In this paper, we analyze the spatial distribution of homicide mortality in the state of Goias, Center-West of Brazil, since the homicide rate increased from 24.5 per 100,000 in 2002 to 42.6 per 100,000 in 2014 in this location. Moreover, this state had the fifth position of homicides in Brazil in 2014. We considered socio-demographic variables for the state, performed analysis about correlation and employed three clustering algorithms: K-means, Density-based and Hierarchical. The results indicate the homicide rates are higher in cities neighbors of large urban centers, although these cities have the best socioeconomic indicators.
Exploiting Capacity of Sewer System Using Unsupervised Learning Algorithms Combined with Dimensionality Reduction
Zhang, Duo, Lindholm, Geir, Martinez, Nicolas, Ratnaweera, Harsha
Exploiting capacity of sewer system using decentralized control is a cost effective mean of minimizing the overflow. Given the size of the real sewer system, exploiting all the installed control structures in the sewer pipes can be challenging. This paper presents a divide and conquer solution to implement decentralized control measures based on unsupervised learning algorithms. A sewer system is first divided into a number of subcatchments. A series of natural and built factors that have the impact on sewer system performance is then collected. Clustering algorithms are then applied to grouping subcatchments with similar hydraulic hydrologic characteristics. Following which, principal component analysis is performed to interpret the main features of sub-catchment groups and identify priority control locations. Overflows under different control scenarios are compared based on the hydraulic model. Simulation results indicate that priority control applied to the most suitable cluster could bring the most profitable result.
CAPTAIN: Comprehensive Composition Assistance for Photo Taking
Farhat, Farshid, Kamani, Mohammad Mahdi, Wang, James Z.
Many people are interested in taking astonishing photos and sharing with others. Emerging hightech hardware and software facilitate ubiquitousness and functionality of digital photography. Because composition matters in photography, researchers have leveraged some common composition techniques to assess the aesthetic quality of photos computationally. However, composition techniques developed by professionals are far more diverse than well-documented techniques can cover. We leverage the vast underexplored innovations in photography for computational composition assistance. We propose a comprehensive framework, named CAPTAIN (Composition Assistance for Photo Taking), containing integrated deep-learned semantic detectors, sub-genre categorization, artistic pose clustering, personalized aesthetics-based image retrieval, and style set matching. The framework is backed by a large dataset crawled from a photo-sharing Website with mostly photography enthusiasts and professionals. The work proposes a sequence of steps that have not been explored in the past by researchers. The work addresses personal preferences for composition through presenting a ranked-list of photographs to the user based on user-specified weights in the similarity measure. The matching algorithm recognizes the best shot among a sequence of shots with respect to the user's preferred style set. We have conducted a number of experiments on the newly proposed components and reported findings. A user study demonstrates that the work is useful to those taking photos.