 Clustering


Application of Clustering Analysis for Investigation of Food Accessibility

arXiv.org Machine Learning

Access to food assistance programs such as food pantries and food banks needs attention in order to mitigate food insecurity. Access to these programs depends on the demographics of the population and the geography of the location. It is therefore imperative to define and identify food assistance deserts (under-served areas) within a given region and to find ways to improve the accessibility of food. Food banks, which supply food to the agencies serving the public, can manage their resources more efficiently by targeting food assistance deserts and increasing the food supply in those regions. This paper examines the characteristics and structure of the food assistance network in Ohio, presents possible reasons for food insecurity in the region, and identifies areas where food agencies are needed or may not be needed. The Gaussian Mixture Model (GMM) clustering technique is employed to identify these possible reasons and to address the problem of food accessibility.
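As a rough illustration of the GMM clustering technique named in the abstract above, the following is a minimal expectation-maximization sketch for a one-dimensional Gaussian mixture. The function names, the initial means, and the toy data (imagined distances to the nearest food agency) are assumptions made for this example; none of them come from the paper, which does not describe its features or implementation here.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, initial_means, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a degenerate component
    return weights, means, variances, resp


# Hypothetical toy data: distance (in miles) from six areas to the nearest agency.
distances = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
weights, means, variances, resp = fit_gmm_1d(distances, [0.0, 6.0])
labels = [max(range(len(means)), key=lambda j: r[j]) for r in resp]
print(means, labels)
```

In this toy setting the component with the larger mean would correspond to the less well-served areas; a real analysis would use multivariate demographic and geographic features and a library implementation.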


Scalable Deep Unsupervised Clustering with Concrete GMVAEs

arXiv.org Machine Learning

Discrete random variables are natural components of probabilistic clustering models. A number of VAE variants with discrete latent variables have been developed. Training such methods requires marginalizing over the discrete latent variables, causing training time complexity to be linear in the number of clusters. By applying a continuous relaxation to the discrete variables in these methods, we can reduce the training time complexity to be constant in the number of clusters used. We demonstrate that in practice for one such method, the Gaussian Mixture VAE, the use of a continuous relaxation has no negative effect on the quality of the clustering but provides a substantial reduction in training time, reducing training time on CIFAR-100 with 20 clusters from 47 hours to less than 6 hours.
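The continuous relaxation mentioned in the abstract above can be illustrated with a Gumbel-Softmax (Concrete) sample: instead of evaluating the model once per cluster to marginalize, one draws a single soft assignment vector. This is a minimal sketch assuming the standard Concrete sampling formula; the function name `sample_concrete` and the example logits are hypothetical and not taken from the paper.

```python
import math
import random


def sample_concrete(logits, temperature, rng=random):
    """Draw one relaxed one-hot vector from a Concrete (Gumbel-Softmax) distribution."""
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # keep log(u) finite
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        perturbed.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    top = max(perturbed)
    exps = [math.exp(v - top) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cluster logits for one data point with three clusters.
rng = random.Random(0)
soft_assignment = sample_concrete([1.0, 0.0, -1.0], temperature=0.5, rng=rng)
print(soft_assignment)
```

Lower temperatures push the sample closer to a one-hot vector, while higher temperatures make it smoother; the cost of drawing one sample does not grow with the number of decoder evaluations the way explicit marginalization does.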


On Efficient Multilevel Clustering via Wasserstein Distances

arXiv.org Machine Learning

We propose a novel approach to the problem of multilevel clustering, which aims to simultaneously partition data in each group and discover grouping patterns among groups in a potentially large hierarchically structured corpus of data. Our method involves a joint optimization formulation over several spaces of discrete probability measures, which are endowed with Wasserstein distance metrics. We propose several variants of this problem, which admit fast optimization algorithms, by exploiting the connection to the problem of finding Wasserstein barycenters. Consistency properties are established for the estimates of both local and global clusters. Finally, the experimental results with both synthetic and real data are presented to demonstrate the flexibility and scalability of the proposed approach.
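To make the Wasserstein distance between discrete probability measures more concrete, here is a minimal sketch restricted to measures on the real line, where the 1-Wasserstein distance equals the area between the two cumulative distribution functions. The function name and the example measures are assumptions for illustration; the paper works with more general spaces and with Wasserstein barycenters, which this sketch does not cover.

```python
def wasserstein_1d(xs, ws, ys, vs):
    """1-Wasserstein distance between two discrete measures on the real line.

    The first measure puts mass ws[i] at xs[i]; the second puts vs[j] at ys[j].
    Both weight lists are assumed to sum to 1.
    """
    mass_p, mass_q = {}, {}
    for x, w in zip(xs, ws):
        mass_p[x] = mass_p.get(x, 0.0) + w
    for y, v in zip(ys, vs):
        mass_q[y] = mass_q.get(y, 0.0) + v
    points = sorted(set(mass_p) | set(mass_q))
    total = 0.0
    cdf_gap = 0.0  # difference between the two CDFs on the current interval
    for left, right in zip(points, points[1:]):
        cdf_gap += mass_p.get(left, 0.0) - mass_q.get(left, 0.0)
        total += abs(cdf_gap) * (right - left)
    return total


# Shifting a two-point measure by one unit costs exactly one unit of transport.
print(wasserstein_1d([0.0, 1.0], [0.5, 0.5], [1.0, 2.0], [0.5, 0.5]))
```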


DAOC: Stable Clustering of Large Networks

arXiv.org Machine Learning

Clustering is a crucial component of many data mining systems involving the analysis and exploration of various data. Data diversity calls for clustering algorithms to be accurate while providing stable (i.e., deterministic and robust) results on arbitrary input networks. Moreover, modern systems often operate with large datasets, which implicitly constrains the complexity of the clustering algorithm. Existing clustering techniques are only partially stable, however, as they guarantee either determinism or robustness. To address this issue, we introduce DAOC, a Deterministic and Agglomerative Overlapping Clustering algorithm. DAOC leverages a new technique called Overlap Decomposition to identify fine-grained clusters in a deterministic way capturing multiple optima. In addition, it leverages a novel consensus approach, Mutual Maximal Gain, to ensure robustness and further improve the stability of the results while still being capable of identifying micro-scale clusters. Our empirical results on both synthetic and real-world networks show that DAOC yields stable clusters while being on average 25% more accurate than state-of-the-art deterministic algorithms without requiring any tuning. Our approach has the ambition to greatly simplify and speed up data analysis tasks involving iterative processing (need for determinism) as well as data fluctuations (need for robustness) and to provide accurate and reproducible results. Clustering is a fundamental part of data mining with a wide applicability to statistical analysis and exploration of physical, social, biological and informational systems.


Global Optimal Path-Based Clustering Algorithm

arXiv.org Machine Learning

Combinatorial optimization problems for clustering are known to be NP-hard. Most optimization methods are not able to find the global optimum solution for all datasets. To solve this problem, we propose a global optimal path-based clustering (GOPC) algorithm in this paper. The GOPC algorithm is based on two facts: (1) medoids have the minimum degree in their clusters; (2) the minimax distance between two objects in one cluster is smaller than the minimax distance between objects in different clusters. Extensive experiments are conducted on synthetic and real-world datasets to evaluate the performance of the GOPC algorithm. The results on synthetic datasets show that the GOPC algorithm can recognize all kinds of clusters regardless of their shapes, sizes, or densities. Experimental results on real-world datasets demonstrate the effectiveness and efficiency of the GOPC algorithm. In addition, the GOPC algorithm needs only one parameter, i.e., the number of clusters, which can be estimated by the decision graph. The advantages mentioned above make GOPC a good candidate as a general clustering algorithm. In clustering algorithms, measuring the dissimilarity between any pair of points is very important. The most commonly used dissimilarity method is Euclidean distance. However, in many real-world applications of pattern classification and data mining, we are often confronted with high-dimensional features of the investigated data, which adversely affects clustering performance due to the curse of dimensionality [9], [10]. It is widely acknowledged that many real-world datasets stringently obey low-rank rules, which means that they are distributed on a manifold of a dimensionality that is often much lower than that of ambient space [11], [12], [13].
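The minimax distance referred to in the abstract above is commonly defined as the smallest possible value of the largest edge along any path between two objects. The sketch below computes all pairwise minimax distances with a Floyd-Warshall style update over Euclidean distances. It illustrates only this distance, under that assumed definition; it is not the GOPC algorithm, and the function name and sample points are hypothetical.

```python
import math


def minimax_distances(points):
    """All-pairs minimax path distances over the complete Euclidean graph."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # A path through k costs the larger of its two legs.
                via_k = max(d[i][k], d[k][j])
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d


# Three points forming a chain, plus one distant point.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
d = minimax_distances(pts)
print(d[0][2], d[0][3])
```

Within the chain the minimax distance stays at the step size of 1 even though the Euclidean distance between the end points is 2, while reaching the distant point always requires crossing the gap of 8. This is the property that lets path-based methods follow elongated clusters.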


Conformal Prediction based Spectral Clustering

arXiv.org Machine Learning

Spectral Clustering (SC) is a prominent data clustering technique of recent times which has attracted much attention from researchers. It is a highly data-driven method and makes no strict assumptions on the structure of the data to be clustered. One of the central pieces of spectral clustering is the construction of an affinity matrix based on a similarity measure between data points. The way the similarity measure is defined between data points has a direct impact on the performance of the SC technique. Several attempts have been made to strengthen the pairwise similarity measure in order to enhance spectral clustering. In this work, we define a novel affinity measure by employing the concept of non-conformity used in the Conformal Prediction (CP) framework. The non-conformity based affinity captures the relationship between neighborhoods of data points and has the power to generalize the notion of contextual similarity. We show that this formulation of the affinity measure gives good results and compares well with state-of-the-art methods.
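For context on the affinity matrix discussed above, here is a minimal sketch of the conventional construction: a Gaussian kernel affinity followed by the symmetric normalized graph Laplacian. This is the common baseline, not the non-conformity based affinity proposed in the paper, whose definition is not given in this abstract. The function names, the bandwidth `sigma`, and the sample points are assumptions for the example.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Pairwise Gaussian kernel affinity with a zero diagonal."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-sq / (2.0 * sigma ** 2))
    return w


def normalized_laplacian(w):
    """Symmetric normalized Laplacian L = I - D^(-1/2) W D^(-1/2)."""
    n = len(w)
    degree = [sum(row) for row in w]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                lap[i][j] = 1.0
            elif degree[i] > 0.0 and degree[j] > 0.0:
                lap[i][j] = -w[i][j] / math.sqrt(degree[i] * degree[j])
    return lap


pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0)]
W = gaussian_affinity(pts, sigma=1.0)
L = normalized_laplacian(W)
print(W[0][1], W[0][2])
```

A full spectral clustering pipeline would then take the eigenvectors of `L` with the smallest eigenvalues and cluster their rows, which is left out here to keep the sketch dependency-free.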


Unsupervised Learning with Clustering Techniques w/Srini Anand

#artificialintelligence

As humans, we are able to discern differences among different groups within a collection. We might sort a collection into broad groups such as birds versus plants versus animals, or detect subtle features to identify different makes and models of cars. Clustering techniques allow us to automate the process and apply it to data where groupings are not immediately obvious. These techniques are used for different purposes such as detecting market segments, identifying properties of online communities, fraud detection, and cybersecurity. Srini Anand is a Data Scientist at Ameritas Life Insurance Company and holds a Master's degree in Data Science from Indiana University.


Online k-means Clustering

arXiv.org Machine Learning

We study the problem of online clustering where a clustering algorithm has to assign a new point that arrives to one of $k$ clusters. The specific formulation we use is the $k$-means objective: At each time step the algorithm has to maintain a set of $k$ candidate centers and the loss incurred is the squared distance between the new point and the closest center. The goal is to minimize regret with respect to the best solution to the $k$-means objective ($\mathcal{C}$) in hindsight. We show that provided the data lies in a bounded region, an implementation of the Multiplicative Weights Update Algorithm (MWUA) using a discretized grid achieves a regret bound of $\tilde{O}(\sqrt{T})$ in expectation. We also present an online-to-offline reduction that shows that an efficient no-regret online algorithm (despite being allowed to choose a different set of candidate centres at each round) implies an offline efficient algorithm for the $k$-means problem. In light of this hardness, we consider the slightly weaker requirement of comparing regret with respect to $(1 + \epsilon) \mathcal{C}$ and present a no-regret algorithm with runtime $O\left(T \cdot \mathrm{poly}(\log(T),k,d,1/\epsilon)^{k(d+O(1))}\right)$. Our algorithm is based on maintaining an incremental coreset and an adaptive variant of the MWUA. We show that naïve online algorithms, such as Follow The Leader, fail to produce sublinear regret in the worst case. We also report preliminary experiments with synthetic and real-world data.
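The loss and regret described in the abstract above can be written down directly. The sketch below computes the per-round loss (squared distance from the arriving point to its closest current center) and the regret against a fixed comparator set of centers. The function names are hypothetical, and the comparator centers are supplied by the caller rather than computed, since finding the best $k$-means solution in hindsight is itself a hard problem. This is not the MWUA or coreset algorithm from the paper.

```python
def kmeans_loss(point, centers):
    """Squared Euclidean distance from a point to its closest center."""
    return min(sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers)


def regret(points, centers_per_round, comparator_centers):
    """Online loss minus the loss of one fixed set of centers chosen in hindsight."""
    online = sum(kmeans_loss(x, cs) for x, cs in zip(points, centers_per_round))
    offline = sum(kmeans_loss(x, comparator_centers) for x in points)
    return online - offline


# Points alternate between 0 and 2; the online player keeps a single center at 0.
stream = [(0.0,), (2.0,), (0.0,), (2.0,)]
played = [[(0.0,)] for _ in stream]
print(regret(stream, played, comparator_centers=[(1.0,)]))
```

Here the online player pays 0 + 4 + 0 + 4 = 8 while the fixed center at 1 pays 1 per round, so the regret is 4.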


Multi-graph Fusion for Multi-view Spectral Clustering

arXiv.org Machine Learning

For example, a person can be uniquely identified in terms of face, fingerprint, iris, and signature; an image can be described by different kinds of descriptors: SIFT, HOG, and LBP, where SIFT is robust to image illumination, noise, and rotation, HOG is sensitive to marginal information, while LBP is a powerful texture feature; the same document can be represented in different languages. Different views can capture distinct perspectives of data. Numerous real-world applications have benefited from multi-view data by leveraging the complementary information [5, 6, 7, 8, 9]. Thus, multi-view learning has become an important research field [10, 11]. As an important ingredient of multi-view learning, multi-view clustering has been widely investigated to identify underlying structures in multi-view data in an unsupervised way [12, 13]. Although each view contains different partial information, together the views admit the same clustering structure. Simply concatenating all features into a single view and then employing a clustering algorithm on this single-view data might not obtain better performance than traditional methods that use each single view separately [14, 11]. In the past decade, plenty of advanced multi-view clustering algorithms have been proposed, and they perform effectively by considering the diversity and complementarity of different views.


Authorship Analysis as a Text Classification or Clustering Problem

#artificialintelligence

Many such 'literary' quandaries are inspected by expert linguists, as analysing and categorising discourses is fairly complex, domain-specific and highly multi-dimensional. One of the latest research areas in Natural Language Processing is Authorship Analysis, which tries to leverage the computational power of big data and artificial intelligence, combined with linguistics and cognitive psychology, to automate the classification of texts, the identification of author profiles and the resolution of authorship conflicts. This article is an attempt to introduce the concept of authorship analysis, its application areas and the major sub-tasks associated with it. The art and science of discriminating between writing styles of authors by identifying the characteristics of the persona of the authors and examining articles authored by them is called Authorship Analysis. Consequently, it also aims to determine biographic characteristics of an individual, like age, gender, native language and cognitive psychological traits, based on "available information" pertaining to that individual. In this article, "available information" refers to textual data only. In the broader context of authorship analysis, however, information could go beyond textual format, as it might also involve the use of multi-modal observations.