Clustering
Vector Quantization as Sparse Least Square Optimization
Vector quantization aims to form new vectors/matrices with shared values close to the original. It could compress data with acceptable information loss, and could be of great usefulness in areas like Image Processing, Pattern Recognition and Machine Learning. In recent years, the importance of quantization has been soaring as it has been discovered huge potentials in deploying practical neural networks, which is among one of the most popular research topics. Conventional vector quantization methods usually suffer from their own flaws: hand-coding domain rules quantization could produce poor results when encountering complex data, and clustering-based algorithms have the problem of inexact solution and high time consumption. In this paper, we explored vector quantization problem from a new perspective of sparse least square optimization and designed multiple algorithms with their program implementations. Specifically, deriving from a sparse form of coefficient matrix, three types of sparse least squares, with $l_0$, $l_1$, and generalized $l_1 + l_2$ penalizations, are designed and implemented respectively. In addition, to produce quantization results with given amount of quantized values(instead of penalization coefficient $\lambda$), this paper proposed a cluster-based least square quantization method, which could also be regarded as an improvement of information preservation of conventional clustering algorithm. The algorithms were tested on various data and tasks and their computational properties were analyzed. The paper offers a new perspective to probe the area of vector quantization, while the algorithms proposed could provide more appropriate options for quantization tasks under different circumstances.
Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising
Jin, Junqi, Song, Chengru, Li, Han, Gai, Kun, Wang, Jun, Zhang, Weinan
Real-time advertising allows advertisers to bid for each impression for a visiting user. To optimize a specific goal such as maximizing the revenue led by ad placements, advertisers not only need to estimate the relevance between the ads and user's interests, but most importantly require a strategic response with respect to other advertisers bidding in the market. In this paper, we formulate bidding optimization with multi-agent reinforcement learning. To deal with a large number of advertisers, we propose a clustering method and assign each cluster with a strategic bidding agent. A practical Distributed Coordinated Multi-Agent Bidding (DCMAB) has been proposed and implemented to balance the tradeoff between the competition and cooperation among advertisers. The empirical study on our industry-scaled real-world data has demonstrated the effectiveness of our modeling methods. Our results show that a cluster based bidding would largely outperform single-agent and bandit approaches, and the coordinated bidding achieves better overall objectives than the purely self-interested bidding agents.
Behavioral-clinical phenotyping with type 2 diabetes self-monitoring data
Levine, Matthew E., Albers, David J., Burgermaster, Marissa, Davidson, Patricia G., Smaldone, Arlene M., Mamykina, Lena
Words: 4252 Keywords: self-monitoring data, type 2 diabetes, machine learning, phenotyping, precision medicine ABSTRACT Objective: To evaluate unsupervised clustering methods for identifying individual-level behavioral-clinical phenotypes that relate personal biomarkers and behavioral traits in type 2 diabetes (T2DM) self-monitoring data. Materials and Methods: We used hierarchical clustering (HC) to identify groups of meals with similar nutrition and glycemic impact for 6 individuals with T2DM who collected self-monitoring data. We evaluated clusters on: 1) correspondence to gold standards generated by certified diabetes educators (CDEs) for 3 participants; 2) face validity, rated by CDEs, and 3) impact on CDEs' ability to identify patterns for another 3 participants. Results: Gold standard (GS) included 9 patterns across 3 participants. Of these, all 9 were rediscovered using HC: 4 GS patterns were consistent with patterns identified by HC (over 50% of meals in a cluster followed the pattern); another 5 were included as subgroups in broader clusers. After reviewing clusters, CDEs identified patterns that were more consistent with data (70% reduction in contradictions between patterns and participants' records). Discussion: Hierarchical clustering of blood glucose and macronutrient consumption appears suitable for discovering behavioral-clinical phenotypes in T2DM. Most clusters corresponded to gold standard and were rated positively by CDEs for face validity. Cluster visualizations helped CDEs identify more robust patterns in nutrition and glycemic impact, creating new possibilities for visual analytic solutions. Conclusion: Machine learning methods can use diabetes self-monitoring data to create personalized behavioral-clinical phenotypes, which may prove useful for delivering personalized medicine.
Scalable and Robust Sparse Subspace Clustering Using Randomized Clustering and Multilayer Graphs
Abdolali, Maryam, Gillis, Nicolas, Rahmati, Mohammad
Sparse subspace clustering (SSC) is one of the current state-of-the-art methods for partitioning data points into the union of subspaces, with strong theoretical guarantees. However, it is not practical for large data sets as it requires solving a LASSO problem for each data point, where the number of variables in each LASSO problem is the number of data points. To improve the scalability of SSC, we propose to select a few sets of anchor points using a randomized hierarchical clustering method, and, for each set of anchor points, solve the LASSO problems for each data point allowing only anchor points to have a non-zero weight (this reduces drastically the number of variables). This generates a multilayer graph where each layer corresponds to a different set of anchor points. Using the Grassmann manifold of orthogonal matrices, the shared connectivity among the layers is summarized within a single subspace. Finally, we use $k$-means clustering within that subspace to cluster the data points, similarly as done by spectral clustering in SSC. We show on both synthetic and real-world data sets that the proposed method not only allows SSC to scale to large-scale data sets, but that it is also much more robust as it performs significantly better on noisy data and on data with close susbspaces and outliers, while it is not prone to oversegmentation.
An efficient $k$-means-type algorithm for clustering datasets with incomplete records
Lithio, Andrew, Maitra, Ranjan
The $k$-means algorithm is the most popular nonparametric clustering method in use, but cannot generally be applied to data sets with missing observations. The usual practice with such data sets is to either impute the values under an assumption of a missing-at-random mechanism or to ignore the incomplete records, and then to use the desired clustering method. We develop an efficient version of the $k$-means algorithm that allows for clustering cases where not all the features have observations recorded. Our extension is called $k_m$-means and reduces to the $k$-means algorithm when all records are complete. We also provide strategies to initialize our algorithm and to estimate the number of groups in the data set. Illustrations and simulations demonstrate the efficacy of our approach in a variety of settings and patterns of missing data. Our methods are also applied to the clustering of gamma-ray bursts and to the analysis of activation images obtained from a functional Magnetic Resonance Imaging experiment.
Spectrally approximating large graphs with smaller graphs
Loukas, Andreas, Vandergheynst, Pierre
How does coarsening affect the spectrum of a general graph? We provide conditions such that the principal eigenvalues and eigenspaces of a coarsened and original graph Laplacian matrices are close. The achieved approximation is shown to depend on standard graph-theoretic properties, such as the degree and eigenvalue distributions, as well as on the ratio between the coarsened and actual graph sizes. Our results carry implications for learning methods that utilize coarsening. For the particular case of spectral clustering, they imply that coarse eigenvectors can be used to derive good quality assignments even without refinement---this phenomenon was previously observed, but lacked formal justification.
One-Shot Coresets: The Case of k-Clustering
Bachem, Olivier, Lucic, Mario, Lattanzi, Silvio
Scaling clustering algorithms to massive data sets is a challenging task. Recently, several successful approaches based on data summarization methods, such as coresets and sketches, were proposed. While these techniques provide provably good and small summaries, they are inherently problem dependent -- the practitioner has to commit to a fixed clustering objective before even exploring the data. However, can one construct small data summaries for a wide range of clustering problems simultaneously? We affirmatively answer this question by proposing an efficient algorithm that constructs such one-shot summaries for a large family of k-clustering problems while retaining strong theoretical guarantees.
Tools for higher-order network analysis
Networks are a fundamental model of complex systems throughout the sciences, and network datasets are typically analyzed through lower-order connectivity patterns described at the level of individual nodes and edges. However, higher-order connectivity patterns captured by small subgraphs, also called network motifs, describe the fundamental structures that control and mediate the behavior of many complex systems. We develop three tools for network analysis that use higher-order connectivity patterns to gain new insights into network datasets: (1) a framework to cluster nodes into modules based on joint participation in network motifs; (2) a generalization of the clustering coefficient measurement to investigate higher-order closure patterns; and (3) a definition of network motifs for temporal networks and fast algorithms for counting them. Using these tools, we analyze data from biology, ecology, economics, neuroscience, online social networks, scientific collaborations, telecommunications, transportation, and the World Wide Web.
Must-Know: How to determine the most useful number of clusters?
Editor's note: This post was originally included as an answer to a question posed in our 17 More Must-Know Data Science Interview Questions and Answers series earlier this year. The answer was thorough enough that it was deemed to deserve its own dedicated post. With supervised learning, the number of classes in a particular set of data is known outright, since each data instance in labeled as a member of a particular existent class. In the worst case, we can scan the class attribute and count up the number of unique entries which exist. With unsupervised learning, the idea of class attributes and explicit class membership does not exist; in fact, one of the dominant forms of unsupervised learning -- data clustering -- aims to approximate class membership by minimizing interclass instance similarity and maximizing intraclass similarity.
Unsupervised vehicle recognition using incremental reseeding of acoustic signatures
Sunu, Justin, Hunter, Blake, Percus, Allon G.
Vehicle recognition and classification have broad applications, ranging from traffic flow management to military target identification. We demonstrate an unsupervised method for automated identification of moving vehicles from roadside audio sensors. Using a short-time Fourier transform to decompose audio signals, we treat the frequency signature in each time window as an individual data point. We then use a spectral embedding for dimensionality reduction. Based on the leading eigenvectors, we relate the performance of an incremental reseeding algorithm to that of spectral clustering. We find that incremental reseeding accurately identifies individual vehicles using their acoustic signatures.