Clustering
Collaborative Learning and Personalization in Multi-Agent Stochastic Linear Bandits
Ghosh, Avishek, Sankararaman, Abishek, Ramchandran, Kannan
We consider the problem of minimizing regret in an $N$ agent heterogeneous stochastic linear bandits framework, where the agents (users) are similar but not all identical. We model user heterogeneity using two popularly used ideas in practice; (i) A clustering framework where users are partitioned into groups with users in the same group being identical to each other, but different across groups, and (ii) a personalization framework where no two users are necessarily identical, but a user's parameters are close to that of the population average. In the clustered users' setup, we propose a novel algorithm, based on successive refinement of cluster identities and regret minimization. We show that, for any agent, the regret scales as $\mathcal{O}(\sqrt{T/N})$, if the agent is in a `well separated' cluster, or scales as $\mathcal{O}(T^{\frac{1}{2} + \varepsilon}/(N)^{\frac{1}{2} -\varepsilon})$ if its cluster is not well separated, where $\varepsilon$ is positive and arbitrarily close to $0$. Our algorithm is adaptive to the cluster separation, and is parameter free -- it does not need to know the number of clusters, separation and cluster size, yet the regret guarantee adapts to the inherent complexity. In the personalization framework, we introduce a natural algorithm where, the personal bandit instances are initialized with the estimates of the global average model. We show that, an agent $i$ whose parameter deviates from the population average by $\epsilon_i$, attains a regret scaling of $\widetilde{O}(\epsilon_i\sqrt{T})$. This demonstrates that if the user representations are close (small $\epsilon_i)$, the resulting regret is low, and vice-versa. The results are empirically validated and we observe superior performance of our adaptive algorithms over non-adaptive baselines.
Clustering Geolocation Data in Python using DBSCAN and K-Means
Clustering is a technique of dividing the population or data points, grouping them into different clusters on the basis of similarity and dissimilarity between them. It's helps in determining the intrinsic group among the unlabeled data points. In this project we will be using Taxi dataset ( can be downloaded from Kaggle) and perform clustering Geolocation Data using K-Means and demostrate how to use DBSCAN Density-Based Spatial Clustering of Applications with Noise (DBSCAN) which discovers clusters of different shapes and sizes from data containing noise and outliers and HDBSCAN -- Hierarchical Density-Based Spatial Clustering of Applications with Noise which performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. Folium makes it easy to visualize data that's been manipulated in Python on an interactive leaflet map. It enables both the binding of data to a map for choropleth visualizations as well as passing rich vector/raster/HTML visualizations as markers on the map. The library has a number of built-in tilesets from OpenStreetMap, Mapbox, and Stamen, and supports custom tilesets with Mapbox or Cloudmade API keys.
Multiple Dynamic Pricing for Demand Response with Adaptive Clustering-based Customer Segmentation in Smart Grids
Meng, Fanlin, Ma, Qian, Liu, Zixu, Zeng, Xiao-Jun
In this paper, we propose a realistic multiple dynamic pricing approach to demand response in the retail market. First, an adaptive clustering-based customer segmentation framework is proposed to categorize customers into different groups to enable the effective identification of usage patterns. Second, customized demand models with important market constraints which capture the price-demand relationship explicitly, are developed for each group of customers to improve the model accuracy and enable meaningful pricing. Third, the multiple pricing based demand response is formulated as a profit maximization problem subject to realistic market constraints. The overall aim of the proposed scalable and practical method aims to achieve 'right' prices for 'right' customers so as to benefit various stakeholders in the system such as grid operators, customers and retailers. The proposed multiple pricing framework is evaluated via simulations based on real-world datasets.
Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time
Dhulipala, Laxman, Eisenstat, David, ลฤ cki, Jakub, Mirrokni, Vahab, Shi, Jessica
We study the widely used hierarchical agglomerative clustering (HAC) algorithm on edge-weighted graphs. We define an algorithmic framework for hierarchical agglomerative graph clustering that provides the first efficient $\tilde{O}(m)$ time exact algorithms for classic linkage measures, such as complete- and WPGMA-linkage, as well as other measures. Furthermore, for average-linkage, arguably the most popular variant of HAC, we provide an algorithm that runs in $\tilde{O}(n\sqrt{m})$ time. For this variant, this is the first exact algorithm that runs in subquadratic time, as long as $m=n^{2-\epsilon}$ for some constant $\epsilon > 0$. We complement this result with a simple $\epsilon$-close approximation algorithm for average-linkage in our framework that runs in $\tilde{O}(m)$ time. As an application of our algorithms, we consider clustering points in a metric space by first using $k$-NN to generate a graph from the point set, and then running our algorithms on the resulting weighted graph. We validate the performance of our algorithms on publicly available datasets, and show that our approach can speed up clustering of point datasets by a factor of 20.7--76.5x.
A Deep Variational Approach to Clustering Survival Data
Manduchi, Laura, Marcinkeviฤs, Riฤards, Massi, Michela C., Gotta, Verena, Mรผller, Timothy, Vasella, Flavio, Neidert, Marian C., Pfister, Marc, Vogt, Julia E.
Survival analysis has gained significant attention in the medical domain and has many far-reaching applications. Although a variety of machine learning methods have been introduced for tackling time-to-event prediction in unstructured data with complex dependencies, clustering of survival data remains an under-explored problem. The latter is particularly helpful in discovering patient subpopulations whose survival is regulated by different generative mechanisms, a critical problem in precision medicine. To this end, we introduce a novel probabilistic approach to cluster survival data in a variational deep clustering setting. Our proposed method employs a deep generative model to uncover the underlying distribution of both the explanatory variables and the potentially censored survival times. We compare our model to the related work on survival clustering in comprehensive experiments on a range of synthetic, semi-synthetic, and real-world datasets. Our proposed method performs better at identifying clusters and is competitive at predicting survival times in terms of the concordance index and relative absolute error. To further demonstrate the usefulness of our approach, we show that our method identifies meaningful clusters from an observational cohort of hemodialysis patients that are consistent with previous clinical findings.
Swarm Intelligence for Self-Organized Clustering
Thrun, Michael C., Ultsch, Alfred
Algorithms implementing populations of agents which interact with one another and sense their environment may exhibit emergent behavior such as self-organization and swarm intelligence. Here a swarm system, called Databionic swarm (DBS), is introduced which is able to adapt itself to structures of high-dimensional data characterized by distance and/or density-based structures in the data space. By exploiting the interrelations of swarm intelligence, self-organization and emergence, DBS serves as an alternative approach to the optimization of a global objective function in the task of clustering. The swarm omits the usage of a global objective function and is parameter-free because it searches for the Nash equilibrium during its annealing process. To our knowledge, DBS is the first swarm combining these approaches. Its clustering can outperform common clustering methods such as K-means, PAM, single linkage, spectral clustering, model-based clustering, and Ward, if no prior knowledge about the data is available. A central problem in clustering is the correct estimation of the number of clusters. This is addressed by a DBS visualization called topographic map which allows assessing the number of clusters. It is known that all clustering algorithms construct clusters, irrespective of the data set contains clusters or not. In contrast to most other clustering algorithms, the topographic map identifies, that clustering of the data is meaningless if the data contains no (natural) clusters. The performance of DBS is demonstrated on a set of benchmark data, which are constructed to pose difficult clustering problems and in two real-world applications.
Gaussian Mixture Estimation from Weighted Samples
Frisch, Daniel, Hanebeck, Uwe D.
Given a set of samples, the parameters of a GM are determined in such a way as to best fit the samples in a maximum likelihood way. Solutions for equally weighted samples are readily available, expectation-maximization (EM) based methods being the most prevalent because of low computational requirements and ease of implementation. So it comes as a surprise that GM estimation for weighted samples is hard to find in literature. It might be even more surprising that the standard reference [1] gives incorrect results, see Figure 1. 2. Context Applications for sample-to-density function approximation include clustering of unlabled data [2, 3], multi-target tracking [4, 5], group tracking [6], multilateration [7, 8], and arbitrary density representation in nonlinear filters [9, 10]. A popular basic solution to this is the k-means algorithm. It does not find a complete density representation, only the means of the individual clusters. The k-means algorithm uses hard sample-tomean associations, therefore yields merely approximate solutions but can be computationally optimized using k-d trees [11, 12]. Moreover, the global optimum can be found deterministically [13], therefore it can be used to provide an initial guess for more elaborate algorithms. A sample-to-density approximation that is optimal in a maximum likelihood sense can be searched with numerical optimization techniques such as the Newton algorithm that has quadratic convergence but high computational demand per iteration, quasi-Newton methods, the method of scoring, or the conjugate gradient method with slower convergence but less computational effort per iteration [14].
A 2020 taxonomy of algorithms inspired on living beings behavior
Since the emerge of ideas about simulation of life in last decades, several algorithms have been proposed to solve complex problems inspired on nature phenomena; i.e. evolutionary computation or artificial life. A role of a naturalist or biologist is taken with the purpose for studying all living forms in a new ecosystem and trying to make a classification of all discoveries to form a taxonomy of living beings. This role is taken as a computer naturalist to make a compilation of algorithms inspired on behavior of living beings. There are several bio-inspired algorithms; however, this work focus on actions of living beings like the growth of plants, reproduction of mushrooms, living of bacteria, the individuals behavior of animals, etc.; however, highlights the interactions between individuals of a group of different animals like school of fishes, flock of birds, herd of mammals, or swarm of insects. Focusing on algorithms inspired in actions of living beings that belongs to any kingdom of the nature; nevertheless, it is important to locate all algorithms as possible. Only basic algorithms are considered, but derivations, variants and hybrids are omitted; at least, algorithms which involves an inspiration of any living being. Location of bio-inspired algorithms related with a specific species is made by a review of several papers of surveys which involve nature bio-inspired, swarm intelligence, and metaheuristics algorithms; however, several of these surveys consider different points of view. It was consider only survey papers from ten years old ago because it is expected a more complete reviews since then. Surveys span in many cases all kind of algorithms; however many of them have been proposed recently; it maybe because the year 2020 is iconic.
Weighted Sparse Subspace Representation: A Unified Framework for Subspace Clustering, Constrained Clustering, and Active Learning
Peng, Hankui, Pavlidis, Nicos G.
In many challenging real-world applications involving the grouping of high-dimensional data, different clusters can be well approximated as lower dimensional subspaces. This is the case for example in gene sequencing (McWilliams and Montana, 2014), face clustering (Elhamifar and Vidal, 2013), motion segmentation (Rao et al., 2010), and text mining (Peng et al., 2018). The problem of simultaneously estimating the subspace corresponding to each cluster and partitioning a group of points into clusters according to these subspaces is called subspace clustering (Vidal, 2011). Spectral methods for subspace clustering have demonstrated excellent performance in numerous real-world applications (Liu et al., 2012; Lu et al., 2012; Elhamifar and Vidal, 2013; Li and Vidal, 2015; Huang et al., 2015). These methods construct an affinity matrix for spectral clustering by solving an optimisation problem that aims to approximate each point through a linear combination of other points from the same subspace.
Learning without Knowing: Unobserved Context in Continuous Transfer Reinforcement Learning
Liu, Chenyu, Zhang, Yan, Shen, Yi, Zavlanos, Michael M.
In this paper, we consider a transfer Reinforcement Learning (RL) problem in continuous state and action spaces, under unobserved contextual information. For example, the context can represent the mental view of the world that an expert agent has formed through past interactions with this world. We assume that this context is not accessible to a learner agent who can only observe the expert data. Then, our goal is to use the context-aware expert data to learn an optimal context-unaware policy for the learner using only a few new data samples. Such problems are typically solved using imitation learning that assumes that both the expert and learner agents have access to the same information. However, if the learner does not know the expert context, using the expert data alone will result in a biased learner policy and will require many new data samples to improve. To address this challenge, in this paper, we formulate the learning problem as a causal bound-constrained Multi-Armed-Bandit (MAB) problem. The arms of this MAB correspond to a set of basis policy functions that can be initialized in an unsupervised way using the expert data and represent the different expert behaviors affected by the unobserved context. On the other hand, the MAB constraints correspond to causal bounds on the accumulated rewards of these basis policy functions that we also compute from the expert data. The solution to this MAB allows the learner agent to select the best basis policy and improve it online. And the use of causal bounds reduces the exploration variance and, therefore, improves the learning rate. We provide numerical experiments on an autonomous driving example that show that our proposed transfer RL method improves the learner's policy faster compared to existing imitation learning methods and enjoys much lower variance during training.