Clustering
Towards Intelligent Vehicular Networks: A Machine Learning Framework
Liang, Le, Ye, Hao, Li, Geoffrey Ye
As wireless networks evolve towards high mobility and providing better support for connected vehicles, a number of new challenges arise due to the resulting high dynamics in vehicular environments and thus motive rethinking of traditional wireless design methodologies. Future intelligent vehicles, which are at the heart of high mobility networks, are increasingly equipped with multiple advanced onboard sensors and keep generating large volumes of data. Machine learning, as an effective approach to artificial intelligence, can provide a rich set of tools to exploit such data for the benefit of the networks. In this article, we first identify the distinctive characteristics of high mobility vehicular networks and motivate the use of machine learning to address the resulting challenges. After a brief introduction of the major concepts of machine learning, we discuss its applications to learn the dynamics of vehicular networks and make informed decisions to optimize network performance. In particular, we discuss in greater detail the application of reinforcement learning in managing network resources as an alternative to the prevalent optimization approach. Finally, some open issues worth further investigation are highlighted.
Mode-Seeking Clustering and Density Ridge Estimation via Direct Estimation of Density-Derivative-Ratios
Sasaki, Hiroaki, Kanamori, Takafumi, Hyvรคrinen, Aapo, Niu, Gang, Sugiyama, Masashi
Modes and ridges of the probability density function behind observed data are useful geometric features. Mode-seeking clustering assigns cluster labels by associating data samples with the nearest modes, and estimation of density ridges enables us to find lower-dimensional structures hidden in data. A key technical challenge both in mode-seeking clustering and density ridge estimation is accurate estimation of the ratios of the first- and second-order density derivatives to the density. A naive approach takes a three-step approach of first estimating the data density, then computing its derivatives, and finally taking their ratios. However, this three-step approach can be unreliable because a good density estimator does not necessarily mean a good density derivative estimator, and division by the estimated density could significantly magnify the estimation error. To cope with these problems, we propose a novel estimator for the \emph{density-derivative-ratios}. The proposed estimator does not involve density estimation, but rather \emph{directly} approximates the ratios of density derivatives of any order. Moreover, we establish a convergence rate of the proposed estimator. Based on the proposed estimator, novel methods both for mode-seeking clustering and density ridge estimation are developed, and the respective convergence rates to the mode and ridge of the underlying density are also established. Finally, we experimentally demonstrate that the developed methods significantly outperform existing methods, particularly for relatively high-dimensional data.
On Hyperparameter Search in Cluster Ensembles
Helfmann, Luzie, von Lindheim, Johannes, Mollenhauer, Mattes, Banisch, Ralf
Quality assessments of models in unsupervised learning and clustering verification in particular have been a long-standing problem in the machine learning research. The lack of robust and universally applicable cluster validity scores often makes the algorithm selection and hyperparameter evaluation a tough guess. In this paper, we show that cluster ensemble aggregation techniques such as consensus clustering may be used to evaluate clusterings and their hyperparameter configurations. We use normalized mutual information to compare individual objects of a clustering ensemble to the constructed consensus of the whole ensemble and show, that the resulting score can serve as an overall quality measure for clustering problems. This method is capable of highlighting the standout clustering and hyperparameter configuration in the ensemble even in the case of a distorted consensus. We apply this very general framework to various data sets and give possible directions for future research.
COBRAS: Fast, Iterative, Active Clustering with Pairwise Constraints
Van Craenendonck, Toon, Dumanฤiฤ, Sebastijan, Van Wolputte, Elia, Blockeel, Hendrik
Constraint-based clustering algorithms exploit background knowledge to construct clusterings that are aligned with the interests of a particular user. This background knowledge is often obtained by allowing the clustering system to pose pairwise queries to the user: should these two elements be in the same cluster or not? Active clustering methods aim to minimize the number of queries needed to obtain a good clustering by querying the most informative pairs first. Ideally, a user should be able to answer a couple of these queries, inspect the resulting clustering, and repeat these two steps until a satisfactory result is obtained. We present COBRAS, an approach to active clustering with pairwise constraints that is suited for such an interactive clustering process. A core concept in COBRAS is that of a super-instance: a local region in the data in which all instances are assumed to belong to the same cluster. COBRAS constructs such super-instances in a top-down manner to produce high-quality results early on in the clustering process, and keeps refining these super-instances as more pairwise queries are given to get more detailed clusterings later on. We experimentally demonstrate that COBRAS produces good clusterings at fast run times, making it an excellent candidate for the iterative clustering scenario outlined above.
Active Metric Learning for Supervised Classification
Kumaran, Krishnan, Papageorgiou, Dimitri, Chang, Yutong, Li, Minhan, Takรกฤ, Martin
Clustering and classification critically rely on distance metrics that provide meaningful comparisons between data points. We present mixed-integer optimization approaches to find optimal distance metrics that generalize the Mahalanobis metric extensively studied in the literature. Additionally, we generalize and improve upon leading methods by removing reliance on pre-designated "target neighbors," "triplets," and "similarity pairs." Another salient feature of our method is its ability to enable active learning by recommending precise regions to sample after an optimal metric is computed to improve classification performance. This targeted acquisition can significantly reduce computational burden by ensuring training data completeness, representativeness, and economy. We demonstrate classification and computational performance of the algorithms through several simple and intuitive examples, followed by results on real image and medical datasets.
Extending churn analysis to revenue forecasting using R
Here, train_cust_summary_temp is the prepared training data, clusters_used is the parameter with the number of clusters to use in the model, all_methods [] is a vector holding the available methods, and cluster_method is the particular method used on this pass. The cluster_model variable is a list which contains information on cluster membership, the cluster centers, and other information.
Volkswagen Refining Machine Learning on D-Wave System
Researchers at Volkswagen have been at the cutting edge of implementing D-Wave quantum computers for a number of complex optimization problems, including traffic flow optimization, among other potential use cases. These efforts are generally focused on developing algorithms suitable for the company's recently purchased 2000-qubit quantum system and have expanded to a range of new machine learning possibilities, including what a research team at the company's U.S. R&D office and the Volkswagen Data:Lab in Munich are calling quantum-assisted cluster analysis. The art and science of clustering is well known for machine learning on classical computing architectures, but the VW approach that has been tailored to perform well on a quantum processor has undergone major alterations. As the team describes, to be mapped to the D-Wave architectures, the problem has to be expressed as a quadratic unconstrained binary optimization problem. Once done, the accuracy of results is similar to the what a classical clustering algorithm can produce.
A Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression
Rebrova, Elizaveta, Chavez, Gustavo, Liu, Yang, Ghysels, Pieter, Li, Xiaoye Sherry
We present memory-efficient and scalable algorithms for kernel methods used in machine learning. Using hierarchical matrix approximations for the kernel matrix the memory requirements, the number of floating point operations, and the execution time are drastically reduced compared to standard dense linear algebra routines. We consider both the general $\mathcal{H}$ matrix hierarchical format as well as Hierarchically Semi-Separable (HSS) matrices. Furthermore, we investigate the impact of several preprocessing and clustering techniques on the hierarchical matrix compression. Effective clustering of the input leads to a ten-fold increase in efficiency of the compression. The algorithms are implemented using the STRUMPACK solver library. These results confirm that --- with correct tuning of the hyperparameters --- classification using kernel ridge regression with the compressed matrix does not lose prediction accuracy compared to the exact --- not compressed --- kernel matrix and that our approach can be extended to $\mathcal{O}(1M)$ datasets, for which computation with the full kernel matrix becomes prohibitively expensive. We present numerical experiments in a distributed memory environment up to 1,024 processors of the NERSC's Cori supercomputer using well-known datasets to the machine learning community that range from dimension 8 up to 784.
A high-bias, low-variance introduction to Machine Learning for physicists
Mehta, Pankaj, Bukov, Marin, Wang, Ching-Hao, Day, Alexandre G. R., Richardson, Clint, Fisher, Charles K., Schwab, David J.
Machine Learning (ML) is one of the most exciting and dynamic areas of modern research and application. The purpose of this review is to provide an introduction to the core concepts and tools of machine learning in a manner easily understood and intuitive to physicists. The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, and generalization before moving on to more advanced topics in both supervised and unsupervised learning. Topics covered in the review include ensemble models, deep learning and neural networks, clustering and data visualization, energy-based models (including MaxEnt models and Restricted Boltzmann Machines), and variational methods. Throughout, we emphasize the many natural connections between ML and statistical physics. A notable aspect of the review is the use of Python notebooks to introduce modern ML/statistical packages to readers using physics-inspired datasets (the Ising Model and Monte-Carlo simulations of supersymmetric decays of proton-proton collisions). We conclude with an extended outlook discussing possible uses of machine learning for furthering our understanding of the physical world as well as open problems in ML where physicists maybe able to contribute. (Notebooks are available at https://physics.bu.edu/~pankajm/MLnotebooks.html )
Clustering to Reduce Spatial Data Set Size
Traditionally it had been a problem that researchers did not have access to enough spatial data to answer pressing research questions or build compelling visualizations. Today, however, the problem is often that we have too much data. Spatially redundant or approximately redundant points may refer to a single feature (plus noise) rather than many distinct spatial features. We can use density-based clustering to compress such spatial data into a set of representative features. This paper demonstrates how to reduce the size of a spatial data set of GPS latitude-longitude coordinates using the Python programming language and its scikitlearn implementation of the DBSCAN density-based clustering algorithm. DBSCAN works very well in low-dimension space, such as the two-dimensional feature space in this geospatial example.