Clustering
Mining Human Mobility Data to Discover Locations and Habits
Andrade, Thiago, Cancela, Brais, Gama, João
Many aspects of life are tied to places and to human mobility patterns, and the mobile devices that individuals carry are increasingly pervasive. The positioning technologies that serve these devices, such as cellular antennas (GSM networks), global navigation satellite systems (GPS), and more recently the WiFi positioning system (WPS), continuously provide large amounts of spatio-temporal data. Detecting significant places and the frequency of movements between them is therefore fundamental to understanding human behavior. In this paper, we propose a method for discovering user habits without any a priori or external knowledge. It introduces a density-based clustering for spatio-temporal data to identify meaningful places, and applies a Gaussian Mixture Model (GMM) over the set of meaningful places to identify the representations of individual habits. To evaluate the proposed method, we use two real-world datasets: one contains high-density GPS data and the other contains GSM mobile phone data in a coarse representation. The results show that the proposed method is suitable for this task, as many unique habits were identified. The method can be used to understand users' behavior and to draw their characterizing profiles, giving a panorama of the mobility patterns in the data.
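As a rough illustration of the first step (finding meaningful places by density), here is a minimal sketch that uses plain DBSCAN as a stand-in. The abstract does not say which density-based algorithm the authors use, so the algorithm choice, the function names, the parameters `eps` and `min_pts`, and the planar coordinates in metres are all assumptions made for this example. The GMM step is not shown.

```python
from math import hypot

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if hypot(xi - xj, yi - yj) <= eps]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: one label per point, 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is also a core point
    return labels

# Hypothetical positions in metres: two dense places and one stray fix.
home = [(0, 0), (1, 0), (0, 1), (1, 1)]
work = [(100, 100), (101, 100), (100, 101), (101, 101)]
stray = [(50, 50)]
labels = dbscan(home + work + stray, eps=2.0, min_pts=3)
print(labels)
```

Each non-negative label marks one candidate place, and points labelled -1 are treated as noise. A spatio-temporal variant would also take timestamps into account, which this sketch ignores.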
Determining offshore wind installation times using machine learning and open data
Tranberg, Bo, Kratmann, Kasper Koops, Stege, Jason
The installation process of offshore wind turbines requires the use of expensive jack-up vessels. These vessels regularly report their position via the Automatic Identification System (AIS). This paper introduces a novel approach of applying machine learning to AIS data from jack-up vessels. We apply the new method to 13 offshore wind farms in Danish, German and British waters. For each of the wind farms we identify individual turbine locations, individual installation times, time in transit and time in harbor for the respective vessel. This is done in an automated way exclusively using AIS data with no prior knowledge of turbine locations, thus enabling a detailed description of the entire installation process.
kjahan/clustering
This implementation programmatically optimizes the number of clusters (k) and, at the end of the clustering process, stores the clusters to disk. You can test the code with the San Francisco crimes data in the "inputs" folder. To test with your own location data, first copy your location file, in CSV format, into the "inputs" folder. Next, pass your filename as a parameter to the clustering program. Your CSV file should have the "Lat,Lon" format.
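The description does not say how the repository picks k, so the following is only a sketch of one common approach: run k-means for increasing k and stop once an extra cluster no longer reduces the within-cluster sum of squares by much. The helper names, the `min_gain` threshold, the farthest-point initialisation, and the treatment of latitude/longitude as planar coordinates are all assumptions for illustration and are not taken from kjahan/clustering.

```python
import csv
import io

def load_points(fileobj):
    """Read (lat, lon) pairs from a CSV file object with a "Lat,Lon" header."""
    return [(float(r["Lat"]), float(r["Lon"])) for r in csv.DictReader(fileobj)]

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def init_centers(points, k):
    """Farthest-point initialisation: deterministic, so no random seed is needed."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=50):
    """Lloyd's algorithm; returns (centers, within-cluster sum of squares)."""
    centers = init_centers(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        new_centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else centers[i]          # keep the old center of an empty group
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, wcss

def choose_k(points, k_max, min_gain=0.5):
    """Largest k reached before an extra cluster cuts WCSS by less than min_gain."""
    _, prev = kmeans(points, 1)
    for k in range(2, k_max + 1):
        _, cur = kmeans(points, k)
        if prev == 0 or (prev - cur) / prev < min_gain:
            return k - 1
        prev = cur
    return k_max

# Hypothetical input in the "Lat,Lon" format described above.
sample = io.StringIO(
    "Lat,Lon\n0,0\n0,1\n1,0\n10,10\n10,11\n11,10\n20,0\n20,1\n21,0\n")
points = load_points(sample)
print(choose_k(points, k_max=5))
```

For real coordinates spread over a large area, a great-circle distance would be more appropriate than the planar squared distance used here.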
No Free Lunch But A Cheaper Supper: A General Framework for Streaming Anomaly Detection
Calikus, Ece, Nowaczyk, Slawomir, Sant'Anna, Anita, Dikmen, Onur
In recent years, there has been increased research interest in detecting anomalies in temporal streaming data. A variety of algorithms have been developed in the data mining community, which can be divided into two categories (i.e., general and ad hoc). In most cases, general approaches assume the one-size-fits-all solution model where a single anomaly detector can detect all anomalies in any domain. To date, there exists no single general method that has been shown to outperform the others across different anomaly types, use cases and datasets. On the other hand, ad hoc approaches that are designed for a specific application lack flexibility. Adapting an existing algorithm is not straightforward if the specific constraints or requirements for the existing task change. In this paper, we propose SAFARI, a general framework formulated by abstracting and unifying the fundamental tasks in streaming anomaly detection, which provides a flexible and extensible anomaly detection procedure. SAFARI helps to facilitate more elaborate algorithm comparisons by allowing us to isolate the effects of shared and unique characteristics of different algorithms on detection performance. Using SAFARI, we have implemented various anomaly detectors and identified a research gap that motivates us to propose a novel learning strategy in this work. We conducted an extensive evaluation study of 20 detectors that are composed using SAFARI and compared their performances using real-world benchmark datasets with different properties. The results indicate that there is no single superior detector that works well for every case, proving our hypothesis that "there is no free lunch" in the streaming anomaly detection world. Finally, we discuss the benefits and drawbacks of each method in-depth and draw a set of conclusions to guide future users of SAFARI.
57 Best Machine Learning Course Online & Tutorial Digital Learning Land
Data visualization: In this section, you will learn how to create simple plots such as scatter plots, histograms, and bar charts. Data manipulation: You will learn about data manipulation in detail. GUI programming: This section combines live instructor-led training and self-paced learning. Developing web maps and representing information using plots: In this section, you will understand how to design Python applications. Computer vision using OpenCV and visualization using Bokeh: You will also learn to design Python applications in this section.
Minimal Learning Machine: Theoretical Results and Clustering-Based Reference Point Selection
Hämäläinen, Joonas, Alencar, Alisson S. C., Kärkkäinen, Tommi, Mattos, César L. C., Júnior, Amauri H. Souza, Gomes, João P. P.
The Minimal Learning Machine (MLM) is a nonlinear supervised approach based on learning a linear mapping between distance matrices computed in the input and output data spaces, where distances are calculated concerning a subset of points called reference points. Its simple formulation has attracted several recent works on extensions and applications. In this paper, we aim to address some open questions related to the MLM. First, we detail theoretical aspects that assure the interpolation and universal approximation capabilities of the MLM, which were previously only empirically verified. Second, we identify the task of selecting reference points as having major importance for the MLM's generalization capability; furthermore, we assess several clustering-based methods in regression scenarios. Based on an extensive empirical evaluation, we conclude that the evaluated methods are both scalable and useful. Specifically, for a small number of reference points, the clustering-based methods outperformed the standard random selection of the original MLM formulation.
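The "linear mapping between distance matrices" can be written out as an ordinary least-squares problem. The notation below (N training points, K reference points, and the symbols D_x, \Delta_y, B, E) is assumed for this sketch and may differ from the paper's own.

```latex
% D_x      : N x K matrix of distances from each training input
%            to each input-space reference point
% \Delta_y : N x K matrix of distances from each training output
%            to the corresponding output-space reference point
% B        : K x K coefficient matrix, E : residuals
\Delta_y = D_x B + E,
\qquad
\hat{B} = \left( D_x^{\top} D_x \right)^{-1} D_x^{\top} \Delta_y
```

The closed form assumes that D_x^{\top} D_x is invertible. The choice of the K reference points determines both matrices, which is why their selection affects generalization.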
An Investigation of Quantum Deep Clustering Framework with Quantum Deep SVM & Convolutional Neural Network Feature Extractor
Bishwas, Arit Kumar, Mani, Ashish, Palade, Vasile
In this paper, we propose a deep quantum SVM formulation and demonstrate a quantum clustering framework based on the quantum deep SVM formulation, deep convolutional neural networks, and quantum K-Means clustering. We investigate the run-time computational complexity of the proposed quantum deep clustering framework and compare it with a possible classical implementation. Our investigation shows that the proposed quantum version of the deep clustering formulation demonstrates a significant performance gain (exponential speed-ups in many sections) over the possible classical implementation. The proposed theoretical quantum deep clustering framework is also an interesting and novel research direction towards a quantum-classical machine learning formulation that articulates the maximum performance.
Application of Fuzzy Clustering for Text Data Dimensionality Reduction
Large textual corpora are often represented by the document-term frequency matrix, whose elements are the frequencies of terms; however, this matrix has two problems: sparsity and high dimensionality. Four dimension reduction strategies are used to address these problems. Of the four, unsupervised feature transformation (UFT) is a popular and efficient strategy that maps the terms to a new basis in the document-term frequency matrix. Although several UFT-based methods have been developed, fuzzy clustering has not been considered for dimensionality reduction. This research explores fuzzy clustering as a new UFT-based approach to create a lower-dimensional representation of documents. The performance of fuzzy clustering, with and without global term weighting methods, is shown to exceed that of principal component analysis and singular value decomposition. This study also explores the effect of different fuzzifier values on fuzzy clustering for dimensionality reduction.
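To make the role of the fuzzifier concrete, the sketch below computes the standard fuzzy c-means membership degrees of one vector with respect to a set of fixed cluster centers. Reading the resulting membership vector as the lower-dimensional representation is one plausible interpretation of the abstract, not a description of the authors' exact pipeline. The function name and the toy vectors are hypothetical.

```python
from math import dist

def fcm_memberships(x, centers, m=2.0):
    """Fuzzy c-means membership of vector x in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the distance
    from x to center i and m > 1 is the fuzzifier.
    """
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    d = [dist(x, c) for c in centers]
    zero = [di == 0.0 for di in d]
    if any(zero):
        # x coincides with one or more centers: share membership among them.
        return [1.0 / sum(zero) if z else 0.0 for z in zero]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** power for dk in d) for di in d]

# A 2-term "document" vector reduced to memberships in two clusters.
centers = [(0.0, 0.0), (4.0, 0.0)]
doc = (1.0, 0.0)
print(fcm_memberships(doc, centers, m=2.0))   # crisper memberships
print(fcm_memberships(doc, centers, m=3.0))   # softer memberships
```

The memberships always sum to 1. A fuzzifier close to 1 pushes them towards a hard assignment, while larger values spread them more evenly across clusters.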
Online Hierarchical Clustering Approximations
Menon, Aditya Krishna, Rajagopalan, Anand, Sumengen, Baris, Citovsky, Gui, Cao, Qin, Kumar, Sanjiv
Hierarchical clustering is a widely used approach for clustering datasets at multiple levels of granularity. Despite its popularity, existing algorithms such as hierarchical agglomerative clustering (HAC) are limited to the offline setting, and thus require the entire dataset to be available. This prohibits their use on large datasets commonly encountered in modern learning applications. In this paper, we consider hierarchical clustering in the online setting, where points arrive one at a time. We propose two algorithms that seek to optimize the Moseley and Wang (MW) revenue function, a variant of the Dasgupta cost. These algorithms offer different tradeoffs between efficiency and MW revenue performance. The first algorithm, OTD, is a highly efficient Online Top Down algorithm which provably achieves a 1/3-approximation to the MW revenue under a data separation assumption. The second algorithm, OHAC, is an online counterpart to offline HAC, which is known to yield a 1/3-approximation to the MW revenue, and produce good quality clusters in practice. We show that OHAC approximates offline HAC by leveraging a novel split-merge procedure. We empirically show that OTD and OHAC offer significant efficiency and cluster quality gains respectively over baselines.
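For readers unfamiliar with the objective, the sketch below evaluates the Moseley and Wang revenue and the Dasgupta cost of a given binary tree. It uses the usual definitions: each pair (i, j) with similarity w_ij contributes w_ij times (n minus the number of leaves under their lowest common ancestor) to the revenue, and w_ij times that number of leaves to the cost. The nested-tuple tree encoding and the function names are assumptions for illustration. This is only an evaluator, not the OTD or OHAC algorithm.

```python
def leaves(tree):
    """Leaf labels of a tree given as nested 2-tuples with integer leaves."""
    if not isinstance(tree, tuple):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def _pair_sum(tree, w, score):
    """Sum of w[i][j] * score(size of the subtree where i and j are split)."""
    if not isinstance(tree, tuple):
        return 0.0
    left, right = tree
    l, r = leaves(left), leaves(right)
    split_weight = sum(w[i][j] for i in l for j in r)
    return (split_weight * score(len(l) + len(r))
            + _pair_sum(left, w, score) + _pair_sum(right, w, score))

def mw_revenue(tree, w):
    """Moseley-Wang revenue: pairs merged low in the tree earn more."""
    n = len(leaves(tree))
    return _pair_sum(tree, w, lambda size: n - size)

def dasgupta_cost(tree, w):
    """Dasgupta cost: pairs merged low in the tree cost less."""
    return _pair_sum(tree, w, lambda size: size)

# Three points; 0 and 1 are similar, 2 is dissimilar to both.
w = [[0.0, 1.0, 0.1],
     [1.0, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
good = ((0, 1), 2)   # merges the similar pair first
bad = ((0, 2), 1)    # merges a dissimilar pair first
print(mw_revenue(good, w), mw_revenue(bad, w))
print(dasgupta_cost(good, w), dasgupta_cost(bad, w))
```

For any tree, revenue plus cost equals n times the total pairwise similarity, so maximising one is equivalent to minimising the other, although approximation guarantees for the two objectives differ.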
Consensual aggregation of clusters based on Bregman divergences to improve predictive models
Fisher, Aurélie, Has, Sothea, Mougeot, Mathilde
A new procedure to construct predictive models in supervised learning problems by paying attention to the clustering structure of the input data is introduced. We are interested in situations where the input data consists of more than one unknown cluster, and where there exist different underlying models on these clusters. Thus, instead of constructing a single predictive model on the whole dataset, we propose to use a K-means clustering algorithm with different options of Bregman divergences, to recover the clustering structure of the input data. Then one dedicated predictive model is fit per cluster. For each divergence, we construct a simple local predictor on each observed cluster. We obtain one estimator, the collection of the K simple local predictors, per divergence, and we propose to combine them in a smart way based on a consensus idea. Several versions of consensual aggregation in both classification and regression problems are considered. A comparison of the performances of all constructed estimators on different simulated and real data assesses the excellent performance of our method. In a large variety of prediction problems, the consensual aggregation procedure outperforms all the other models.
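The clustering step can be illustrated with a Lloyd-style K-means in which the divergence is a pluggable function. The sketch relies on the known property that, for any Bregman divergence, the best cluster representative is the arithmetic mean of the cluster's points, so only the assignment step changes. The two divergences shown, the function names, and the toy data are choices made for this example. The per-cluster predictors and the consensual aggregation step are not shown.

```python
from math import log

def sq_euclidean(x, c):
    """Squared Euclidean distance, the Bregman divergence of the squared norm."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def gen_kl(x, c):
    """Generalised Kullback-Leibler divergence; needs strictly positive entries."""
    return sum(a * log(a / b) - a + b for a, b in zip(x, c))

def bregman_kmeans(points, init_centers, divergence, iters=100):
    """Lloyd-style K-means: assign by divergence(x, center), update by the mean."""
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(iters):
        # Assignment step: nearest center under the chosen divergence.
        labels = [min(range(len(centers)),
                      key=lambda k: divergence(x, centers[k]))
                  for x in points]
        # Update step: the mean is the optimal center for any Bregman divergence.
        new_centers = []
        for k, old in enumerate(centers):
            members = [x for x, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)   # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical positive-valued inputs forming two groups.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 12), (12, 10)]
init = [(1, 1), (10, 10)]
for div in (sq_euclidean, gen_kl):
    labels, centers = bregman_kmeans(points, init, div)
    print(div.__name__, labels, centers)
```

On well-separated data like this, both divergences recover the same partition. On less clear-cut data they can disagree, which is the situation where combining the resulting estimators becomes relevant.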