Clustering
Machine Learning in R & Predictive Models
My course will be your complete guide to the theory and applications of supervised and unsupervised machine learning and predictive modeling using the R programming language. Unlike other courses, it offers not only guided demonstrations of R scripts but also the theoretical background that will allow you to fully understand and apply machine learning and predictive models (K-means, Random Forest, SVM, logistic regression, etc.) in R (many R packages incl. This course also covers all the main aspects of practical and highly applied data science related to machine learning (classification and regression) and unsupervised clustering techniques. Thus, if you take this course, you will save time and money on other expensive materials in the R-based data science and machine learning domain. In this age of big data, companies across the globe use R to analyze large volumes of data for business and research.
The Pursuit of Knowledge: Discovering and Localizing Novel Categories using Dual Memory
Rambhatla, Sai Saketh, Chellappa, Rama, Shrivastava, Abhinav
We tackle object category discovery, which is the problem of discovering and localizing novel objects in a large unlabeled dataset. While existing methods show results on datasets with less cluttered scenes and fewer object instances per image, we present our results on the challenging COCO dataset. Moreover, we argue that, rather than discovering new categories from scratch, discovery algorithms can benefit from identifying what is already known and focusing their attention on the unknown. We propose a method to use prior knowledge about certain object categories to discover new categories by leveraging two memory modules, namely Working and Semantic memory. We show the performance of our detector on the COCO minival dataset to demonstrate its in-the-wild capabilities.
Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques
This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years. Such models are inspired by the observation that when children pick up a language, they rely on a wide range of indirect and noisy clues, crucially including signals from the visual modality co-occurring with spoken utterances. Several fields have made important contributions to this approach to modeling or mimicking the process of learning language: Machine Learning, Natural Language and Speech Processing, Computer Vision and Cognitive Science. The current paper brings together these contributions in order to provide a useful introduction and overview for practitioners in all these areas. We discuss the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work. We then summarize the main modeling architectures and offer an exhaustive overview of the evaluation metrics and analysis techniques.
Learning for Detecting Norm Violation in Online Communities
Santos, Thiago Freitas dos, Osman, Nardine, Schorlemmer, Marco
In this paper, we focus on normative systems for online communities. The paper addresses the issue that arises when different community members interpret these norms in different ways, possibly leading to unexpected behavior in interactions, usually with norm violations that affect the individual and community experiences. To address this issue, we propose a framework capable of detecting norm violations and providing the violator with information about the features of their action that make this action violate a norm. We build our framework using Machine Learning, with Logistic Model Trees as the classification algorithm. Since norm violations can be highly contextual, we train our model using data from the Wikipedia online community, namely data on Wikipedia edits. Our work is then evaluated with the Wikipedia use case where we focus on the norm that prohibits vandalism in Wikipedia edits.
Seeing All From a Few: Nodes Selection Using Graph Pooling for Graph Clustering
Wang, Yiming, Chang, Dongxia, Fu, Zhiqian, Zhao, Yao
Graph clustering, which aims to obtain a partition of data using the graph information, has received considerable attention in recent years. However, noisy edges and nodes in the graph may degrade the clustering results. In this paper, we propose a novel dual graph embedding network (DGEN) to improve the robustness of graph clustering to noisy nodes and edges. DGEN is designed as a two-step graph encoder connected by a graph pooling layer, which learns the graph embedding of the selected nodes. Based on the assumption that a node and its nearest neighbors should belong to the same cluster, we devise neighbor cluster pooling (NCPool) to select the most informative subset of vertices based on the clustering assignments of nodes and their nearest neighbors. This can effectively alleviate the impact of noisy edges on the clustering. After obtaining the clustering assignments of the selected nodes, a classifier is trained on these selected nodes, and the final clustering assignments for all nodes are obtained from this classifier. Experiments on three benchmark graph datasets demonstrate the method's superiority over several state-of-the-art algorithms.
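The selection idea in the abstract above can be illustrated with a small sketch. It is not the paper's NCPool, whose details the abstract does not give. It only shows the stated assumption (a node and its nearest neighbor should share a cluster) as a filter: keep the nodes whose preliminary label agrees with their nearest neighbor's label. The function name `select_consistent_nodes`, the toy labels, and the neighbor map are all hypothetical.

```python
from typing import Dict, List


def select_consistent_nodes(
    assignments: List[int],
    nearest_neighbor: Dict[int, int],
) -> List[int]:
    """Keep nodes whose cluster label matches their nearest neighbor's label.

    This is an illustrative filter only, not the NCPool layer from the paper.
    """
    selected = []
    for node, label in enumerate(assignments):
        neighbor = nearest_neighbor.get(node)
        # Nodes without a recorded neighbor are treated as unreliable and dropped.
        if neighbor is not None and assignments[neighbor] == label:
            selected.append(node)
    return selected


# Hypothetical toy graph: 6 nodes with preliminary labels and nearest neighbors.
labels = [0, 0, 1, 1, 1, 0]
nn = {0: 1, 1: 0, 2: 3, 3: 2, 4: 5, 5: 4}
kept = select_consistent_nodes(labels, nn)
print(kept)
```

In this toy case nodes 4 and 5 disagree with each other and are dropped. Following the abstract, a classifier trained on the kept nodes would then assign the final labels to every node.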
Performance evaluation results of evolutionary clustering algorithm star for clustering heterogeneous datasets
Hassan, Bryar A., Rashid, Tarik A., Mirjalili, Seyedali
This article presents the data used to evaluate the performance of evolutionary clustering algorithm star (ECA*) compared to five traditional and modern clustering algorithms. Two experimental methods are employed to examine the performance of ECA* against genetic algorithm for clustering++ (GENCLUST++), learning vector quantisation (LVQ), expectation maximisation (EM), K-means++ (KM++) and K-means (KM). These algorithms are applied to 32 heterogeneous and multi-featured datasets to determine which one performs well on the three tests. First, the paper examines the efficiency of ECA* against its counterpart algorithms using clustering evaluation measures. These validation criteria are objective function and cluster quality measures. Second, it suggests a performance rating framework to measure the performance sensitivity of these algorithms to various dataset features (cluster dimensionality, number of clusters, cluster overlap, cluster shape and cluster structure). The contributions of these experiments are twofold: (i) ECA* exceeds its counterpart algorithms in its ability to find the right number of clusters; (ii) ECA* is less sensitive to dataset features than its competitors. Nonetheless, the results of the experiments demonstrate some limitations of ECA*: (i) ECA* is not fully applied based on the premise that no prior knowledge exists; (ii) adapting and utilising ECA* on several real applications has not been achieved yet.
Flattening Multiparameter Hierarchical Clustering Functors
We bring together topological data analysis, applied category theory, and machine learning to study multiparameter hierarchical clustering. We begin by introducing a procedure for flattening multiparameter hierarchical clusterings. We demonstrate that this procedure is a functor from a category of multiparameter hierarchical partitions to a category of binary integer programs. We also include empirical results demonstrating its effectiveness. Next, we introduce a Bayesian update algorithm for learning clustering parameters from data. We demonstrate that the composition of this algorithm with our flattening procedure satisfies a consistency property.
Extending Isolation Forest for Anomaly Detection in Big Data via K-Means
Laskar, Md Tahmid Rahman, Huang, Jimmy, Smetana, Vladan, Stewart, Chris, Pouw, Kees, An, Aijun, Chan, Stephen, Liu, Lei
Industrial Information Technology (IT) infrastructures are often vulnerable to cyberattacks. To secure the computer systems in an industrial environment, effective intrusion detection systems are required to monitor the cyber-physical systems (e.g., computer networks) in the industry for malicious activities. This paper aims to build such intrusion detection systems to protect computer networks from cyberattacks. More specifically, we propose a novel unsupervised machine learning approach that combines the K-Means algorithm with the Isolation Forest for anomaly detection in industrial big data scenarios. Since our objective is to build the intrusion detection system for the big data scenario in the industrial domain, we use the Apache Spark framework to implement our proposed model, which was trained on large network traffic data (about 123 million instances of network traffic) stored in Elasticsearch. Moreover, we evaluate our proposed model on live streaming data and find that our proposed system can be used for real-time anomaly detection in the industrial setup. In addition, we address different challenges that we faced while training our model on large datasets and explicitly describe how these issues were resolved. Based on our empirical evaluation in different use cases for anomaly detection in real-world network traffic data, we observe that our proposed system is effective at detecting anomalies in big data scenarios. Finally, we evaluate our proposed model on several academic datasets to compare it with other models and find that its performance is comparable to that of other state-of-the-art approaches.
Phenotyping OSA: a time series analysis using fuzzy clustering and persistent homology
Loliencar, Prachi, Heo, Giseon
Sleep apnea is a disorder that has serious consequences for the pediatric population. There has been recent concern that traditional diagnosis of the disorder using the apnea-hypopnea index may be ineffective in capturing its multi-faceted outcomes. In this work, we take a first step in addressing this issue by phenotyping patients using a clustering analysis of airflow time series. This is approached in three ways: using feature-based fuzzy clustering in the time domain, using feature-based fuzzy clustering in the frequency domain, and using persistent homology to study the signal from a topological perspective. The fuzzy clusters are analyzed in a novel manner using a Dirichlet regression analysis, while the topological approach leverages Takens' embedding theorem to study the periodicity properties of the signals.
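The topological approach mentioned above rests on time-delay embedding, the construction behind Takens' theorem. A minimal sketch follows, assuming a scalar series, an embedding dimension `dim`, and a delay `delay`. The function name, the parameter values, and the toy signal are illustrative assumptions, not taken from the paper. Each embedded point collects `dim` samples spaced `delay` steps apart, so a periodic signal traces a closed loop in the embedded space, which persistent homology can then detect.

```python
from typing import List, Sequence


def delay_embed(series: Sequence[float], dim: int, delay: int) -> List[List[float]]:
    """Return the time-delay embedding of a scalar series.

    Row i is [x[i], x[i + delay], ..., x[i + (dim - 1) * delay]].
    """
    if dim < 1 or delay < 1:
        raise ValueError("dim and delay must be positive integers")
    # Number of complete delay vectors that fit inside the series.
    n_points = len(series) - (dim - 1) * delay
    if n_points <= 0:
        return []
    return [
        [series[i + k * delay] for k in range(dim)]
        for i in range(n_points)
    ]


# A toy periodic signal with period 4 (hypothetical data, not from the paper).
signal = [0.0, 1.0, 0.0, -1.0] * 3
points = delay_embed(signal, dim=2, delay=1)
print(points[:4])
```

For this toy signal the embedded points cycle through four distinct positions arranged around the origin, the kind of loop structure a one-dimensional persistence diagram is meant to pick up.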
Bridging observation, theory and numerical simulation of the ocean using Machine Learning
Sonnewald, Maike, Lguensat, Redouane, Jones, Daniel C., Dueben, Peter D., Brajard, Julien, Balaji, Venkatramani
Progress within physical oceanography has been concurrent with the increasing sophistication of tools available for its study. The incorporation of machine learning (ML) techniques offers exciting possibilities for advancing the capacity and speed of established methods and also for making substantial and serendipitous discoveries. Beyond vast amounts of complex data ubiquitous in many modern scientific fields, the study of the ocean poses a combination of unique challenges that ML can help address. The observational data available is largely spatially sparse, limited to the surface, and with few time series spanning more than a handful of decades. Important timescales span seconds to millennia, with strong scale interactions and numerical modelling efforts complicated by details such as coastlines. This review covers the current scientific insight offered by applying ML and points to where there is imminent potential. We cover the main three branches of the field: observations, theory, and numerical modelling. Highlighting both challenges and opportunities, we discuss both the historical context and salient ML tools. We focus on the use of ML in situ sampling and satellite observations, and the extent to which ML applications can advance theoretical oceanographic exploration, as well as aid numerical simulations. Applications that are also covered include model error and bias correction and current and potential use within data assimilation. While not without risk, there is great interest in the potential benefits of oceanographic ML applications; this review caters to this interest within the research community.