Goto

Collaborating Authors

 Clustering


Mint: MDL-based approach for Mining INTeresting Numerical Pattern Sets

arXiv.org Artificial Intelligence

Pattern mining is well established in data mining research, especially for mining binary datasets. Surprisingly, there is much less work about numerical pattern mining and this research area remains under-explored. In this paper, we propose Mint, an efficient MDL-based algorithm for mining numerical datasets. The MDL principle is a robust and reliable framework widely used in pattern mining, and as well in subgroup discovery. In Mint we reuse MDL for discovering useful patterns and returning a set of non-redundant overlapping patterns with well-defined boundaries and covering meaningful groups of objects. Mint is not alone in the category of numerical pattern miners based on MDL. In the experiments presented in the paper we show that Mint outperforms competitors among which Slim and RealKrimp.


A Data-Driven Study of Commonsense Knowledge using the ConceptNet Knowledge Base

arXiv.org Artificial Intelligence

Acquiring commonsense knowledge and reasoning is recognized as an important frontier in achieving general Artificial Intelligence (AI). Recent research in the Natural Language Processing (NLP) community has demonstrated significant progress in this problem setting. Despite this progress, which is mainly on multiple-choice question answering tasks in limited settings, there is still a lack of understanding (especially at scale) of the nature of commonsense knowledge itself. In this paper, we propose and conduct a systematic study to enable a deeper understanding of commonsense knowledge by doing an empirical and structural analysis of the ConceptNet knowledge base. ConceptNet is a freely available knowledge base containing millions of commonsense assertions presented in natural language. Detailed experimental results on three carefully designed research questions, using state-of-the-art unsupervised graph representation learning ('embedding') and clustering techniques, reveal deep substructures in ConceptNet relations, allowing us to make data-driven and computational claims about the meaning of phenomena such as 'context' that are traditionally discussed only in qualitative terms. Furthermore, our methodology provides a case study in how to use data-science and computational methodologies for understanding the nature of an everyday (yet complex) psychological phenomenon that is an essential feature of human intelligence.


A Temporal Neural Network Architecture for Online Learning

arXiv.org Artificial Intelligence

A long-standing proposition is that by emulating the operation of the brain's neocortex, a spiking neural network (SNN) can achieve similar desirable features: flexible learning, speed, and efficiency. Temporal neural networks (TNNs) are SNNs that communicate and process information encoded as relative spike times (in contrast to spike rates). A TNN architecture is proposed, and, as a proof-of-concept, TNN operation is demonstrated within the larger context of online supervised classification. First, through unsupervised learning, a TNN partitions input patterns into clusters based on similarity. The TNN then passes a cluster identifier to a simple online supervised decoder which finishes the classification task. The TNN learning process adjusts synaptic weights by using only signals local to each synapse, and clustering behavior emerges globally. The system architecture is described at an abstraction level analogous to the gate and register transfer levels in conventional digital design. Besides features of the overall architecture, several TNN components are new to this work. Although not addressed directly, the overall research objective is a direct hardware implementation of TNNs. Consequently, all the architecture elements are simple, and processing is done at very low precision. Importantly, low precision leads to very fast learning times. Simulation results using the time-honored MNIST dataset demonstrate learning times at least an order of magnitude faster than other online approaches while providing similar error rates.


Clustering with missing data: which equivalent for Rubin's rules?

arXiv.org Machine Learning

Multiple imputation (MI) is a popular method for dealing with missing values. However, the suitable way for applying clustering after MI remains unclear: how to pool partitions? How to assess the clustering instability when data are incomplete? By answering both questions, this paper proposed a complete view of clustering with missing data using MI. The problem of partitions pooling is here addressed using consensus clustering while, based on the bootstrap theory, we explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and instability assessment are theoretically argued and extensively studied by simulation. Partitions pooling improves accuracy while measuring instability with missing data enlarges the data analysis possibilities: it allows assessment of the dependence of the clustering to the imputation model, as well as a convenient way for choosing the number of clusters when data are incomplete, as illustrated on a real data set.


Comparative Analysis of Extreme Verification Latency Learning Algorithms

arXiv.org Artificial Intelligence

One of the more challenging real-world problems in computational intelligence is to learn from non-stationary streaming data, also known as concept drift. Perhaps even a more challenging version of this scenario is when -- following a small set of initial labeled data -- the data stream consists of unlabeled data only. Such a scenario is typically referred to as learning in initially labeled nonstationary environment, or simply as extreme verification latency (EVL). Because of the very challenging nature of the problem, very few algorithms have been proposed in the literature up to date. This work is a very first effort to provide a review of some of the existing algorithms (important/prominent) in this field to the research community. More specifically, this paper is a comprehensive survey and comparative analysis of some of the EVL algorithms to point out the weaknesses and strengths of different approaches from three different perspectives: classification accuracy, computational complexity and parameter sensitivity using several synthetic and real world datasets.


TabPy: Combining Python and Tableau - KDnuggets

#artificialintelligence

Can we integrate the power of Python calculation with a Tableau? That question was encourage me to start exploring the possibility of using Python calculation in Tableau, and I ended up with a TabPy. How can we use TabPy to integrating Python and Tableau? In this article, I will introduce TabPy and go through an example of how we can use it. TabPy is an Analytics Extension from Tableau which enables us as a user to execute Python scripts and saved functions using Tableau.


Airbnb Data Exploration with K-Means Clustering from Scratch

#artificialintelligence

Before we dive into the Airbnb dataset and our findings, let's do an in-depth review of the K-Means clustering algorithm. An unsupervised learner receives unlabeled training data and makes predictions for unseen points. Clustering analysis falls under unsupervised learning. A cluster is a collection of data objects that are similar (or related) to one another within the same group, or dissimilar (or unrelated) to the objects in other groups. "Good" clusters will have the following: high intra-class similarity (cohesiveness within clusters) and low inter-class similarity (distinctive between clusters). The K-Means algorithm is common a type of clustering analysis.


Automatic Clustering for Unsupervised Risk Diagnosis of Vehicle Driving for Smart Road

arXiv.org Artificial Intelligence

Early risk diagnosis and driving anomaly detection from vehicle stream are of great benefits in a range of advanced solutions towards Smart Road and crash prevention, although there are intrinsic challenges, especially lack of ground truth, definition of multiple risk exposures. This study proposes a domain-specific automatic clustering (termed Autocluster) to self-learn the optimal models for unsupervised risk assessment, which integrates key steps of risk clustering into an auto-optimisable pipeline, including feature and algorithm selection, hyperparameter auto-tuning. Firstly, based on surrogate conflict measures, indicator-guided feature extraction is conducted to construct temporal-spatial and kinematical risk features. Then we develop an elimination-based model reliance importance (EMRI) method to unsupervised-select the useful features. Secondly, we propose balanced Silhouette Index (bSI) to evaluate the internal quality of imbalanced clustering. A loss function is designed that considers the clustering performance in terms of internal quality, inter-cluster variation, and model stability. Thirdly, based on Bayesian optimisation, the algorithm selection and hyperparameter auto-tuning are self-learned to generate the best clustering partitions. Various algorithms are comprehensively investigated. Herein, NGSIM vehicle trajectory data is used for test-bedding. Findings show that Autocluster is reliable and promising to diagnose multiple distinct risk exposures inherent to generalised driving behaviour. Besides, we also delve into risk clustering, such as, algorithms heterogeneity, Silhouette analysis, hierarchical clustering flows, etc. Meanwhile, the Autocluster is also a method for unsupervised multi-risk data labelling and indicator threshold calibration. Furthermore, Autocluster is useful to tackle the challenges in imbalanced clustering without ground truth or priori knowledge



Double Self-weighted Multi-view Clustering via Adaptive View Fusion

arXiv.org Artificial Intelligence

Multi-view clustering has been applied in many real-world applications where original data often contain noises. Some graph-based multi-view clustering methods have been proposed to try to reduce the negative influence of noises. However, previous graph-based multi-view clustering methods treat all features equally even if there are redundant features or noises, which is obviously unreasonable. In this paper, we propose a novel multi-view clustering framework Double Self-weighted Multi-view Clustering (DSMC) to overcome the aforementioned deficiency. DSMC performs double self-weighted operations to remove redundant features and noises from each graph, thereby obtaining robust graphs. For the first self-weighted operation, it assigns different weights to different features by introducing an adaptive weight matrix, which can reinforce the role of the important features in the joint representation and make each graph robust. For the second self-weighting operation, it weights different graphs by imposing an adaptive weight factor, which can assign larger weights to more robust graphs. Furthermore, by designing an adaptive multiple graphs fusion, we can fuse the features in the different graphs to integrate these graphs for clustering. Experiments on six real-world datasets demonstrate its advantages over other state-of-the-art multi-view clustering methods.