Clustering
Clustering piecewise stationary processes
Khaleghi, Azadeh, Ryabko, Daniil
Clustering, in an informal sense, involves breaking a dataset into possibly disjoint subsets called clusters where the elements within the same cluster are somehow more similar to each other than to those in other clusters. This task, which is often a first step in data-analysis, is meant to help with the initial steps to making sense of the data that typically have complex structures and represent some unknown underlying phenomena to be inferred. Given the nature of the problem, it is desirable to make as little assumptions as possible about the underlying mechanisms generating the data. Moreover, the minimal assumptions made should ideally be qualitative and easily verifiable from an application's perspective. In this paper we consider a subclass of the clustering problem where each data-point is a time series. Indeed, such sequential data are ubiquitous in modern applications involving, for example, user-behaviour, social networks, as well as financial or biological data, where the observations are sequential by nature, and/or are collected over time. The common features in these real-world datasets are the absence of precise models as well as an abundance of data. From a mathematical perspective, a learning problem involving sequential data can be formulated as follows. Given sequences of the form Y,...,Y the aim is to make inference
From Multi-modal Property Dataset to Robot-centric Conceptual Knowledge About Household Objects
Thosar, Madhura, Mueller, Christian A., Jaeger, Georg, Schleiss, Johannes, Pulugu, Narender, Chennaboina, Ravi Mallikarjun, Jeevangekar, Sai Vivek, Birk, Andreas, Pfingsthorn, Max, Zug, Sebastian
Tool-use applications in robotics require conceptual knowledge about objects for informed decision making and object interactions. State-of-the-art methods employ hand-crafted symbolic knowledge which is defined from a human perspective and grounded into sensory data afterwards. However, due to different sensing and acting capabilities of robots, their conceptual understanding of objects must be generated from a robot's perspective entirely, which asks for robot-centric conceptual knowledge about objects. With this goal in mind, this article motivates that such knowledge should be based on physical and functional properties of objects. Consequently, a selection of ten properties is defined and corresponding extraction methods are proposed. This multi-modal property extraction forms the basis on which our second contribution, a robot-centric knowledge generation is build on. It employs unsupervised clustering methods to transform numerical property data into symbols, and Bivariate Joint Frequency Distributions and Sample Proportion to generate conceptual knowledge about objects using the robot-centric symbols. A preliminary implementation of the proposed framework is employed to acquire a dataset comprising physical and functional property data of 110 houshold objects. This Robot-Centric dataSet (RoCS) is used to evaluate the framework regarding the property extraction methods, the semantics of the considered properties within the dataset and its usefulness in real-world applications such as tool substitution.
Versatile linkage: a family of space-conserving strategies for agglomerative hierarchical clustering
Agglomerative hierarchical clustering can be implemented with several strategies that differ in the way elements of a collection are grouped together to build a hierarchy of clusters. Here we introduce versatile linkage, a new infinite system of agglomerative hierarchical clustering strategies based on generalized means, which go from single linkage to complete linkage, passing through arithmetic average linkage and other clustering methods yet unexplored such as geometric linkage and harmonic linkage. We compare the different clustering strategies in terms of cophenetic correlation, mean absolute error, and also tree balance and space distortion, two new measures proposed to describe hierarchical trees. Unlike the β-flexible clustering system, we show that the versatile linkage family is space-conserving.
Data-driven prediction of vortex-induced vibration response of marine risers subjected to three-dimensional current
Riemer-Sørensen, Signe, Wu, Jie, Lie, Halvor, Sævik, Svein, Kim, Sang-Woo
Slender marine structures such as deep-water marine risers are subjected to currents and will normally experience Vortex Induced Vibrations (VIV), which can cause fast accumulation of fatigue damage. The ocean current is often three-dimensional (3D), i.e., the direction and magnitude of the current vary throughout the water column. Today, semi-empirical tools are used by the industry to predict VIV induced fatigue on risers. The load model and hydrodynamic parameters in present VIV prediction tools are developed based on two-dimensional (2D) flow conditions, as it is challenging to consider the effect of 3D flow along the risers. Accordingly, the current profiles must be purposely made 2D during the design process, which leads to significant uncertainty in the prediction results. Further, due to the limitations in the laboratory, VIV model tests are mostly carried out under 2D flow conditions and thus little experimental data exist to document VIV response of riser subjected to varying directions of the current. However, a few experiments have been conducted with 3D current. We have used results from one of these experiments to investigate how well 1) traditional and 2) an alternative method based on a data driven prediction can describe VIV in 3D currents. Data driven modelling is particularly suited for complicated problems with many parameters and non-linear relationships. We have applied a data clustering algorithm to the experimental 3D flow data in order to identify measurable parameters that can influence responses. The riser responses are grouped based on their statistical characteristics, which relate to the direction of the flow. Furthermore we fit a random forest regression model to the measured VIV response and compare its performance with the predictions of existing VIV prediction tools (VIVANA-FD).
Density-based Clustering with Best-scored Random Forest
Hang, Hanyuan, Cai, Yuchao, Yang, Hanfang
Regarded as one of the most basic tools to investigate statistical properties of unsupervised data, clustering aims to group a set of objects in such a way that objects in the same cluster are more similar in some sense to each other than to those in other clusters. Typical application possibilities are to be found reaching from categorization of tissues in medical imaging to grouping internet searching results. For instance, on PET scans, cluster analysis can distinguish between different types of tissue in a three-dimensional image for many different purposes (Filipovych et al., 2011) while in the process of intelligent grouping of the files and websites, clustering algorithms create a more relevant set of search results (Marco and Navigli, 2013). Because of their wide applications, more urgent requirements for clustering algorithms that not only maintain desirable prediction accuracy but also have high computational efficiency are raised.
DLIME: A Deterministic Local Interpretable Model-Agnostic Explanations Approach for Computer-Aided Diagnosis Systems
Zafar, Muhammad Rehman, Khan, Naimul Mefraz
Local Interpretable Model-Agnostic Explanations (LIME) is a popular technique used to increase the interpretability and explainability of black box Machine Learning (ML) algorithms. LIME typically generates an explanation for a single prediction by any ML model by learning a simpler interpretable model (e.g. linear classifier) around the prediction through generating simulated data around the instance by random perturbation, and obtaining feature importance through applying some form of feature selection. While LIME and similar local algorithms have gained popularity due to their simplicity, the random perturbation and feature selection methods result in "instability" in the generated explanations, where for the same prediction, different explanations can be generated. This is a critical issue that can prevent deployment of LIME in a Computer-Aided Diagnosis (CAD) system, where stability is of utmost importance to earn the trust of medical professionals. In this paper, we propose a deterministic version of LIME. Instead of random perturbation, we utilize agglomerative Hierarchical Clustering (HC) to group the training data together and K-Nearest Neighbour (KNN) to select the relevant cluster of the new instance that is being explained. After finding the relevant cluster, a linear model is trained over the selected cluster to generate the explanations. Experimental results on three different medical datasets show the superiority for Deterministic Local Interpretable Model-Agnostic Explanations (DLIME), where we quantitatively determine the stability of DLIME compared to LIME utilizing the Jaccard similarity among multiple generated explanations.
Learning with fuzzy hypergraphs: a topical approach to query-oriented text summarization
Van Lierde, Hadrien, Chow, Tommy W. S.
Existing graph-based methods for extractive document summarization represent sentences of a corpus as the nodes of a graph or a hypergraph in which edges depict relationships of lexical similarity between sentences. Such approaches fail to capture semantic similarities between sentences when they express a similar information but have few words in common and are thus lexically dissimilar. To overcome this issue, we propose to extract semantic similarities based on topical representations of sentences. Inspired by the Hierarchical Dirichlet Process, we propose a probabilistic topic model in order to infer topic distributions of sentences. As each topic defines a semantic connection among a group of sentences with a certain degree of membership for each sentence, we propose a fuzzy hypergraph model in which nodes are sentences and fuzzy hyperedges are topics. To produce an informative summary, we extract a set of sentences from the corpus by simultaneously maximizing their relevance to a user-defined query, their centrality in the fuzzy hypergraph and their coverage of topics present in the corpus. We formulate a polynomial time algorithm building on the theory of submodular functions to solve the associated optimization problem. A thorough comparative analysis with other graph-based summarization systems is included in the paper. Our obtained results show the superiority of our method in terms of content coverage of the summaries.
Flattening a Hierarchical Clustering through Active Learning
Gentile, Claudio, Vitale, Fabio, Rajagopalan, Anand
We investigate active learning by pairwise similarity over the leaves of trees originating from hierarchical clustering procedures. In the realizable setting, we provide a full characterization of the number of queries needed to achieve perfect reconstruction of the tree cut. In the non-realizable setting, we rely on known important-sampling procedures to obtain regret and query complexity bounds. Our algorithms come with theoretical guarantees on the statistical error and, more importantly, lend themselves to linear-time implementations in the relevant parameters of the problem. We discuss such implementations, prove running time guarantees for them, and present preliminary experiments on real-world datasets showing the compelling practical performance of our algorithms as compared to both passive learning and simple active learning baselines.
Versatile linkage: a family of space-conserving strategies for agglomerative hierarchical clustering
Fernández, Alberto, Gómez, Sergio
Abstract: Agglomerative hierarchical clustering can be implemented with several strategies that differ in the way elements of a collection are grouped together to build a hierarchy of clusters. Here we introduce versatile linkage, a new infinite system of agglomerative hierarchical clustering strategies based on generalized means, which go from single linkage to complete linkage, passing through arithmetic average linkage and other clustering methods yet unexplored such as geometric linkage and harmonic linkage. We compare the different clustering strategies in terms of cophenetic correlation, mean absolute error, and also tree balance and space distortion, two new measures proposed to describe hierarchical trees. Unlike the β-flexible clustering system, we show that the versatile linkage family is space-conserving.
Customer Segmentation of Wireless Trajectory Data
Karlsen, Matthew R, Moschoyiannis, Sotiris K.
Wireless trajectory data consists of a number of (time, point) entries where each point is associated with a particular wireless device (WAP or BLE beacon) tied to a location identifier, such as a place name. A trajectory relates to a particular mobile device. Such data can be clustered `semantically' to identify similar trajectories, where similarity relates to non-geographic characteristics such as the type of location visited. Here we present a new approach to semantic trajectory clustering for such data. The approach is applicable to interpreting data that does not contain geographical coordinates, and thus contributes to the current literature on semantic trajectory clustering. The literature does not appear to provide such an approach, instead focusing on trajectory data where latitude and longitude data is available. We apply the techniques developed above in the context of the Onward Journey Planner Application, with the motivation of providing on-line recommendations for onward journey options in a context-specific manner. The trajectories analysed indicate commute patterns on the London Underground. Points are only recorded for communication with WAP and BLE beacons within the rail network. This context presents additional challenge since the trajectories are `truncated', with no true origin and destination details. In the above context we find that there are a range of travel patterns in the data, without the existence of distinct clusters. Suggestions are made concerning how to approach the problem of provision of on-line recommendations with such a data set. Thoughts concerning the related problem of prediction of journey route and destination are also provided.