AITopics

Driving behaviour has a great impact on road safety. A popular way of analysing driving behaviour is to focus on manoeuvres, as they give useful information about the driver who is performing them. In this paper, we investigate a new way of identifying manoeuvres from vehicle telematics data, through motif detection in time series. We implement a modified version of the Extended Motif Discovery (EMD) algorithm, a classical variable-length motif detection algorithm for time series, and apply it to the UAH-DriveSet, a publicly available naturalistic driving dataset. After a systematic exploration of the extracted motifs, we conclude that the EMD algorithm is capable of extracting not only simple manoeuvres such as accelerations, brakes and curves, but also more complex manoeuvres, such as lane changes and overtaking manoeuvres, which validates motif discovery as a worthwhile line for future research.

algorithm, manoeuvre, motif, (13 more...)

doi: 10.1016/j.aap.2020.105467

2002.04127

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Portugal > Lisbon > Lisbon (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > Spain > Galicia > Madrid (0.04)

Genre: Research Report (0.82)

Industry: Transportation > Ground > Road (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Community Detection on Mixture Multi-layer Networks via Regularized Tensor Decomposition

Jing, Bing-Yi, Li, Ting, Lyu, Zhongyuan, Xia, Dong

We study the problem of community detection in multi-layer networks, where pairs of nodes can be related in multiple modalities. We introduce a general framework, the mixture multi-layer stochastic block model (MMSBM), which includes many earlier models as special cases. We propose a tensor-based algorithm (TWIST) to reveal both the global/local memberships of nodes and the memberships of layers. We show that the TWIST procedure can accurately detect the communities with small misclassification error as the number of nodes and/or the number of layers increases. Numerical studies confirm our theoretical findings. To the best of our knowledge, this is the first systematic study of mixture multi-layer networks using tensor decomposition. The method is applied to two real datasets, worldwide trading networks and malaria parasite gene networks, yielding new and interesting findings.

community structure, log 2, probability, (13 more...)

2002.04457

Country:

Asia > Middle East > Qatar (0.14)
Asia > Middle East > Oman (0.14)
Europe > Switzerland (0.04)
(90 more...)

Genre: Research Report > New Finding (0.45)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Brunet-Saumard, Camille, Genetay, Edouard, Saumard, Adrien

K-bMOM: a robust Lloyd-type clustering algorithm based on bootstrap Median-of-Means

Data scientists nowadays have to deal with massive and complex datasets that are often corrupted by outliers. Classical data mining procedures such as K-means, or more general EM algorithms, are however sensitive to the presence of outliers, which can force time-consuming pre-processing of the data. In this context, robust versions of data mining procedures are particularly relevant, and we investigate a way to produce a Lloyd-type algorithm for hard clustering that is robust to the presence of outliers. To do this, we propose to use a variant of median-of-means (MOM) statistics that we call bootstrap median-of-means (bMOM). The MOM principle has been the object of recent active research in mean estimation, regression, high-dimensional frameworks, and also supervised classification and machine learning ([17, 9, 15, 16, 19, 18, 20, 22]). Note that other approaches to robustness for K-means exist in the literature, such as K-median or trimmed K-means (see for instance the survey [10] and references therein; see also [5]). Given a dataset, the bootstrap median-of-means consists in first generating a (large) bootstrap sample and then performing a classical median-of-means on this bootstrap sample. We prove in Section 2 that if enough blocks are generated from the bootstrap sampling, then for a fixed block size, bMOM has a higher breakdown point than MOM.
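The difference between the two estimators can be sketched in a few lines of Python. This is only an illustration for one-dimensional mean estimation, not the K-bMOM clustering algorithm itself; the function names, the block counts, and the choice to draw each bootstrap block directly with replacement are assumptions made for the example.

```python
import random
import statistics


def median_of_means(values, n_blocks, rng):
    """Classical MOM: shuffle, cut into disjoint blocks, take the median of the block means."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    block_size = len(shuffled) // n_blocks
    block_means = [
        statistics.fmean(shuffled[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]
    return statistics.median(block_means)


def bootstrap_median_of_means(values, n_blocks, block_size, rng):
    """bMOM sketch (assumed form): every block is resampled with replacement,
    so the number of blocks is no longer capped by len(values) // block_size."""
    block_means = [
        statistics.fmean(rng.choices(values, k=block_size))
        for _ in range(n_blocks)
    ]
    return statistics.median(block_means)


rng = random.Random(0)
# 200 standard-normal points plus 5 gross outliers (synthetic, illustrative data).
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [1000.0] * 5

plain_mean = statistics.fmean(data)
mom = median_of_means(data, n_blocks=21, rng=rng)
bmom = bootstrap_median_of_means(data, n_blocks=201, block_size=9, rng=rng)
```

On this toy sample the plain mean is dragged far from zero by the outliers, while both block-based estimates stay close to it, because fewer than half of the blocks contain an outlier.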

algorithm, breakdown point, outlier, (15 more...)

2002.03899

Country:

North America > United States > New Jersey > Hudson County > Hoboken (0.04)
Europe > France > Brittany > Ille-et-Vilaine > Rennes (0.04)
North America > United States > New York (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Coppola, Craigory, Elgazzar, Heba

Novel Machine Learning Algorithms for Centrality and Cliques Detection in Youtube Social Networks

The goal of this research project is to analyze the dynamics of social networks using machine learning techniques to locate maximal cliques and to find clusters for the purpose of identifying a target demographic. Unsupervised machine learning techniques are designed and implemented in this project to analyze a dataset from YouTube to discover communities in the social network and find central nodes. Different clustering algorithms are implemented and applied to the YouTube dataset. The well-known Bron-Kerbosch algorithm is used effectively in this research to find maximal cliques. The results obtained from this research could be used for advertising purposes and for building smart recommendation systems. All algorithms were implemented using the Python programming language. The experimental results show that we were able to successfully find central nodes through clique centrality and degree centrality. By utilizing clique detection algorithms, the research shows how machine learning algorithms can detect close-knit groups within a larger network.
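A minimal sketch of Bron-Kerbosch with pivoting is given below. The toy graph is invented for illustration, and the "clique count" score (how many maximal cliques contain a node) is only an assumed stand-in for the clique centrality mentioned in the abstract, whose exact definition is not given here.

```python
from collections import Counter


def bron_kerbosch(adj):
    """Yield every maximal clique of an undirected graph {node: set(neighbours)}."""

    def expand(r, p, x):
        # r: current clique, p: candidates that can extend it, x: already explored.
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Hypothetical toy network: a triangle a-b-c with a tail c-d-e.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

cliques = set(bron_kerbosch(adj))
degree = {v: len(neighbours) for v, neighbours in adj.items()}
# Assumed illustrative score: number of maximal cliques each node belongs to.
clique_count = Counter(v for clique in cliques for v in clique)
```

In this graph node `c` has the highest degree, and `c` and `d` each sit in two maximal cliques, which is the kind of signal a centrality ranking would pick up.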

centrality, dataset, node, (13 more...)

2002.03893

Country:

North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
North America > United States > Arizona (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Services (0.92)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Richard, Guillaume, Grossin, Benoît, Germaine, Guillaume, Hébrail, Georges, de Moliner, Anne

Autoencoder-based time series clustering with energy applications

Time series clustering is a challenging task due to the specific nature of the data. Classical approaches do not perform well and need to be adapted either through a new distance measure or a data transformation. In this paper we investigate the combination of a convolutional autoencoder and a k-medoids algorithm to perform time series clustering. The convolutional autoencoder extracts meaningful features and reduces the dimension of the data, which improves the subsequent clustering. Using simulated and energy-related data to validate the approach, experimental results show that the clustering is robust to outliers, thus leading to finer clusters than with standard methods.
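The clustering half of this pipeline can be sketched as a simple alternating k-medoids routine. The sketch assumes the autoencoder has already mapped each time series to a low-dimensional feature vector; the `features` values below are made up, and the update rule is a basic alternate-and-reassign scheme rather than whatever variant the authors used.

```python
import math


def k_medoids(points, initial_medoids, max_iter=100):
    """Alternate between assigning points to the nearest medoid and
    re-picking each cluster's medoid as its most central member."""
    medoids = list(initial_medoids)
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(medoids)), key=lambda c: math.dist(p, points[medoids[c]]))
            for p in points
        ]
        new_medoids = []
        for c in range(len(medoids)):
            members = [i for i, lab in enumerate(labels) if lab == c]
            # The medoid is the member with the smallest total distance to the others.
            best = min(
                members,
                key=lambda i: sum(math.dist(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Hypothetical 2-D encoder outputs for six time series (illustrative values).
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
medoids, labels = k_medoids(features, initial_medoids=[0, 1])
```

Because medoids are actual data points, a single extreme series cannot pull a cluster centre away from the bulk of its members the way it can with a mean.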

autoencoder, outlier, time series, (14 more...)

2002.03624

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Mexico > Yucatán > Mérida (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

A fast and efficient Modal EM algorithm for Gaussian mixtures

Scrucca, Luca

In the modal approach to clustering, clusters are defined as the local maxima of the underlying probability density function, where the latter can be estimated either non-parametrically or using finite mixture models. Thus, clusters are closely related to certain regions around the density modes, and every cluster corresponds to a bump of the density. The Modal EM algorithm is an iterative procedure that can identify the local maxima of any density function. In this contribution, we propose a fast and efficient Modal EM algorithm to be used when the density function is estimated through a finite mixture of Gaussian distributions with parsimonious component-covariance structures. After describing the procedure, we apply the proposed Modal EM algorithm on both simulated and real data examples, showing its high flexibility in several contexts.
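For intuition, the basic Modal EM fixed-point iteration for a one-dimensional Gaussian mixture can be sketched as follows. This is the standard update (posterior weights in the E-step, precision-weighted average of the component means in the M-step), not the accelerated variant proposed in the paper; the mixture parameters and starting points are illustrative assumptions.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def modal_em(x, weights, means, sigmas, tol=1e-10, max_iter=1000):
    """Climb from x to a local maximum of a 1-D Gaussian mixture density."""
    for _ in range(max_iter):
        # E-step: posterior probability of each component at the current x.
        dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
        total = sum(dens)
        post = [d / total for d in dens]
        # M-step: maximise sum_k post_k * log N(x | mu_k, sigma_k^2) over x.
        num = sum(p * m / s ** 2 for p, m, s in zip(post, means, sigmas))
        den = sum(p / s ** 2 for p, s in zip(post, sigmas))
        x_new = num / den
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


# Hypothetical two-component mixture with well-separated components.
weights, means, sigmas = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]
mode_left = modal_em(1.0, weights, means, sigmas)
mode_right = modal_em(4.2, weights, means, sigmas)
```

Running the iteration from every data point and grouping points by the mode they reach gives the modal clustering described above.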

algorithm, gaussian mixture, mem algorithm, (16 more...)

2002.036

Country:

Europe > Austria > Vienna (0.14)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Italy > Umbria > Perugia Province > Perugia (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

arXiv.org Artificial Intelligence, Feb-9-2020

Fair Correlation Clustering

Ahmadi, Saba, Galhotra, Sainyam, Saha, Barna, Schwartz, Roy

In this paper we study the problem of correlation clustering under fairness constraints. In the classic correlation clustering problem, we are given a complete graph where each edge is labeled positive or negative. The goal is to obtain a clustering of the vertices that minimizes disagreements -- the number of negative edges trapped inside a cluster plus positive edges between different clusters. We consider two variations of the fairness constraint for the problem of correlation clustering where each node has a color, and the goal is to form clusters that do not over-represent vertices of any color. The first variant aims to generate clusters with minimum disagreements, where the distribution of a feature (e.g. gender) in each cluster is the same as the global distribution. For the case of two colors when the desired ratio of the number of colors in each cluster is $1:p$, we get an $\mathcal{O}(p^2)$-approximation algorithm. Our algorithm could be extended to the case of multiple colors. We prove this problem is NP-hard. The second variant considers relative upper and lower bounds on the number of nodes of any color in a cluster. The goal is to avoid violating upper and lower bounds corresponding to each color in each cluster while minimizing the total number of disagreements. Along with our theoretical results, we show the effectiveness of our algorithm to generate fair clusters by empirical evaluation on real-world data sets.
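The objective and the first fairness constraint can be made concrete with a small sketch. It only evaluates a given clustering (it is not the approximation algorithm from the paper); the graph, the colors, and the helper names are invented for the example.

```python
from collections import Counter
from itertools import combinations


def disagreements(positive_edges, clustering):
    """Count negative edges inside clusters plus positive edges across clusters.
    The graph is complete: any pair not listed in positive_edges is negative."""
    positive = {frozenset(e) for e in positive_edges}
    total = 0
    for u, v in combinations(sorted(clustering), 2):
        same_cluster = clustering[u] == clustering[v]
        is_positive = frozenset((u, v)) in positive
        if same_cluster != is_positive:
            total += 1
    return total


def is_proportionally_fair(colors, clustering):
    """True if every cluster has exactly the global color distribution."""
    global_counts = Counter(colors.values())
    n = len(colors)
    clusters = {}
    for node, cluster_id in clustering.items():
        clusters.setdefault(cluster_id, []).append(node)
    for members in clusters.values():
        local = Counter(colors[v] for v in members)
        for color, count in global_counts.items():
            # Compare ratios by cross-multiplication to stay in integers.
            if local[color] * n != count * len(members):
                return False
    return True


# Hypothetical instance: two positive triangles joined by one positive edge.
positive_edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (1, 4)]
colors = {1: "red", 2: "red", 3: "red", 4: "blue", 5: "blue", 6: "blue"}

split = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
merged = {v: 0 for v in colors}

split_cost = disagreements(positive_edges, split)
merged_cost = disagreements(positive_edges, merged)
```

Here the split clustering has the lowest cost but each cluster is single-colored, while the merged clustering is fair but pays for every negative edge, which illustrates the trade-off a fair algorithm has to manage.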

algorithm, disagreement, node, (17 more...)

2002.03508

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Maryland (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

arXiv.org Machine Learning, Feb-8-2020

Improving S&P stock prediction with time series stock similarity

Sidi, Lior

Stock market prediction with forecasting algorithms is a popular topic these days, yet most forecasting algorithms train only on data collected for a particular stock. In this paper, we enrich the stock data with related stocks, just as a professional trader would, to improve the stock prediction models. We tested five different similarity functions and found that co-integration similarity gave the largest improvement to the prediction model. We evaluate the models on seven S&P stocks from various industries over a five-year period. The prediction model we trained on similar stocks achieved significantly better results, with a mean accuracy of 0.55 and a profit of 19.782, compared to the state-of-the-art model with an accuracy of 0.52 and a profit of 6.6.

configuration, prediction, similarity, (16 more...)

2002.05784

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.68)

arXiv.org Machine Learning, Feb-7-2020

Fast Kernel k-means Clustering Using Incomplete Cholesky Factorization

Chen, Li, Zhou, Shuisheng, Ma, Jiajun

Kernel-based clustering algorithms can identify and capture the non-linear structure in datasets, and thereby achieve better performance than linear clustering. However, computing and storing the entire kernel matrix requires so much memory that it is difficult for kernel-based clustering to deal with large-scale datasets. In this paper, we employ incomplete Cholesky factorization to accelerate kernel clustering and save memory space. The key idea of the proposed kernel $k$-means clustering using incomplete Cholesky factorization is that we approximate the entire kernel matrix by the product of a low-rank matrix and its transpose. Then linear $k$-means clustering is applied to the columns of the transpose of the low-rank matrix. We show both analytically and empirically that the performance of the proposed algorithm is similar to that of the kernel $k$-means clustering algorithm, but our method can deal with large-scale datasets.
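The low-rank step can be sketched with a pivoted incomplete Cholesky factorization that only ever evaluates the kernel columns of the chosen pivots. This is a generic textbook form under assumed settings (RBF kernel, `gamma=0.5`, greedy pivoting on the largest residual diagonal), not necessarily the exact procedure of the paper; the rows of `G` are what a linear $k$-means would then be run on.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def incomplete_cholesky(points, kernel, rank, tol=1e-8):
    """Return G (n rows, at most `rank` columns) such that K is approximately G G^T."""
    n = len(points)
    diag = [kernel(p, p) for p in points]  # residual diagonal of K - G G^T
    G = [[] for _ in range(n)]
    for _ in range(rank):
        pivot = max(range(n), key=diag.__getitem__)
        if diag[pivot] <= tol:
            break  # the remaining residual is negligible
        root = math.sqrt(diag[pivot])
        pivot_row = list(G[pivot])
        column = []
        for i in range(n):
            k_ip = kernel(points[i], points[pivot])
            dot = sum(a * b for a, b in zip(G[i], pivot_row))
            column.append((k_ip - dot) / root)
        for i in range(n):
            G[i].append(column[i])
            diag[i] -= column[i] ** 2
    return G


def max_abs_error(G, points, kernel):
    """Largest entrywise gap between the full kernel matrix and G G^T."""
    return max(
        abs(kernel(p, q) - sum(a * b for a, b in zip(gp, gq)))
        for p, gp in zip(points, G)
        for q, gq in zip(points, G)
    )


# Hypothetical data: two tight groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]
G2 = incomplete_cholesky(points, rbf_kernel, rank=2)
G6 = incomplete_cholesky(points, rbf_kernel, rank=6)
err2 = max_abs_error(G2, points, rbf_kernel)
err6 = max_abs_error(G6, points, rbf_kernel)
```

With two well-separated groups, two pivots already capture most of the kernel matrix, so storage drops from an n-by-n matrix to an n-by-2 factor at a small approximation cost.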

algorithm, dataset, kernel k-means, (14 more...)

2002.02846

Country:

North America > United States (0.14)
Asia > China > Henan Province > Zhengzhou (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Wilde, Henry, Knight, Vincent, Gillard, Jonathan

A novel initialisation based on hospital-resident assignment for the k-modes algorithm

arXiv.org Machine Learning, Feb-7-2020

This paper presents a new way of selecting an initial solution for the k-modes algorithm that allows for a notion of mathematical fairness and makes use of the data in a way that the common initialisations from the literature do not. The method, which utilises the Hospital-Resident Assignment Problem to find the set of initial cluster centroids, is compared with the current initialisations on both benchmark datasets and a body of newly generated artificial datasets. Based on this analysis, the proposed method is shown to outperform the other initialisations in the majority of cases, especially when the number of clusters is optimised. In addition, we find that our method outperforms the leading established method specifically for low-density data.

algorithm, dataset, initialisation method, (11 more...)

2002.02701

Country:

North America > United States > New York (0.04)
Asia (0.04)

Genre: Research Report (0.40)

Industry: Health & Medicine > Health Care Providers & Services (0.62)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)