Goto

Collaborating Authors

 Clustering


Understanding KMeans Clustering for Data Science Beginners

#artificialintelligence

Clustering is an unsupervised learning method whose job is to separate the population or data points into several groups, such that data points in a group are more similar to each other dissimilar to the data points of other groups. It is nothing but a collection of objects based on similarity and dissimilarity between them. KMeans clustering is an Unsupervised Machine Learning algorithm that does the clustering task. In this method, the'n' observations are grouped into'K' clusters based on the distance. The algorithm tries to minimize the within-cluster variance(so that similar observations fall in the same cluster).


Kinematics clustering enables head impact subtyping for better traumatic brain injury prediction

arXiv.org Artificial Intelligence

Traumatic brain injury can be caused by various types of head impacts. However, due to different kinematic characteristics, many brain injury risk estimation models are not generalizable across the variety of impacts that humans may sustain. The current definitions of head impact subtypes are based on impact sources (e.g., football, traffic accident), which may not reflect the intrinsic kinematic similarities of impacts across the impact sources. To investigate the potential new definitions of impact subtypes based on kinematics, 3,161 head impacts from various sources including simulation, college football, mixed martial arts, and car racing were collected. We applied the K-means clustering to cluster the impacts on 16 standardized temporal features from head rotation kinematics. Then, we developed subtype-specific ridge regression models for cumulative strain damage (using the threshold of 15%), which significantly improved the estimation accuracy compared with the baseline method which mixed impacts from different sources and developed one model (R^2 from 0.7 to 0.9). To investigate the effect of kinematic features, we presented the top three critical features (maximum resultant angular acceleration, maximum angular acceleration along the z-axis, maximum linear acceleration along the y-axis) based on regression accuracy and used logistic regression to find the critical points for each feature that partitioned the subtypes. This study enables researchers to define head impact subtypes in a data-driven manner, which leads to more generalizable brain injury risk estimation.


Clustering Large Data Sets with Incremental Estimation of Low-density Separating Hyperplanes

arXiv.org Machine Learning

An efficient method for obtaining low-density hyperplane separators in the unsupervised context is proposed. Low density separators can be used to obtain a partition of a set of data based on their allocations to the different sides of the separators. The proposed method is based on applying stochastic gradient descent to the integrated density on the hyperplane with respect to a convolution of the underlying distribution and a smoothing kernel. In the case where the bandwidth of the smoothing kernel is decreased towards zero, the bias of these updates with respect to the true underlying density tends to zero, and convergence to a minimiser of the density on the hyperplane can be obtained. A post-processing of the partition induced by a collection of low-density hyperplanes yields an efficient and accurate clustering method which is capable of automatically selecting an appropriate number of clusters. Experiments with the proposed approach show that it is highly competitive in terms of both speed and accuracy when compared with relevant benchmarks. Code to implement the proposed approach is available in the form of an R package from https: //github.com/DavidHofmeyr/iMDH.


Nearest Neighborhood-Based Deep Clustering for Source Data-absent Unsupervised Domain Adaptation

arXiv.org Artificial Intelligence

In the classic setting of unsupervised domain adaptation (UDA), the labeled source data are available in the training phase. However, in many real-world scenarios, owing to some reasons such as privacy protection and information security, the source data is inaccessible, and only a model trained on the source domain is available. This paper proposes a novel deep clustering method for this challenging task. Aiming at the dynamical clustering at feature-level, we introduce extra constraints hidden in the geometric structure between data to assist the process. Concretely, we propose a geometry-based constraint, named semantic consistency on the nearest neighborhood (SCNNH), and use it to encourage robust clustering. To reach this goal, we construct the nearest neighborhood for every target data and take it as the fundamental clustering unit by building our objective on the geometry. Also, we develop a more SCNNH-compliant structure with an additional semantic credibility constraint, named semantic hyper-nearest neighborhood (SHNNH). After that, we extend our method to this new geometry. Extensive experiments on three challenging UDA datasets indicate that our method achieves state-of-the-art results. The proposed method has significant improvement on all datasets (as we adopt SHNNH, the average accuracy increases by over 3.0% on the large-scaled dataset). Code is available at https://github.com/tntek/N2DCX.


Electrical peak demand forecasting- A review

arXiv.org Artificial Intelligence

The power system is undergoing rapid evolution with the roll-out of advanced metering infrastructure and local energy applications (e.g. electric vehicles) as well as the increasing penetration of intermittent renewable energy at both transmission and distribution level, which characterizes the peak load demand with stronger randomness and less predictability and therefore poses a threat to the power grid security. Since storing large quantities of electricity to satisfy load demand is neither economically nor environmentally friendly, effective peak demand management strategies and reliable peak load forecast methods become essential for optimizing the power system operations. To this end, this paper provides a timely and comprehensive overview of peak load demand forecast methods in the literature. To our best knowledge, this is the first comprehensive review on such topic. In this paper we first give a precise and unified problem definition of peak load demand forecast. Second, 139 papers on peak load forecast methods were systematically reviewed where methods were classified into different stages based on the timeline. Thirdly, a comparative analysis of peak load forecast methods are summarized and different optimizing methods to improve the forecast performance are discussed. The paper ends with a comprehensive summary of the reviewed papers and a discussion of potential future research directions.


Predicting Popularity of Images Over 30 Days

arXiv.org Artificial Intelligence

The current work deals with the problem of attempting to predict the popularity of images before even being uploaded. This method is specifically focused on Flickr images. Social features of each image as well as that of the user who had uploaded it, have been recorded. The dataset also includes the engagement score of each image which is the ground truth value of the views obtained by each image over a period of 30 days. The work aims to predict the popularity of images on Flickr over a period of 30 days using the social features of the user and the image, as well as the visual features of the images. The method states that the engagement sequence of an image can be said to depend on two independent quantities, namely scale and shape of an image. Once the shape and scale of an image have been predicted, combining them the predicted sequence of an image over 30 days is obtained. The current work follows a previous work done in the same direction, with certain speculations and suggestions of improvement.


Distribution free optimality intervals for clustering

arXiv.org Machine Learning

We address the problem of validating the ouput of clustering algorithms. Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations. Furthermore, we present a generic method to obtain post-inference guarantees of near-optimality and stability for a clustering $\mathcal{C}$. The method can be instantiated for a variety of clustering criteria (also called loss functions) for which convex relaxations exist. Obtaining the guarantees amounts to solving a convex optimization problem. We demonstrate the practical relevance of this method by obtaining guarantees for the K-means and the Normalized Cut clustering criteria on realistic data sets. We also prove that asymptotic instability implies finite sample instability w.h.p., allowing inferences about the population clusterability from a sample. The guarantees do not depend on any distributional assumptions, but they depend on the data set $\mathcal{D}$ admitting a stable clustering.


Efficient Sparse Spherical k-Means for Document Clustering

arXiv.org Artificial Intelligence

Spherical k-Means is frequently used to cluster document collections because it performs reasonably well in many settings and is computationally efficient. However, the time complexity increases linearly with the number of clusters k, which limits the suitability of the algorithm for larger values of k depending on the size of the collection. Optimizations targeted at the Euclidean k-Means algorithm largely do not apply because the cosine distance is not a metric. We therefore propose an efficient indexing structure to improve the scalability of Spherical k-Means with respect to k. Our approach exploits the sparsity of the input vectors and the convergence behavior of k-Means to reduce the number of comparisons on each iteration significantly.


Demonstrating REACT: a Real-time Educational AI-powered Classroom Tool

arXiv.org Artificial Intelligence

We present a demonstration of REACT, a new Real-time Educational AI-powered Classroom Tool that employs EDM techniques for supporting the decision-making process of educators. REACT is a data-driven tool with a user-friendly graphical interface. It analyzes students' performance data and provides context-based alerts as well as recommendations to educators for course planning. Furthermore, it incorporates model-agnostic explanations for bringing explainability and interpretability in the process of decision making. This paper demonstrates a use case scenario of our proposed tool using a real-world dataset and presents the design of its architecture and user interface. This demonstration focuses on the agglomerative clustering of students based on their performance (i.e., incorrect responses and hints used) during an in-class activity. This formation of clusters of students with similar strengths and weaknesses may help educators to improve their course planning by identifying at-risk students, forming study groups, or encouraging tutoring between students of different strengths.


Advanced K-Means: Controlling Groups Sizes and Selecting Features

#artificialintelligence

The algorithm uses ideas from Linear Programming, in particular Network Models. Networks models are used, among other things, in logistics to optimise the flow of goods across a network of roads. We can see in the simple figure above that we have 5 nodes with directed arcs (the arrows) between them. Each node has a demand (negative) or supply (positive) value and the arcs have flow and cost values. For instance, the arc 2–4 has a flow of 4 and a cost of $2. Similarly, node 1 supplies 20 units and node 4 requires 5 units.