 Clustering


How to Perform K means clustering Python? - StatAnalytica

#artificialintelligence

K-means clustering is an unsupervised machine learning method used to identify clusters of data objects within a dataset. There are many clustering methods, but k-means is among the oldest and most widely used. Because it is straightforward to implement in Python, many data scientists and programmers adopt it. If you want to know how to implement k-means clustering in Python, keep reading. In this blog, we cover the necessary details of k-means clustering and work through an example to help you understand how the clustering works.
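As an illustration of the idea in this teaser (not the code from the linked article), here is a minimal sketch of the standard k-means loop (Lloyd's algorithm) using only the Python standard library. The function name `kmeans`, its parameters, and the sample points are assumptions made for this example.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster equal-length numeric tuples with Lloyd's algorithm.

    Returns (labels, centroids). Illustrative sketch only.
    """
    rng = random.Random(seed)
    # Initialise centroids with k distinct input points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Two visibly separated groups of 2-D points (made-up data).
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(sample, k=2)
print(labels, centroids)
```

In practice a library implementation would usually be preferred; the sketch is only meant to show the alternating assignment and update steps.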


Clustering City Nightlife using Machine Learning

#artificialintelligence

Everyone knows how the Covid-19 pandemic devastated the nightlife industry with social distancing, lockdowns, mask-wearing and early curfews. Nightlife spaces were shuttered because they had been deemed non-essential services and places of easy transmission for the coronavirus. Now that central and state governments in India have eased the restrictions, people can finally enjoy a breather, whether commemorating a special occasion or just spending time with friends over food and drinks. In a city like Pune, which boasts a lively nightlife scene, there's always a party happening somewhere or other. Widely known as the "IT hub of India", "Automobile and Manufacturing hub of India" and "Oxford of the East", Pune is known for its lifestyle, pleasant weather and just… everything good.


A Relation-Oriented Clustering Method for Open Relation Extraction

arXiv.org Artificial Intelligence

Clustering-based unsupervised relation discovery has gradually become one of the important methods of open relation extraction (OpenRE). However, high-dimensional vectors can encode complex linguistic information, which leads to the problem that the derived clusters cannot explicitly align with the relational semantic classes. In this work, we propose a relation-oriented clustering model and use it to identify novel relations in unlabeled data. Specifically, to enable the model to learn to cluster relational data, our method leverages the readily available labeled data of pre-defined relations to learn a relation-oriented representation. We minimize the distance between instances with the same relation by gathering the instances towards their corresponding relation centroids to form a cluster structure, so that the learned representation is cluster-friendly. To reduce the clustering bias on predefined classes, we optimize the model by minimizing a joint objective on both labeled and unlabeled data. Experimental results show that our method reduces the error rate by 29.2% and 15.7%, on two datasets respectively, compared with current SOTA methods.
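To make the phrase "gathering the instances towards their corresponding relation centroids" concrete, the following is a rough sketch of one possible centroid-pull loss: the mean squared distance from each labeled instance to the centroid of its relation class. This is an assumption for illustration, not the paper's actual objective; the function names and toy embeddings are hypothetical.

```python
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Return {label: mean vector} over the instances carrying that label."""
    groups = defaultdict(list)
    for vec, lab in zip(embeddings, labels):
        groups[lab].append(vec)
    return {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in groups.items()
    }


def centroid_loss(embeddings, labels):
    """Mean squared Euclidean distance of each instance to its class centroid.

    A lower value means same-relation instances sit closer together,
    i.e. the representation is more "cluster-friendly".
    """
    cents = class_centroids(embeddings, labels)
    total = sum(
        sum((a - b) ** 2 for a, b in zip(vec, cents[lab]))
        for vec, lab in zip(embeddings, labels)
    )
    return total / len(embeddings)


# Toy 2-D "embeddings" for two hypothetical relations, "a" and "b".
spread = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
tight = [[1.0, 0.0], [1.0, 0.0], [10.0, 11.0], [10.0, 11.0]]
relations = ["a", "a", "b", "b"]
print(centroid_loss(spread, relations), centroid_loss(tight, relations))
```

In a trained model this quantity would be minimized with respect to the encoder parameters; here it is only evaluated on fixed vectors.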


Co-Embedding: Discovering Communities on Bipartite Graphs through Projection

arXiv.org Artificial Intelligence

Many datasets take the form of a bipartite graph where two types of nodes are connected by relationships, like the movies watched by a user or the tags associated with a file. Partitioning the bipartite graph could be used to speed up recommender systems, or to reduce an information retrieval system's index size, by identifying groups of items with similar properties. This type of graph is often processed by algorithms using the Vector Space Model representation, where a binary vector of 0s and 1s represents an item. The main problem with this representation is that relatedness between dimensions, such as synonymy between words, is not considered. This article proposes a co-clustering algorithm using item projection, allowing the measurement of feature similarity. We evaluated our algorithm on a cluster retrieval task. Over various datasets, our algorithm produced well-balanced clusters with coherent items, leading to high retrieval scores on this task.
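The limitation of the binary Vector Space Model mentioned in this abstract can be seen in a small sketch. The tag vocabulary, the items, and the helper names below are made up for illustration; the co-clustering algorithm itself is not reproduced here.

```python
import math


def binary_vector(item_tags, vocabulary):
    """Encode an item as a 0/1 vector over a fixed tag vocabulary."""
    return [1 if tag in item_tags else 0 for tag in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical vocabulary in which "film" and "movie" are synonyms.
vocabulary = ["film", "movie", "comedy"]
item_a = binary_vector({"film"}, vocabulary)
item_b = binary_vector({"movie"}, vocabulary)

# The two items describe the same thing, yet their vectors share no
# dimension, so the similarity comes out as zero.
print(cosine(item_a, item_b))
```

Because each tag is an independent dimension, the model cannot see that two different dimensions are related, which is the gap a projection-based approach aims to address.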


Non-smooth Bayesian Optimization in Tuning Problems

arXiv.org Machine Learning

Building surrogate models is one common approach when we attempt to learn unknown black-box functions. Bayesian optimization provides a framework which allows us to build surrogate models based on sequential samples drawn from the function and find the optimum. Tuning algorithmic parameters to optimize the performance of large, complicated "black-box" application codes is a specific important application, which aims at finding the optima of black-box functions. Within the Bayesian optimization framework, the Gaussian process model produces smooth or continuous sample paths. However, the black-box function in the tuning problem is often non-smooth. This difficult tuning problem is worsened by the fact that we usually have limited sequential samples from the black-box function. Motivated by these issues encountered in tuning, we propose a novel additive Gaussian process model called clustered Gaussian process (cGP), where the additive components are induced by clustering. By using this surrogate model, we want to capture the non-smoothness of the black-box function. In addition to an algorithm for constructing this model, we also apply the model to several artificial and real applications to evaluate it. In the examples we studied, the performance can be improved by as much as 90% among repeated experiments.


Multi-Level Features Contrastive Networks for Unsupervised Domain Adaptation

arXiv.org Artificial Intelligence

Unsupervised domain adaptation aims to train a model from the labeled source domain to make predictions on the unlabeled target domain when the data distributions of the two domains differ. As a result, it needs to reduce the data distribution difference between the two domains to improve the model's generalization ability. Existing methods tend to align the two domains directly at the domain level, or perform class-level domain alignment based on deep features. The former ignores the relationship between the various classes in the two domains, which may cause serious negative transfer; the latter alleviates it by introducing pseudo-labels of the target domain, but it does not consider the importance of performing class-level alignment on shallow feature representations. In this paper, we build on the method of class-level alignment. The proposed method dramatically reduces the difference between the two domains by aligning multi-level features. In the case that the two domains share the label space, the class-level alignment is implemented by introducing Multi-Level Feature Contrastive Networks (MLFCNet). In practice, since the categories of samples in the target domain are unavailable, we iteratively use a clustering algorithm to obtain the pseudo-labels, and then minimize the Multi-Level Contrastive Discrepancy (MLCD) loss to achieve more accurate class-level alignment. Experiments on three real-world benchmarks, ImageCLEF-DA, Office-31 and Office-Home, demonstrate that MLFCNet compares favorably against the existing state-of-the-art domain adaptation methods.


Improving Test Case Generation for REST APIs Through Hierarchical Clustering

arXiv.org Artificial Intelligence

With the ever-increasing use of web APIs in modern-day applications, it is becoming more important to test the system as a whole. In the last decade, tools and approaches have been proposed to automate the creation of system-level test cases for these APIs using evolutionary algorithms (EAs). One of the limiting factors of EAs is that the genetic operators (crossover and mutation) are fully randomized, potentially breaking promising patterns in the sequences of API requests discovered during the search. Breaking these patterns has a negative impact on the effectiveness of the test case generation process. To address this limitation, this paper proposes a new approach that uses agglomerative hierarchical clustering (AHC) to infer a linkage tree model, which captures, replicates, and preserves these patterns in new test cases. We evaluate our approach, called LT-MOSA, by performing an empirical study on 7 real-world benchmark applications w.r.t. branch coverage and real-fault detection capability. We also compare LT-MOSA with the two existing state-of-the-art white-box techniques (MIO, MOSA) for REST API testing. Our results show that LT-MOSA achieves a statistically significant increase in test target coverage (i.e., lines and branches) compared to MIO and MOSA in 4 and 5 out of 7 applications, respectively. Furthermore, LT-MOSA discovers 27 and 18 unique real-faults that are left undetected by MIO and MOSA, respectively.


A Scalable Last-Mile Delivery Service: From Simulation to Scaled Experiment

arXiv.org Artificial Intelligence

In this paper, we investigate the problem of a last-mile delivery service that selects up to $N$ available vehicles to deliver $M$ packages from a centralized depot to $M$ delivery locations. The objective of the last-mile delivery service is to jointly maximize customer satisfaction (minimize delivery time) and minimize operating cost (minimize total travel time) by selecting the optimal number of vehicles to perform the deliveries. We model this as an assignment (vehicles to packages) and path planning (determining the delivery order and route) problem, which is equivalent to the NP-hard multiple traveling salesperson problem. We propose a scalable heuristic algorithm, which sacrifices some optimality to achieve a reasonable computational cost for a high number of packages. The algorithm combines hierarchical clustering with a greedy search. To validate our approach, we compare the results of our simulation to experiments in a $1$:$25$ scale robotic testbed for future mobility systems.
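The combination of hierarchical clustering with a greedy search can be sketched roughly as follows. This is not the paper's algorithm: it assumes centroid-linkage agglomerative clustering to group stops per vehicle and a nearest-neighbour rule to order each vehicle's stops, and the coordinates, function names, and fixed vehicle count are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def centroid(cluster):
    """Mean position of the points in a cluster."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)


def agglomerate(points, n_clusters):
    """Repeatedly merge the two clusters with the closest centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def greedy_route(depot, stops):
    """Visit stops by always driving to the nearest unvisited one."""
    route, current, remaining = [], depot, list(stops)
    while remaining:
        nearest = min(remaining, key=lambda s: dist(current, s))
        remaining.remove(nearest)
        route.append(nearest)
        current = nearest
    return route


# Made-up depot and delivery locations; two vehicles assumed.
depot = (0, 0)
stops = [(1, 0), (2, 0), (3, 0), (0, 10), (0, 11)]
routes = [greedy_route(depot, c) for c in agglomerate(stops, n_clusters=2)]
print(routes)
```

Both steps trade optimality for speed: clustering fixes the package-to-vehicle assignment up front, and the nearest-neighbour ordering avoids searching over all possible routes.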


Application of Machine Learning in Early Recommendation of Cardiac Resynchronization Therapy

arXiv.org Artificial Intelligence

Heart failure (HF) is a leading cause of morbidity, mortality, and health care costs. Prolonged conduction through the myocardium can occur with HF, and a device-driven approach, termed cardiac resynchronization therapy (CRT), can improve left ventricular (LV) myocardial conduction patterns. While a functional benefit of CRT has been demonstrated, a large proportion of HF patients (30-50%) receiving CRT do not show sufficient improvement. Moreover, prospectively identifying HF patients who would benefit from CRT remains a clinical challenge. Accordingly, strategies to effectively predict which HF patients would derive a functional benefit from CRT hold great medical and socio-economic importance. Thus, we used machine learning methods of classifying HF patients, namely Cluster Analysis, Decision Trees, and Artificial Neural Networks, to develop predictive models of individual outcomes following CRT. Clinical, functional, and biomarker data were collected in HF patients before and following CRT. A prospective 6-month endpoint of a reduction in LV volume was defined as a CRT response. Using this approach (418 responders, 412 non-responders), each with 56 parameters, we could classify HF patients based on their response to CRT with more than 95% success. We have demonstrated that machine learning approaches can identify HF patients with a high probability of a positive CRT response (95% accuracy) and, of equal importance, identify those HF patients who would not derive a functional benefit from CRT. Developing this approach into a clinical algorithm to assist in clinical decision-making regarding the use of CRT in HF patients would potentially improve outcomes and reduce health care costs.


An Unsupervised Deep-Learning Method for Fingerprint Classification: the CCAE Network and the Hybrid Clustering Strategy

arXiv.org Artificial Intelligence

Fingerprint classification is an important and effective method to speed up fingerprint matching and improve its accuracy. Conventional supervised methods need a large amount of pre-labeled data and thus consume immense human resources. In this paper, we propose a new and efficient unsupervised deep learning method that can extract fingerprint features and classify fingerprint patterns automatically. In this approach, a new model named constraint convolutional auto-encoder (CCAE) is used to extract fingerprint features and a hybrid clustering strategy is applied to obtain the final clusters. A set of experiments on the NIST-DB4 dataset shows that the proposed unsupervised method performs efficiently on fingerprint classification. For example, the CCAE achieves an accuracy of 97.3% on only 1000 unlabeled fingerprints in the NIST-DB4.