 Clustering


Approximation Algorithms for Socially Fair Clustering

arXiv.org Machine Learning

We present an $(e^{O(p)} \frac{\log \ell}{\log\log\ell})$-approximation algorithm for socially fair clustering with the $\ell_p$-objective. In this problem, we are given a set of points in a metric space. Each point belongs to one (or several) of $\ell$ groups. The goal is to find a $k$-medians, $k$-means, or, more generally, $\ell_p$-clustering that is simultaneously good for all of the groups. More precisely, we need to find a set of $k$ centers $C$ so as to minimize the maximum over all groups $j$ of $\sum_{u \text{ in group }j} d(u,C)^p$. The socially fair clustering problem was independently proposed by Abbasi, Bhaskara, and Venkatasubramanian [2021] and Ghadiri, Samadi, and Vempala [2021]. Our algorithm improves and generalizes their $O(\ell)$-approximation algorithms for the problem. The natural LP relaxation for the problem has an integrality gap of $\Omega(\ell)$. In order to obtain our result, we introduce a strengthened LP relaxation and show that it has an integrality gap of $\Theta(\frac{\log \ell}{\log\log\ell})$ for a fixed $p$. Additionally, we present a bicriteria approximation algorithm, which generalizes the bicriteria approximation of Abbasi et al. [2021].
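
To make the objective concrete, here is a minimal Python sketch that evaluates the socially fair cost $\max_j \sum_{u \text{ in group } j} d(u,C)^p$ for a given set of centers. It only computes the objective and is not the approximation algorithm from the abstract. The function name `fair_cost`, the Euclidean metric, and the toy data are assumptions made for illustration.

```python
import math

def fair_cost(points, groups, centers, p=1):
    """Return the max over groups j of the sum of d(u, C)^p for u in group j.

    points  : list of coordinate tuples
    groups  : list of sets; groups[i] holds the group labels of points[i]
              (a point may belong to several groups)
    centers : list of coordinate tuples (the candidate set C)
    Assumption: d is the Euclidean distance; any metric could be swapped in.
    """
    cost_per_group = {}
    for u, memberships in zip(points, groups):
        # d(u, C) is the distance from u to its nearest center
        d_p = min(math.dist(u, c) for c in centers) ** p
        for j in memberships:
            cost_per_group[j] = cost_per_group.get(j, 0.0) + d_p
    return max(cost_per_group.values())

# Toy instance: four points on a line, two groups, k = 2 centers
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
groups = [{"a"}, {"a"}, {"b"}, {"a", "b"}]
centers = [(0.0, 0.0), (10.0, 0.0)]

cost_p1 = fair_cost(points, groups, centers, p=1)  # k-medians style cost
cost_p2 = fair_cost(points, groups, centers, p=2)  # k-means style cost
```

With `p=1` the worst-off group is `a` (distances 0, 2, and 1), and with `p=2` the same group dominates with squared distances 0, 4, and 1.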


The 10 data mining techniques data scientists need for their toolbox

#artificialintelligence

At their core, data scientists have a math and statistics background, and they build advanced analytics on top of it. Just like their software engineering counterparts, data scientists have to interact with the business side. This includes understanding the domain well enough to draw insights. Data scientists are often tasked with analyzing data to help the business, and this requires a level of business acumen. Finally, their results need to be presented to the business in an understandable fashion. This requires the ability to communicate complex results and observations verbally and visually, in a way that the business can understand and act on. Thus, it is extremely valuable for any aspiring data scientist to learn data mining: the process of structuring raw data and formulating or recognizing patterns in it through mathematical and computational algorithms. This helps to generate new information and unlock insights. Here is a simple list of reasons why you should study data mining. There is heavy demand for deep analytical talent in the tech industry at the moment. You can gain a valuable skill if you want to move into Data Science / Big Data / Predictive Analytics. Given lots of data, you'll be able to discover patterns and models that are valid, useful, unexpected, and understandable. You can use some variables to predict unknown or future values of other variables (predictive). You can apply your knowledge of CS theory, Machine Learning, and Databases. Last but not least, you'll learn a lot about algorithms, computing architectures, data scalability, and automation for handling massive datasets.


Meta-learning representations for clustering with infinite Gaussian mixture models

arXiv.org Machine Learning

For better clustering performance, appropriate representations are critical. Although many neural network-based metric learning methods have been proposed, they do not directly train neural networks to improve clustering performance. We propose a meta-learning method that trains neural networks to obtain representations such that clustering performance improves when the representations are clustered by variational Bayesian (VB) inference with an infinite Gaussian mixture model. The proposed method can cluster unseen unlabeled data using knowledge meta-learned with labeled data that are different from the unlabeled data. For the objective function, we propose a continuous approximation of the adjusted Rand index (ARI), by which we can evaluate the clustering performance from soft clustering assignments. Since the approximated ARI and the VB inference procedure are differentiable, we can backpropagate the objective function through the VB inference procedure to train the neural networks. With experiments using text and image data sets, we demonstrate that our proposed method has a higher adjusted Rand index than existing methods do.
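
The idea of scoring soft assignments with an ARI-like quantity can be sketched as follows. This is one possible pair-counting relaxation, written under the assumption that the probability of two points sharing a cluster is the inner product of their soft assignment vectors. It is not necessarily the formula used in the paper, and the function name `soft_ari` is hypothetical. With one-hot (hard) assignments it reduces to the standard ARI.

```python
from itertools import combinations

def soft_ari(true_labels, soft_assign):
    """Pair-counting ARI evaluated on soft cluster assignments.

    true_labels : list of ground-truth class labels, one per point
    soft_assign : list of probability vectors, one per point
    Assumption: P(points n and m share a cluster) = sum_k r[n][k] * r[m][k].
    """
    n_pairs = 0
    same_true = 0.0   # pairs that share a true class
    same_pred = 0.0   # expected pairs that share a predicted cluster
    same_both = 0.0   # expected pairs that share both
    for n, m in combinations(range(len(true_labels)), 2):
        p_same = sum(a * b for a, b in zip(soft_assign[n], soft_assign[m]))
        t_same = 1.0 if true_labels[n] == true_labels[m] else 0.0
        n_pairs += 1
        same_true += t_same
        same_pred += p_same
        same_both += t_same * p_same
    expected = same_true * same_pred / n_pairs
    max_index = 0.5 * (same_true + same_pred)
    denom = max_index - expected
    if denom == 0.0:
        return 0.0  # degenerate case; the value is an arbitrary choice here
    return (same_both - expected) / denom

labels = [0, 0, 1, 1]
perfect = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
uniform = [[0.5, 0.5]] * 4

score_perfect = soft_ari(labels, perfect)  # hard, correct assignments
score_uniform = soft_ari(labels, uniform)  # uninformative assignments
```

Because every operation is a sum or product of assignment probabilities, the score varies smoothly with the soft assignments, which is the property that gradient-based training relies on.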


Expert decision support system for aeroacoustic classification

arXiv.org Artificial Intelligence

This paper presents an expert decision support system for time-invariant aeroacoustic source classification. The system comprises two steps: first, the calculation of acoustic properties based on spectral and spatial information; and second, the clustering of the sources based on these properties. Example data of two scaled airframe half-model wind tunnel measurements is evaluated based on deconvolved beamforming maps. A variety of aeroacoustic features are proposed that capture the characteristics and properties of the spectra. These features represent aeroacoustic properties that can be interpreted by both the machine and experts. The features are independent of absolute flow parameters such as the observed Mach numbers. This enables the proposed method to analyze data which is measured at different flow configurations. The aeroacoustic sources are clustered based on these features to determine similar or atypical behavior. For the given example data, the method results in source type clusters that correspond to human expert classification of the source types. Combined with a classification confidence and the mean feature values for each cluster, these clusters help aeroacoustic experts in classifying the identified sources and support them in analyzing their typical behavior and identifying spurious sources in-situ during measurement campaigns.


Fully Explained DBScan Clustering Algorithm with Python

#artificialintelligence

In this article, we will discuss DBScan, a clustering algorithm from machine learning. Unlike distance-based approaches, this algorithm is density-based. Distance-based clustering looks for closeness among data points, but it can misclassify a point that actually belongs to another class. Density-based clustering is better suited to this kind of scenario. Clustering algorithms fall under unsupervised learning, in which we don't rely on target variables to form clusters.
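
As an illustration of the density-based idea, here is a minimal pure-Python sketch of DBSCAN. It is not the article's own code: the function names, the Euclidean metric, and the parameter values `eps` and `min_pts` are assumptions chosen for the toy data. A point with at least `min_pts` neighbors within radius `eps` (counting itself) is treated as a core point, clusters grow outward from core points, and points reachable from no core point are labeled noise.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep growing
    return labels

# Two dense groups and one isolated point
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
result = dbscan(data, eps=1.5, min_pts=3)
```

On this toy data the two dense groups receive labels 0 and 1, and the isolated point at (50, 50) is marked as noise. The quadratic-time neighbor search keeps the sketch short; practical implementations use a spatial index instead.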


A New K means Grey Wolf Algorithm for Engineering Problems

arXiv.org Artificial Intelligence

Purpose: The development of metaheuristic algorithms has increased as researchers use them extensively in business, science, and engineering. One of the common metaheuristic optimization algorithms is Grey Wolf Optimization (GWO). The algorithm imitates the searching and attacking behavior of grey wolves. The main purpose of this paper is to overcome GWO's tendency to become trapped in local optima. Design/methodology/approach: In this paper, the K-means clustering algorithm is used to enhance the performance of the original Grey Wolf Optimization by dividing the population into different parts. The proposed algorithm is called K-means clustering Grey Wolf Optimization (KMGWO). Findings: Results show that KMGWO is superior to GWO. To evaluate its performance, KMGWO was applied to 10 CEC2019 benchmark test functions. KMGWO was also compared to Cat Swarm Optimization (CSO), Whale Optimization Algorithm-Bat Algorithm (WOA-BAT), and WOA, and it achieved the first rank in terms of performance. Statistical results showed that KMGWO achieved a more significant value than the compared algorithms. KMGWO was also used to solve a pressure vessel design problem, where it also performed better. Originality/value: Results show that KMGWO is superior to GWO. KMGWO was also compared to Cat Swarm Optimization (CSO), Whale Optimization Algorithm-Bat Algorithm (WOA-BAT), WOA, and GWO, and it achieved the first rank in terms of performance. KMGWO was also used to solve a classical engineering problem, and it is superior.


Similarity measure for sparse time course data based on Gaussian processes

arXiv.org Machine Learning

We propose a similarity measure for sparsely sampled time course data in the form of a log-likelihood ratio of Gaussian processes (GP). The proposed GP similarity is similar to a Bayes factor and provides enhanced robustness to noise in sparse time series, such as those found in various biological settings, e.g., gene transcriptomics. We show that the GP measure is equivalent to the Euclidean distance when the noise variance in the GP is negligible compared to the noise variance of the signal. Our numerical experiments on both synthetic and real data show improved performance of the GP similarity when used in conjunction with two distance-based clustering methods.


CAC: A Clustering Based Framework for Classification

arXiv.org Artificial Intelligence

In data containing heterogeneous subpopulations, classification performance benefits from incorporating the knowledge of cluster structure in the classifier. Previous methods for such combined clustering and classification either are classifier-specific and not generic or independently perform clustering and classifier training, which may not form clusters that can potentially benefit classifier performance. The question of how to perform clustering to improve the performance of classifiers trained on the clusters has received scant attention in previous literature despite its importance in several real-world applications. In this paper, we theoretically analyze when and how clustering may help in obtaining accurate classifiers. We design a simple, efficient, and generic framework called Classification Aware Clustering (CAC), to find clusters that are well suited for being used as training datasets by classifiers for each underlying subpopulation. Our experiments on synthetic and real benchmark datasets demonstrate the efficacy of CAC over previous methods for combined clustering and classification.


A Fast Heuristic for Gateway Location in Wireless Backhaul of 5G Ultra-Dense Networks

arXiv.org Artificial Intelligence

In 5G Ultra-Dense Networks, a distributed wireless backhaul is an attractive solution for forwarding traffic to the core. The macro-cell coverage area is divided into many small cells. A few of these cells are designated as gateways and are linked to the core by high-capacity fiber optic links. Each small cell is associated with one gateway, and all small cells forward their traffic to their respective gateway through multi-hop mesh networks. We investigate the gateway location problem and show that finding near-optimal gateway locations improves the backhaul network capacity. An exact p-median integer linear program is formulated for comparison with our novel K-GA heuristic, which combines a Genetic Algorithm (GA) with K-means clustering to find near-optimal gateway locations. We compare the performance of K-GA with six other approaches in terms of average number of hops and backhaul network capacity at different node densities through extensive Monte Carlo simulations. All approaches are tested in various user distribution scenarios, including uniform distribution, bivariate Gaussian distribution, and cluster distribution. In all cases K-GA provides near-optimal results, achieving an average number of hops and a backhaul network capacity within 2% of optimal while saving an average of 95% of the execution time.


A Comprehensive Review of Computer-aided Whole-slide Image Analysis: from Datasets to Feature Extraction, Segmentation, Classification, and Detection Approaches

arXiv.org Artificial Intelligence

With the development of computer-aided diagnosis (CAD) and image scanning technology, Whole-slide Image (WSI) scanners are widely used in the field of pathological diagnosis. Therefore, WSI analysis has become the key to modern digital pathology. Since 2004, WSI has been used more and more in CAD. Because machine vision methods usually run semi-automatically or fully automatically on computers, they are highly efficient and labor-saving. The combination of WSI and CAD technologies for segmentation, classification, and detection helps histopathologists obtain more stable and quantitative analysis results, save labor costs, and improve diagnostic objectivity. This paper reviews the methods of WSI analysis based on machine learning. Firstly, the development status of WSI and CAD methods is introduced. Secondly, we discuss publicly available WSI datasets and evaluation metrics for segmentation, classification, and detection tasks. Then, the latest developments of machine learning in WSI segmentation, classification, and detection are reviewed in turn. Finally, the existing methods are studied, the applicability of the analysis methods is analyzed, and the application prospects of the analysis methods in this field are forecasted.