Goto

Collaborating Authors

 Clustering


Constructing Cell-type Taxonomy by Optimal Transport with Relaxed Marginal Constraints

arXiv.org Machine Learning

The rapid emergence of single-cell data has facilitated the study of many different biological conditions at the cellular level. Cluster analysis has been widely applied to identify cell types, capturing the essential patterns of the original data in a much more concise form. One challenge in the cluster analysis of cells is matching clusters extracted from datasets of different origins or conditions. Many existing algorithms cannot recognize new cell types present in only one of the two samples when establishing a correspondence between clusters obtained from two samples. Additionally, when there are more than two samples, it is advantageous to align clusters across all samples simultaneously rather than performing pairwise alignment. Our approach aims to construct a taxonomy for cell clusters across all samples to better annotate these clusters and effectively extract features for downstream analysis. A new system for constructing cell-type taxonomy has been developed by combining the technique of Optimal Transport with Relaxed Marginal Constraints (OT-RMC) and the simultaneous alignment of clusters across multiple samples. OT-RMC allows us to address challenges that arise when the proportions of clusters vary substantially between samples or when some clusters do not appear in all the samples. Experiments on more than twenty datasets demonstrate that the taxonomy constructed by this new system can yield highly accurate annotation of cell types. Additionally, sample-level features extracted based on the taxonomy result in accurate classification of samples.


Detecting Anomalies Using Rotated Isolation Forest

arXiv.org Artificial Intelligence

The Isolation Forest (iForest), proposed by Liu, Ting, and Zhou at TKDE 2012, has become a prominent tool for unsupervised anomaly detection. However, recent research by Hariri, Kind, and Brunner, published in TKDE 2021, has revealed issues with iForest. They identified the presence of axis-aligned ghost clusters that can be misidentified as normal clusters, leading to biased anomaly scores and inaccurate predictions. In response, they developed the Extended Isolation Forest (EIF), which effectively solves these issues by eliminating the ghost clusters introduced by iForest. This enhancement results in improved consistency of anomaly scores and superior performance. We reveal a previously overlooked problem in the Extended Isolation Forest (EIF), showing that it is vulnerable to ghost inter-clusters between normal clusters of data points. In this paper, we introduce the Rotated Isolation Forest (RIF) algorithm which effectively addresses both the axis-aligned ghost clusters observed in iForest and the ghost inter-clusters seen in EIF. RIF accomplishes this by randomly rotating the dataset (using random rotation matrices and QR decomposition) before feeding it into the iForest construction, thereby increasing dataset variation and eliminating ghost clusters. Our experiments conclusively demonstrate that the RIF algorithm outperforms iForest and EIF, as evidenced by the results obtained from both synthetic datasets and real-world datasets.


An eco-driving approach for ride comfort improvement

arXiv.org Artificial Intelligence

New challenges on transport systems are emerging due to the advances that the current paradigm is experiencing. The breakthrough of the autonomous car brings concerns about ride comfort, while the pollution concerns have arisen in recent years. In the model of automated automobiles, drivers are expected to become passengers, so, they will be more prone to suffer from ride discomfort or motion sickness. Conversely, the eco-driving implications should not be set aside because of the influence of pollution on climate and people's health. For that reason, a joint assessment of the aforementioned points would have a positive impact. Thus, this work presents a self-organised map-based solution to assess ride comfort features of individuals considering their driving style from the viewpoint of eco-driving. For this purpose, a previously acquired dataset from an instrumented car was used to classify drivers regarding the causes of their lack of ride comfort and eco-friendliness. Once drivers are classified regarding their driving style, natural-language-based recommendations are proposed to increase the engagement with the system. Hence, potential improvements of up to the 57.7% for ride comfort evaluation parameters, as well as up to the 47.1% in greenhouse-gasses emissions are expected to be reached.


An Intelligent System-on-a-Chip for a Real-Time Assessment of Fuel Consumption to Promote Eco-Driving

arXiv.org Artificial Intelligence

Pollution that originates from automobiles is a concern in the current world, not only because of global warming, but also due to the harmful effects on people's health and lives. Despite regulations on exhaust gas emissions being applied, minimizing unsuitable driving habits that cause elevated fuel consumption and emissions would achieve further reductions. For that reason, this work proposes a self-organized map (SOM)-based intelligent system in order to provide drivers with eco-driving-intended driving style (DS) recommendations. The development of the DS advisor uses driving data from the Uyanik instrumented car. The system classifies drivers regarding the underlying causes of non-optimal DSs from the eco-driving viewpoint. When compared with other solutions, the main advantage of this approach is the personalization of the recommendations that are provided to motorists, comprising the handling of the pedals and the gearbox, with potential improvements in both fuel consumption and emissions ranging from the 9.5\% to the 31.5\%, or even higher for drivers that are strongly engaged with the system. It was successfully implemented using a field-programmable gate array (FPGA) device of the Xilinx ZynQ programmable system-on-a-chip (PSoC) family. This SOM-based system allows for real-time implementation, state-of-the-art timing performances, and low power consumption, which are suitable for developing advanced driving assistance systems (ADASs).


Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition

arXiv.org Artificial Intelligence

We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. It utilizes a cross-lingual embedding clustering method to construct a hierarchical Softmax (H-Softmax) decoder, which enables similar tokens across different languages to share similar decoder representations. It addresses the limitations of the previous Huffman-based H-Softmax method, which relied on shallow features in token similarity assessments. Through experiments on a downsampled dataset of 15 languages, we demonstrate the effectiveness of our approach in improving low-resource multilingual ASR accuracy.


From tools to thieves: Measuring and understanding public perceptions of AI through crowdsourced metaphors

arXiv.org Artificial Intelligence

How has the public responded to the increasing prevalence of artificial intelligence (AI)-based technologies? We investigate public perceptions of AI by collecting over 12,000 responses over 12 months from a nationally representative U.S. sample. Participants provided open-ended metaphors reflecting their mental models of AI, a methodology that overcomes the limitations of traditional self-reported measures. Using a mixed-methods approach combining quantitative clustering and qualitative coding, we identify 20 dominant metaphors shaping public understanding of AI. To analyze these metaphors systematically, we present a scalable framework integrating language modeling (LM)-based techniques to measure key dimensions of public perception: anthropomorphism (attribution of human-like qualities), warmth, and competence. We find that Americans generally view AI as warm and competent, and that over the past year, perceptions of AI's human-likeness and warmth have significantly increased ($+34\%, r = 0.80, p < 0.01; +41\%, r = 0.62, p < 0.05$). Furthermore, these implicit perceptions, along with the identified dominant metaphors, strongly predict trust in and willingness to adopt AI ($r^2 = 0.21, 0.18, p < 0.001$). We further explore how differences in metaphors and implicit perceptions--such as the higher propensity of women, older individuals, and people of color to anthropomorphize AI--shed light on demographic disparities in trust and adoption. In addition to our dataset and framework for tracking evolving public attitudes, we provide actionable insights on using metaphors for inclusive and responsible AI development.


A spectral clustering-type algorithm for the consistent estimation of the Hurst distribution in moderately high dimensions

arXiv.org Machine Learning

Scale invariance (fractality) is a prominent feature of the large-scale behavior of many stochastic systems. In this work, we construct an algorithm for the statistical identification of the Hurst distribution (in particular, the scaling exponents) undergirding a high-dimensional fractal system. The algorithm is based on wavelet random matrices, modified spectral clustering and a model selection step for picking the value of the clustering precision hyperparameter. In a moderately high-dimensional regime where the dimension, the sample size and the scale go to infinity, we show that the algorithm consistently estimates the Hurst distribution. Monte Carlo simulations show that the proposed methodology is efficient for realistic sample sizes and outperforms another popular clustering method based on mixed-Gaussian modeling. We apply the algorithm in the analysis of real-world macroeconomic time series to unveil evidence for cointegration.


A Survey on Cluster-based Federated Learning

arXiv.org Machine Learning

As the industrial and commercial use of Federated Learning (FL) has expanded, so has the need for optimized algorithms. In settings were FL clients' data is non-independently and identically distributed (non-IID) and with highly heterogeneous distributions, the baseline FL approach seems to fall short. To tackle this issue, recent studies, have looked into personalized FL (PFL) which relaxes the implicit single-model constraint and allows for multiple hypotheses to be learned from the data or local models. Among the personalized FL approaches, cluster-based solutions (CFL) are particularly interesting whenever it is clear -through domain knowledge -that the clients can be separated into groups. In this paper, we study recent works on CFL, proposing: i) a classification of CFL solutions for personalization; ii) a structured review of literature iii) a review of alternative use cases for CFL. CCS Concepts: $\bullet$ General and reference $\rightarrow$ Surveys and overviews; $\bullet$ Computing methodologies $\rightarrow$ Machine learning; $\bullet$ Information systems $\rightarrow$ Clustering; $\bullet$ Security and privacy $\rightarrow$ Privacy-preserving protocols.


Multi-View Spectral Clustering for Graphs with Multiple View Structures

arXiv.org Artificial Intelligence

Despite the fundamental importance of clustering, to this day, much of the relevant research is still based on ambiguous foundations, leading to an unclear understanding of whether or how the various clustering methods are connected with each other. In this work, we provide an additional stepping stone towards resolving such ambiguities by presenting a general clustering framework that subsumes a series of seemingly disparate clustering methods, including various methods belonging to the widely popular spectral clustering framework. In fact, the generality of the proposed framework is additionally capable of shedding light to the largely unexplored area of multi-view graphs where each view may have differently clustered nodes. In turn, we propose GenClus: a method that is simultaneously an instance of this framework and a generalization of spectral clustering, while also being closely related to k-means as well. This results in a principled alternative to the few existing methods studying this special type of multi-view graphs. Then, we conduct in-depth experiments, which demonstrate that GenClus is more computationally efficient than existing methods, while also attaining similar or better clustering performance. Lastly, a qualitative real-world case-study further demonstrates the ability of GenClus to produce meaningful clusterings.


DBSCAN in domains with periodic boundary conditions

arXiv.org Artificial Intelligence

Many scientific problems involve data that is embedded in a space with periodic boundary conditions. This can for instance be related to an inherent cyclic or rotational symmetry in the data or a spatially extended periodicity. When analyzing such data, well-tailored methods are needed to obtain efficient approaches that obey the periodic boundary conditions of the problem. In this work, we present a method for applying a clustering algorithm to data embedded in a periodic domain based on the DBSCAN algorithm, a widely used unsupervised machine learning method that identifies clusters in data. The proposed method internally leverages the conventional DBSCAN algorithm for domains with open boundaries, such that it remains compatible with all optimized implementations for neighborhood searches in open domains. In this way, it retains the same optimized runtime complexity of $O(N\log N)$. We demonstrate the workings of the proposed method using synthetic data in one, two and three dimensions and also apply it to a real-world example involving the clustering of bubbles in a turbulent flow. The proposed approach is implemented in a ready-to-use Python package that we make publicly available.