 Clustering


A Survey of Some Density Based Clustering Techniques

arXiv.org Artificial Intelligence

Density-based clustering methods are a family of clustering methods used in data mining to extract previously unknown patterns from data sets. There are a number of density-based clustering methods, such as DBSCAN, OPTICS, DENCLUE, VDBSCAN, DVBSCAN, DBCLASD and ST-DBSCAN. This paper surveys these methods, covering their characteristics, advantages and disadvantages and, most importantly, their applicability to different types of data sets for mining useful and appropriate patterns.
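As a rough illustration of the density-based idea behind DBSCAN, the first method named above, here is a minimal pure-Python sketch. It is not taken from the survey: the function names, the `eps` and `min_pts` values and the toy points are all assumptions chosen for the example, and the neighbor search is a brute-force scan rather than an indexed one.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The two squares end up in separate clusters and the isolated point is labeled as noise, which is the behavior that distinguishes density-based methods from partitioning methods that must assign every point to some group.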


Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

arXiv.org Artificial Intelligence

In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a 2-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on 4 other languages from the Zerospeech Challenge, in some cases beating the previous state-of-the-art.


Can Evolutionary Clustering Have Theoretical Guarantees?

arXiv.org Artificial Intelligence

Clustering is a fundamental problem in many areas, which aims to partition a given data set into groups based on some distance measure, such that the data points in the same group are similar while those in different groups are dissimilar. Due to its importance and NP-hardness, many methods have been proposed, among which evolutionary algorithms are a popular class. Evolutionary clustering has found many successful applications, but all the results are empirical, lacking theoretical support. This paper fills this gap by proving that the approximation performance of the GSEMO (a simple multi-objective evolutionary algorithm) for solving four formulations of clustering, i.e., $k$-tMM, $k$-center, discrete $k$-median and $k$-means, can be theoretically guaranteed. Furthermore, we consider clustering under fairness, which tries to avoid algorithmic bias and has recently become an important research topic in machine learning. We prove that for discrete $k$-median clustering under individual fairness, the approximation performance of the GSEMO can be theoretically guaranteed with respect to both the objective function and the fairness constraint.
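To make three of the formulations named above concrete, the following sketch evaluates the standard $k$-center, $k$-median and $k$-means costs for a fixed set of centers. It only scores a given solution and does not implement the GSEMO; the function names and the toy data are assumptions, the centers are picked from the data points (the discrete setting) purely for simplicity, and $k$-tMM is left out.

```python
from math import dist


def nearest_center_distances(points, centers):
    """Distance from each point to its closest center."""
    return [min(dist(p, c) for c in centers) for p in points]


def k_center_cost(points, centers):
    """k-center: the largest distance from any point to its nearest center."""
    return max(nearest_center_distances(points, centers))


def k_median_cost(points, centers):
    """k-median: the sum of distances to the nearest center."""
    return sum(nearest_center_distances(points, centers))


def k_means_cost(points, centers):
    """k-means: the sum of squared distances to the nearest center."""
    return sum(d * d for d in nearest_center_distances(points, centers))


# Hypothetical 1-D-like data on a line, with k = 2 centers chosen from the points.
points = [(0, 0), (2, 0), (10, 0), (14, 0)]
centers = [(0, 0), (10, 0)]

print(k_center_cost(points, centers))
print(k_median_cost(points, centers))
print(k_means_cost(points, centers))
```

The three costs differ only in how the per-point distances are aggregated (maximum, sum, sum of squares), which is why a single search procedure can be analyzed against several of them.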


Compositional Clustering: Applications to Multi-Label Object Recognition and Speaker Identification

arXiv.org Artificial Intelligence

The goal is not just to partition the data into distinct and coherent groups, but also to infer the compositional relationships among the groups. This scenario arises in speaker diarization (i.e., infer who is speaking when from an audio wave) in the presence of simultaneous speech from multiple speakers [6, 36], which occurs frequently in real-world speech settings: The audio at each time t is generated as a composition of the voices of all the people speaking at time t, and the goal is to cluster the audio samples, over all timesteps, into sets of speakers. Hence, if there are 2 people who sometimes speak by themselves and sometimes speak simultaneously, then the clusters would correspond to the speaker sets {1}, {2}, and {1, 2} - the third cluster is not a third independent speaker, but rather the composition of the first two speakers. An analogous scenario arises in open-world (i.e., test classes are disjoint from training classes) multi-label object recognition when clustering images such that each image may contain multiple objects from a fixed set (e.g., the shapes in Figure 1). In some scenarios, the composition function that specifies how examples are generated from other examples might be as simple as superposition by element-wise maximum or addition. However, a more powerful form of composition - and the main motivation for our work - is enabled by compositional embedding models, which are a new technique for few-shot learning.


Unsupervised Embedding Learning for Human Activity Recognition Using Wearable Sensor Data

arXiv.org Artificial Intelligence

The embedded sensors in widely used smartphones and other wearable devices make the data of human activities more accessible. However, recognizing different human activities from the wearable sensor data remains a challenging research problem in ubiquitous computing. One of the reasons is that the majority of the acquired data has no labels. In this paper, we present an unsupervised approach, which is based on the nature of human activity, to project the human activities into an embedding space in which similar activities will be located closely together. Using this, subsequent clustering algorithms can benefit from the embeddings, forming behavior clusters that represent the distinct activities performed by a person. Results of experiments on three labeled benchmark datasets demonstrate the effectiveness of the framework and show that our approach can help the clustering algorithm achieve improved performance in identifying and categorizing the underlying human activities compared to unsupervised techniques applied directly to the original data set.


A multi-modal representation of El Niño Southern Oscillation Diversity

arXiv.org Artificial Intelligence

The El Niño-Southern Oscillation (ENSO), characterized by anomalous sea surface temperature (SST) in the tropical Pacific, exhibits notable diversity in its temporal evolution and spatial distribution of anomalies. The El Niño events of 1982-83 and 1997-98, for instance, recorded exceptionally high sea surface temperature anomaly (SSTA) values in the eastern equatorial Pacific, whereas the El Niño of 2002-03 was notably less extreme and primarily restricted to the central equatorial Pacific (McPhaden, 2004). Although all three are categorized as El Niño events, the 2002-03 event exhibited global climate conditions distinct from those of the earlier two events. In order to describe these event-to-event differences, El Niño events have been categorized as Eastern Pacific (EP) and Central Pacific (CP) types (Capotondi et al., 2020). EP El Niño events typically have their peak SSTA in the Eastern Pacific and exhibit stronger intensities and a largely reduced zonal thermocline slope, resulting in the discharge of warm water from the equatorial thermocline. In contrast, CP events show peak SSTA in the Central Pacific and are comparatively weaker, with more limited changes in zonal thermocline slope and reduced warm water discharge (Kug, Jin, and An, 2009; Capotondi, 2013). Despite considerable research, the underlying causes of ENSO diversity remain elusive (Lee and McPhaden, 2010; Capotondi et al., 2015; Capotondi et al., 2020). And while some general circulation models (GCMs) do exhibit ENSO event-to-event differences, their representation of ENSO diversity appears to be model dependent and often differs in intensity, pattern and duration from what is observed (Cai et al., 2018). The different types of ENSO events have substantially different downstream impacts on the global climate and dynamics (Strnad et al., 2022).


Analysis of Elephant Movement in Sub-Saharan Africa: Ecological, Climatic, and Conservation Perspectives

arXiv.org Artificial Intelligence

The interaction between elephants and their environment has profound implications for both ecology and conservation strategies. This study presents an analytical approach to decipher the intricate patterns of elephant movement in Sub-Saharan Africa, concentrating on key ecological drivers such as seasonal variations and rainfall patterns. Despite the complexities surrounding these influential factors, our analysis provides a holistic view of elephant migratory behavior in the context of the dynamic African landscape. Our comprehensive approach enables us to predict the potential impact of these ecological determinants on elephant migration, a critical step in establishing informed conservation strategies. This projection is particularly crucial given the impacts of global climate change on seasonal and rainfall patterns, which could substantially influence elephant movements in the future. The findings of our work aim to not only advance the understanding of movement ecology but also foster a sustainable coexistence of humans and elephants in Sub-Saharan Africa. By predicting potential elephant routes, our work can inform strategies to minimize human-elephant conflict, effectively manage land use, and enhance anti-poaching efforts. This research underscores the importance of integrating movement ecology and climatic variables for effective wildlife management and conservation planning.


Syntactic vs Semantic Linear Abstraction and Refinement of Neural Networks

arXiv.org Artificial Intelligence

Abstraction is a key verification technique to improve scalability. However, its use for neural networks is so far extremely limited. Previous approaches for abstracting classification networks replace several neurons with one of them that is similar enough. We can classify the similarity as defined either syntactically (using quantities on the connections between neurons) or semantically (on the activation values of neurons for various inputs). Unfortunately, the previous approaches only achieve moderate reductions, when implemented at all. In this work, we provide a more flexible framework, where a neuron can be replaced with a linear combination of other neurons, improving the reduction. We apply this approach both on syntactic and semantic abstractions, and implement and evaluate them experimentally. Further, we introduce a refinement method for our abstractions, allowing for finding a better balance between reduction and precision.
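The idea of replacing a neuron with a linear combination of other neurons can be illustrated in the semantic setting, where each neuron is described by its activation values on a set of inputs. The sketch below fits the coefficients by ordinary least squares over two basis neurons; this fitting choice is an assumption made for illustration (the abstract does not say how coefficients are obtained), and the function name and activation values are hypothetical.

```python
def fit_two_coefficients(a, b, target):
    """Least-squares (x, y) minimising ||x*a + y*b - target||^2.

    a, b and target are activation vectors of three neurons recorded on the
    same inputs. Solves the 2x2 normal equations with Cramer's rule.
    """
    aa = sum(u * u for u in a)
    bb = sum(v * v for v in b)
    ab = sum(u * v for u, v in zip(a, b))
    at = sum(u * t for u, t in zip(a, target))
    bt = sum(v * t for v, t in zip(b, target))
    det = aa * bb - ab * ab
    if abs(det) < 1e-12:
        raise ValueError("basis activations are linearly dependent")
    x = (at * bb - ab * bt) / det
    y = (aa * bt - ab * at) / det
    return x, y


# Hypothetical activations of two kept neurons on four inputs.
kept_a = [1.0, 0.0, 2.0, 3.0]
kept_b = [0.0, 1.0, 1.0, 2.0]
# Activations of the neuron to be removed; here exactly 2*a + 3*b.
removed = [2.0, 3.0, 7.0, 12.0]

x, y = fit_two_coefficients(kept_a, kept_b, removed)
residual = max(abs(x * u + y * v - t)
               for u, v, t in zip(kept_a, kept_b, removed))
print(x, y, residual)
```

When the residual is small on the sampled inputs, the removed neuron's outgoing weights could be redistributed onto the kept neurons scaled by `x` and `y`; when it is large, the neuron is a poor candidate for this kind of replacement.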


Topological Point Cloud Clustering

arXiv.org Artificial Intelligence

We present Topological Point Cloud Clustering (TPCC), a new method to cluster points in an arbitrary point cloud based on their contribution to global topological features. TPCC synthesizes desirable features from spectral clustering and topological data analysis and is based on considering the spectral properties of a simplicial complex associated to the considered point cloud. As it is based on considering sparse eigenvector computations, TPCC is similarly easy to interpret and implement as spectral clustering. However, by focusing not just on a single matrix associated to a graph created from the point cloud data, but on a whole set of Hodge-Laplacians associated to an appropriately constructed simplicial complex, we can leverage a far richer set of topological features to characterize the data points within the point cloud and benefit from the relative robustness of topological techniques against noise. We test the performance of TPCC on both synthetic and real-world data and compare it with classical spectral clustering.
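For readers unfamiliar with Hodge-Laplacians, the sketch below builds them for the smallest interesting simplicial complex, a single filled triangle, using the standard definitions $L_0 = B_1 B_1^\top$ and $L_1 = B_1^\top B_1 + B_2 B_2^\top$ with $B_1$, $B_2$ the boundary matrices. This is not the TPCC method itself, only the objects it is built on; the helper names and the orientation convention (edges and the triangle oriented by increasing vertex index) are assumptions of this example.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Filled triangle on vertices 0, 1, 2.
# Columns of B1 are the edges (0,1), (0,2), (1,2); rows are the vertices.
# An edge (i, j) with i < j gets -1 at its tail i and +1 at its head j.
B1 = [[-1, -1,  0],
      [ 1,  0, -1],
      [ 0,  1,  1]]

# The single column of B2 is the triangle (0,1,2); rows are the edges.
# Its boundary is (1,2) - (0,2) + (0,1).
B2 = [[ 1],
      [-1],
      [ 1]]

# The boundary of a boundary is zero.
boundary_check = matmul(B1, B2)

# 0-th Hodge-Laplacian: the ordinary graph Laplacian on the vertices.
L0 = matmul(B1, transpose(B1))

# 1-st Hodge-Laplacian: acts on edges, combining the "down" and "up" parts.
L1 = matadd(matmul(transpose(B1), B1), matmul(B2, transpose(B2)))

print(boundary_check)
print(L0)
print(L1)
```

Here `L0` is the familiar graph Laplacian used by spectral clustering, while `L1` carries information about edges and filled triangles. For the filled triangle `L1` has no zero eigenvalue, which reflects that the triangle encloses no one-dimensional hole.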


Determination of the critical points for systems of directed percolation class using machine learning

arXiv.org Machine Learning

Recently, machine learning algorithms have been widely used to study equilibrium phase transitions; however, only a few works have applied this technique to nonequilibrium phase transitions. In this work, we use supervised learning with a convolutional neural network (CNN) and unsupervised learning with the density-based spatial clustering of applications with noise (DBSCAN) algorithm to study the nonequilibrium phase transition in two models. We use the CNN and DBSCAN to determine the critical points of the directed bond percolation (bond DP) model and the Domany-Kinzel cellular automaton (DK) model. Both models have been proven to have a nonequilibrium phase transition that belongs to the directed percolation (DP) universality class. For supervised learning, we train the CNN on images generated from Monte Carlo simulations of directed bond percolation, and then use the trained CNN to study the phase transition in both models. For unsupervised learning, we train DBSCAN on the raw data of the Monte Carlo simulations, retraining it each time we change the model or the lattice size. Our results from both algorithms show that, even for very small lattice sizes, the machine can predict the critical points accurately for both models. Finally, we note that the value of the critical point we find here for the bond DP model using the CNN or DBSCAN is exactly the same as the value found using transfer learning with a domain adversarial neural network (DANN) algorithm.