Clustering
Learned Indexing in Proteins: Extended Work on Substituting Complex Distance Calculations with Embedding and Clustering Techniques
Oฤพha, Jaroslav, Slaninรกkovรก, Terรฉzia, Gendiar, Martin, Antol, Matej, Dohnal, Vlastislav
Despite the constant evolution of similarity searching research, it continues to face the same challenges stemming from the complexity of the data, such as the curse of dimensionality and computationally expensive distance functions. Various machine learning techniques have proven capable of replacing elaborate mathematical models with combinations of simple linear functions, often gaining speed and simplicity at the cost of formal guarantees of accuracy and correctness of querying. The authors explore the potential of this research trend by presenting a lightweight solution for the complex problem of 3D protein structure search. The solution consists of three steps -- (i) transformation of 3D protein structural information into very compact vectors, (ii) use of a probabilistic model to group these vectors and respond to queries by returning a given number of similar objects, and (iii) a final filtering step which applies basic vector distance functions to refine the result.
TensorAnalyzer: Identification of Urban Patterns in Big Cities using Non-Negative Tensor Factorization
Silveira, Jaqueline, Garcรญa, Germain, Paiva, Afonso, Nery, Marcelo, Adorno, Sergio, Nonato, Luis Gustavo
Extracting relevant urban patterns from multiple data sources can be difficult using classical clustering algorithms since we have to make a suitable setup of the hyperparameters of the algorithms and deal with outliers. It should be addressed correctly to help urban planners in the decision-making process for the further development of a big city. For instance, experts' main interest in criminology is comprehending the relationship between crimes and the socio-economic characteristics at specific georeferenced locations. In addition, the classical clustering algorithms take little notice of the intricate spatial correlations in georeferenced data sources. This paper presents a new approach to detecting the most relevant urban patterns from multiple data sources based on tensor decomposition. Compared to classical methods, the proposed approach's performance is attested to validate the identified patterns' quality. The result indicates that the approach can effectively identify functional patterns to characterize the data set for further analysis in achieving good clustering quality. Furthermore, we developed a generic framework named TensorAnalyzer, where the effectiveness and usefulness of the proposed methodology are tested by a set of experiments and a real-world case study showing the relationship between the crime events around schools and students performance and other variables involved in the analysis.
Clustering Semantic Predicates in the Open Research Knowledge Graph
Oghli, Omar Arab, D'Souza, Jennifer, Auer, Sรถren
When semantically describing knowledge graphs (KGs), users have to make a critical choice of a vocabulary (i.e. predicates and resources). The success of KG building is determined by the convergence of shared vocabularies so that meaning can be established. The typical lifecycle for a new KG construction can be defined as follows: nascent phases of graph construction experience terminology divergence, while later phases of graph construction experience terminology convergence and reuse. In this paper, we describe our approach tailoring two AI-based clustering algorithms for recommending predicates (in RDF statements) about resources in the Open Research Knowledge Graph (ORKG) https://orkg.org/. Such a service to recommend existing predicates to semantify new incoming data of scholarly publications is of paramount importance for fostering terminology convergence in the ORKG. Our experiments show very promising results: a high precision with relatively high recall in linear runtime performance. Furthermore, this work offers novel insights into the predicate groups that automatically accrue loosely as generic semantification patterns for semantification of scholarly knowledge spanning 44 research fields.
A novel non-linear transformation based multi-user identification algorithm for fixed text keystroke behavioral dynamics
Sahu, Chinmay, Banavar, Mahesh, Schuckers, Stephanie
Abstract--In this paper, we propose a new technique to uniquely classify and identify multiple users accessing a single application using keystroke dynamics. This problem is usually encountered when multiple users have legitimate access to shared computers and accounts, where, at times, one user can inadvertently be logged in on another user's account. Since the login processes are usually bypassed at this stage, we rely on keystroke dynamics in order to tell users apart. Our algorithm uses the quantile transform and techniques from localization to classify and identify users. Specifically, we use an algorithm known as ordinal Unfolding based Localization (UNLOC), which uses only ordinal data obtained from comparing distance proxies, by "locating" users in a reduced PCA/Kernel-PCA/t-SNE space based on their typing patterns. Our results are validated with the help of benchmark keystroke datasets and show that our algorithm outperforms other methods. In this paper, we consider With increasing digital presence, securing sensitive and personal both sources of keystrokes. In general, systems authentication [9], [12], [14], where a profile is built for only or web applications utilize one-time authentication using one user. The algorithms used in single-user authentication single sign-on for providing security. Banking and financial determine whether the user at the keyboard is the user in the institutions generally use a knowledge-based mechanism to model.
Efficient search of active inference policy spaces using k-means
Kiefer, Alex B., Albarracin, Mahault
We develop an approach to policy selection in active inference that allows us to efficiently search large policy spaces by mapping each policy to its embedding in a vector space. We sample the expected free energy of representative points in the space, then perform a more thorough policy search around the most promising point in this initial sample. We consider various approaches to creating the policy embedding space, and propose using k-means clustering to select representative points. We apply our technique to a goal-oriented graph-traversal problem, for which naive policy selection is intractable for even moderately large graphs.
Detection and Evaluation of Clusters within Sequential Data
Van Werde, Alexander, Senen-Cerda, Albert, Kosmella, Gianluca, Sanders, Jaron
Motivated by theoretical advancements in dimensionality reduction techniques we use a recent model, called Block Markov Chains, to conduct a practical study of clustering in real-world sequential data. Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees and can be deployed in sparse data regimes. Despite these favorable theoretical properties, a thorough evaluation of these algorithms in realistic settings has been lacking. We address this issue and investigate the suitability of these clustering algorithms in exploratory data analysis of real-world sequential data. In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets. In order to evaluate the determined clusters, and the associated Block Markov Chain model, we further develop a set of evaluation tools. These tools include benchmarking, spectral noise analysis and statistical model selection tools. An efficient implementation of the clustering algorithm and the new evaluation tools is made available together with this paper. Practical challenges associated to real-world data are encountered and discussed. It is ultimately found that the Block Markov Chain model assumption, together with the tools developed here, can indeed produce meaningful insights in exploratory data analyses despite the complexity and sparsity of real-world data.
On the Robustness of Deep Clustering Models: Adversarial Attacks and Defenses
Chhabra, Anshuman, Sekhari, Ashwin, Mohapatra, Prasant
Clustering models constitute a class of unsupervised machine learning methods which are used in a number of application pipelines, and play a vital role in modern data science. With recent advancements in deep learning -- deep clustering models have emerged as the current state-of-the-art over traditional clustering approaches, especially for high-dimensional image datasets. While traditional clustering approaches have been analyzed from a robustness perspective, no prior work has investigated adversarial attacks and robustness for deep clustering models in a principled manner. To bridge this gap, we propose a blackbox attack using Generative Adversarial Networks (GANs) where the adversary does not know which deep clustering model is being used, but can query it for outputs. We analyze our attack against multiple state-of-the-art deep clustering models and real-world datasets, and find that it is highly successful. We then employ some natural unsupervised defense approaches, but find that these are unable to mitigate our attack. Finally, we attack Face++, a production-level face clustering API service, and find that we can significantly reduce its performance as well. Through this work, we thus aim to motivate the need for truly robust deep clustering models.
A Framework for Web Services Retrieval Using Bio Inspired Clustering
Rayasam, Anirudha, Thota, Siddhartha R, Bukkittu, Avinash N, Kamath, Sowmya
Efficiently discovering relevant Web services with respect to a specific user query has become a growing challenge owing to the incredible growth in the field of web technologies. In previous works, different clustering models have been used to address these issues. But, most of the traditional clustering techniques are computationally intensive and fail to address all the problems involved. Also, the current standards fail to incorporate the semantic relatedness of Web services during clustering and retrieval resulting in decreased performance. In this paper, we propose a framework for web services retrieval that uses a bottom-up, decentralized and self organising approach to cluster available services. It also provides online, dynamic computation of clusters thus overcoming the drawbacks of traditional clustering methods. We also use the semantic similarity between Web services for the clustering process to enhance the precision and lower the recall.
Review of Clustering Methods for Functional Data
Functional data clustering is to identify heterogeneous morphological patterns in the continuous functions underlying the discrete measurements/observations. Application of functional data clustering has appeared in many publications across various fields of sciences, including but not limited to biology, (bio)chemistry, engineering, environmental science, medical science, psychology, social science, etc. The phenomenal growth of the application of functional data clustering indicates the urgent need for a systematic approach to develop efficient clustering methods and scalable algorithmic implementations. On the other hand, there is abundant literature on the cluster analysis of time series, trajectory data, spatio-temporal data, etc., which are all related to functional data. Therefore, an overarching structure of existing functional data clustering methods will enable the cross-pollination of ideas across various research fields. We here conduct a comprehensive review of original clustering methods for functional data. We propose a systematic taxonomy that explores the connections and differences among the existing functional data clustering methods and relates them to the conventional multivariate clustering methods. The structure of the taxonomy is built on three main attributes of a functional data clustering method and therefore is more reliable than existing categorizations. The review aims to bridge the gap between the functional data analysis community and the clustering community and to generate new principles for functional data clustering.