
 Clustering


Prismatic: Interactive Multi-View Cluster Analysis of Concept Stocks

arXiv.org Artificial Intelligence

Prismatic enables interactive cluster analysis with three key analytical processes: 1) cluster generation by holistically overviewing the dynamic data-driven clusters, 2) cluster exploration by contextualizing the clusters with business relational knowledge, and 3) cluster validation by analyzing temporal correlation patterns at different time scales and time horizons. Qualitative analysis within Prismatic relies on business relational knowledge formulated in a multilayer network. We employed a multi-view clustering method to consolidate the multiple facets and augment correlation-based ...

... have developed hierarchical clusters (e.g., economic sectors such as energy and real estate) qualitatively to describe the affinity of different business entities based on their market coverage and product specialization [2]. To address rapid market changes, professional traders have introduced concept stocks [3], hereafter "concepts," to symbolize companies with shared business operations or similar business models in the short term. Entities within the cluster are influenced by similar sets of economic factors that induce business-specific risks and opportunities.


DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models

arXiv.org Artificial Intelligence

Effective DNA embedding remains crucial in genomic analysis, particularly in scenarios lacking labeled data for model fine-tuning, despite the significant advancements in genome foundation models. A prime example is metagenomics binning, a critical process in microbiome research that aims to group DNA sequences by their species from a complex mixture of DNA sequences derived from potentially thousands of distinct, often uncharacterized species. To fill the lack of effective DNA embedding models, we introduce DNABERT-S, a genome foundation model that specializes in creating species-aware DNA embeddings. To encourage effective embeddings for error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C$^2$LR) strategy. Empirical results on 18 diverse datasets showed DNABERT-S's remarkable performance. With just 2-shot training, it outperforms the top baseline's 10-shot species classification performance, while doubling the Adjusted Rand Index (ARI) in species clustering and substantially increasing the number of correctly identified species in metagenomics binning. The code, data, and pre-trained model are publicly available at https://github.com/Zhihan1996/DNABERT_S.
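For readers unfamiliar with the Adjusted Rand Index mentioned above, the following is a minimal sketch of the standard pair-counting definition. It is not taken from the DNABERT-S code; the function name `adjusted_rand_index` is chosen here for illustration, and the input is assumed to contain at least two items.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same items (n >= 2)."""
    n = len(labels_true)
    # Pairs of items that share a cluster in both labelings.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    # Pairs that share a cluster in each labeling taken on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The score is 1 for identical partitions (regardless of label names), close to 0 for agreement at chance level, and can be negative when the agreement is worse than chance.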


An Investigation into Using Unsupervised Metrics to Optimise GNNs for Node Clustering

arXiv.org Artificial Intelligence

Graph Neural Networks (GNNs) can be trained to detect communities within a graph by learning from the duality of feature and connectivity information. Currently, the common approach for optimisation of GNNs is to use comparisons to ground-truth for hyperparameter tuning and model selection. In this work, we show that nodes can be clustered into communities with GNNs by solely optimising for modularity, without any comparison to ground-truth. Although modularity is a graph partitioning quality metric, we show that it can be used to optimise GNNs that also encode features without a drop in performance. We take it a step further and also study whether the unsupervised metric performance can predict ground-truth performance. To investigate why modularity can be used to optimise GNNs, we design synthetic experiments that show the limitations of this approach. The synthetic graphs are created to highlight current capabilities in distinct, random and zero information space partitions in attributed graphs. We conclude that modularity can be used for hyperparameter optimisation and model selection on real-world datasets, as well as being a suitable proxy for predicting ground-truth performance; however, GNNs fail to balance the information duality when the spaces contain conflicting signals.
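As a rough illustration of the metric discussed above, the sketch below computes Newman modularity for a hard partition of an undirected, unweighted graph. It is an assumption-laden toy, not the paper's training objective: the function name `modularity` and the edge-list input format are chosen here, and self-loops are not handled.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition.

    edges: list of (u, v) pairs, each undirected edge listed once, no self-loops.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    intra = defaultdict(int)         # edges with both endpoints in community c
    total_degree = defaultdict(int)  # sum of node degrees in community c
    for u, v in edges:
        total_degree[community[u]] += 1
        total_degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    # Q = sum over communities of (L_c / m) - (d_c / 2m)^2
    return sum(intra[c] / m - (total_degree[c] / (2 * m)) ** 2
               for c in total_degree)
```

For two triangles joined by a single edge, splitting the graph into the two triangles scores higher than placing all nodes in one community, which scores exactly zero. No ground-truth labels are needed to compute either value.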


Hierarchical Position Embedding of Graphs with Landmarks and Clustering for Link Prediction

arXiv.org Artificial Intelligence

Learning positional information of nodes in a graph is important for link prediction tasks. We propose a representation of positional information using representative nodes called landmarks. A small number of nodes with high degree centrality are selected as landmarks, which serve as reference points for the nodes' positions. We justify this selection strategy for well-known random graph models and derive closed-form bounds on the average path lengths involving landmarks. In a model for power-law graphs, we prove that landmarks provide asymptotically exact information on inter-node distances. We apply theoretical insights to practical networks and propose Hierarchical Position embedding with Landmarks and Clustering (HPLC). HPLC combines landmark selection and graph clustering, where the graph is partitioned into densely connected clusters in which nodes with the highest degree are selected as landmarks. HPLC leverages the positional information of nodes based on landmarks at various levels of hierarchy, such as nodes' distances to landmarks, inter-landmark distances and hierarchical grouping of clusters. Experiments show that HPLC achieves state-of-the-art link prediction performance on various datasets in terms of HIT@K, MRR, and AUC. The code is available at https://github.com/kmswin1/HPLC.
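The landmark idea can be sketched in a few lines. The example below is a simplification and not the HPLC implementation: it picks the highest-degree nodes globally rather than per cluster, and the names `select_landmarks`, `bfs_distances` and `landmark_features` are hypothetical. Nodes are assumed to be sortable (e.g. integers) and the graph is given as an adjacency dict.

```python
from collections import deque


def select_landmarks(adj, k):
    """Return the k highest-degree nodes (ties broken by node id)."""
    return sorted(adj, key=lambda n: (-len(adj[n]), n))[:k]


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def landmark_features(adj, k):
    """Describe each node by its distances to the k landmarks.

    Unreachable landmarks yield None in the corresponding position.
    """
    landmarks = select_landmarks(adj, k)
    per_landmark = [bfs_distances(adj, lm) for lm in landmarks]
    features = {n: [d.get(n) for d in per_landmark] for n in adj}
    return landmarks, features
```

Each node ends up with a short vector of hop counts to the landmarks, which is one simple way to encode where it sits in the graph.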


Randomized Algorithms for Symmetric Nonnegative Matrix Factorization

arXiv.org Artificial Intelligence

We propose the first randomized algorithms for Symmetric Nonnegative Matrix Factorization (SymNMF). Nonnegative Matrix Factorization (NMF) is an important method in data analysis with applications to data visualization, text mining, feature learning, information fusion and more [28, 25, 46, 22, 11]. SymNMF is a variant of NMF where the input matrix is symmetric and the output low-rank approximation is also constrained to be symmetric [25, 49]. Applications of SymNMF include (hyper)graph clustering, image segmentation, and information fusion [44, 10, 19, 5, 6]. Several randomized algorithms for nonsymmetric NMF have been previously proposed and shown to be effective for dense and small sparse problems [43, 41, 13], but as far as we are aware there is no prior work on randomized algorithms for SymNMF. Our contributions in this work include a randomized algorithm for SymNMF we call "Low-rank Approximated Input SymNMF" (LAI-SymNMF), a randomized algorithm based on leverage score sampling for least squares problems we call LvS-SymNMF, novel theoretical analysis of leverage score sampling for the Nonnegative Least Squares problem and theoretical analysis of a hybrid sampling scheme for leverage score sampling. The rest of the paper is organized as follows. Section 2, which discusses background material including non-randomized SymNMF algorithms, reviews existing randomized NMF methods and other related work such as randomized methods for other low-rank matrix decompositions and tensor decompositions.
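For orientation, SymNMF is commonly written as the optimization problem below. This is the standard formulation from the SymNMF literature rather than a quotation from this paper; the symbols $X$ (the symmetric $n \times n$ input), $H$ (the nonnegative $n \times k$ factor) and $k$ (the target rank) are notation assumed here.

```latex
$$
\min_{H \in \mathbb{R}^{n \times k},\; H \ge 0} \; \left\lVert X - H H^{\top} \right\rVert_F^2
$$
```

The single factor $H$ appears on both sides of the product, which is what forces the low-rank approximation $H H^{\top}$ to be symmetric.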


Label-Efficient Model Selection for Text Generation

arXiv.org Artificial Intelligence

Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models. DiffUse reduces the required amount of preference annotations, thus saving valuable time and resources in performing evaluation. DiffUse intelligently selects instances by clustering embeddings that represent the semantic differences between model outputs. Thus, it is able to identify a subset of examples that are more informative for preference decisions. Our method is model-agnostic, and can be applied to any text generation model. Moreover, we propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations -- by up to 75% -- while maintaining high evaluation reliability.


ClusterTabNet: Supervised clustering method for table detection and table structure recognition

arXiv.org Artificial Intelligence

We present a novel deep-learning-based method to cluster words in documents which we apply to detect and recognize tables given the OCR output. We interpret table structure bottom-up as a graph of relations between pairs of words (belonging to the same row, column, header, as well as to the same table) and use a transformer encoder model to predict its adjacency matrix. We demonstrate the performance of our method on the PubTables-1M dataset as well as PubTabNet and FinTabNet datasets. Compared to the current state-of-the-art detection methods such as DETR and Faster R-CNN, our method achieves similar or better accuracy, while requiring a significantly smaller model.
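One simple way to turn a predicted pairwise relation matrix into clusters is to threshold it and take connected components. The abstract does not say how ClusterTabNet performs this step, so the sketch below is purely an assumption; `clusters_from_adjacency`, the averaging of the two directions, and the 0.5 threshold are all illustrative choices.

```python
def clusters_from_adjacency(scores, threshold=0.5):
    """Group items whose predicted pairwise score reaches the threshold.

    scores: square list of lists; scores[i][j] is the predicted probability
    that words i and j belong to the same group (e.g. the same row).
    """
    n = len(scores)
    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Average both directions, since predictions need not be symmetric.
            if (scores[i][j] + scores[j][i]) / 2 >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Running this once per relation type (row, column, header, table) would give one grouping of words for each relation.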


Modeling and predicting students' engagement behaviors using mixture Markov models

arXiv.org Artificial Intelligence

Students' engagement reflects their level of involvement in an ongoing learning process, which can be estimated through their interactions with a computer-based learning or assessment system. A prerequisite for stimulating student engagement is an approximate representation model for comprehending students' varied (dis)engagement behaviors. In this paper, we use model-based clustering for this purpose, which generates K mixture Markov models to group students' traces containing their (dis)engagement behavioral patterns. To prevent the Expectation-Maximization (EM) algorithm from getting stuck in a local maximum, we also introduce a K-means-based initialization method named K-EM. We ran experiments on two real datasets using three variants of the EM algorithm (the original EM, emEM, and K-EM) and non-mixture baseline models for both datasets. The proposed K-EM has shown very promising results and achieved a significant performance difference in comparison with the other approaches, particularly using the Dataset. Hence, we suggest performing further experiments using large dataset(s) to validate our method. Additionally, visualization of the resultant clusters through first-order Markov chains reveals very useful insights about (dis)engagement behaviors depicted by the students. We conclude the paper with a discussion on the usefulness of our approach, limitations and potential extensions of this work.
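The building block of such a model is a first-order Markov chain estimated from behavior traces. The sketch below shows only that single-chain step, using maximum-likelihood transition estimates; it is not the paper's K-EM procedure, and the function name and the "E"/"D" state labels in the usage note are hypothetical.

```python
from collections import Counter, defaultdict


def transition_matrix(sequences):
    """Maximum-likelihood first-order transition probabilities.

    sequences: iterable of state sequences (lists or strings).
    Returns a nested dict P where P[a][b] estimates Pr(next = b | current = a).
    States that never appear as a 'current' state have no row.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(following.values()) for nxt, c in following.items()}
        for state, following in counts.items()
    }
```

For example, `transition_matrix([["E", "E", "D"], ["E", "D", "D"]])` estimates how often an engaged step ("E") is followed by a disengaged one ("D"). A mixture model would maintain K such matrices and assign traces to them softly during EM.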


Clustering Techniques Selection for a Hybrid Regression Model: A Case Study Based on a Solar Thermal System

arXiv.org Artificial Intelligence

This work compares the performance of four clustering techniques with the objective of achieving strong hybrid models in supervised learning tasks. A real dataset has been collected from a bio-climatic house named Sotavento, located on an experimental wind farm in Xermade (Lugo), Galicia (Spain). The authors chose the thermal solar generation system in order to study how well several clustering methods, each followed by a regression technique, predict the output temperature of the system. To assess the quality of each clustering method, two possible solutions have been implemented. The first is based on three unsupervised learning metrics (Silhouette, Calinski-Harabasz and Davies-Bouldin), while the second employs the most common error measurements for a regression algorithm such as the Multi Layer Perceptron.


A Scalable Algorithm for Individually Fair K-means Clustering

arXiv.org Artificial Intelligence

We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al. Given $n$ points $P$ in a metric space, let $\delta(x)$ for $x\in P$ be the radius of the smallest ball around $x$ containing at least $n / k$ points. A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x\in P$. While good approximation algorithms are known for this problem, no efficient practical algorithms with good theoretical guarantees have been presented. We design the first fast local-search algorithm that runs in $\tilde{O}(nk^2)$ time and obtains a bicriteria $(O(1), 6)$ approximation. Then we show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
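The fairness definition above can be checked directly by brute force. The following sketch is only a checker, not the paper's local-search algorithm. It assumes Euclidean points, counts $x$ itself among the points in its ball, and rounds $n / k$ up; the names `fair_radius` and `is_individually_fair` and the relaxation factor `alpha` are hypothetical.

```python
import math


def fair_radius(points, k):
    """delta(x) for every point: distance to its ceil(n/k)-th nearest point.

    Assumption: x counts as one of the points inside its own ball.
    Runs in O(n^2 log n) time, so it is only suitable for small inputs.
    """
    need = math.ceil(len(points) / k)
    radii = []
    for x in points:
        dists = sorted(math.dist(x, y) for y in points)
        radii.append(dists[need - 1])
    return radii


def is_individually_fair(points, centers, k, alpha=1.0):
    """True if every point has a center within alpha * delta(x)."""
    radii = fair_radius(points, k)
    return all(
        min(math.dist(x, c) for c in centers) <= alpha * r
        for x, r in zip(points, radii)
    )
```

With six points on a line at 0, 1, 2, 10, 11, 12 and $k = 2$, centers at 1 and 11 satisfy the condition, while centers at 1 and 2 do not, because the points near 10 have no center within their radius.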