AITopics

2402.08978

Country:

North America > United States > New York (0.04)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Texas (0.04)

Genre:

Research Report (1.00)
Overview (0.93)

Industry:

Banking & Finance > Trading (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)
Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

arXiv.org Artificial Intelligence, Feb-14-2024

DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models

Zhou, Zhihan, Wu, Weimin, Ho, Harrison, Wang, Jiayi, Shi, Lizhen, Davuluri, Ramana V, Wang, Zhong, Liu, Han

Effective DNA embedding remains crucial in genomic analysis, particularly in scenarios lacking labeled data for model fine-tuning, despite the significant advancements in genome foundation models. A prime example is metagenomics binning, a critical process in microbiome research that aims to group DNA sequences by their species from a complex mixture of DNA sequences derived from potentially thousands of distinct, often uncharacterized species. To address the lack of effective DNA embedding models, we introduce DNABERT-S, a genome foundation model that specializes in creating species-aware DNA embeddings. To encourage effective embeddings of error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C$^2$LR) strategy. Empirical results on 18 diverse datasets showed DNABERT-S's remarkable performance. With just 2-shot training, it outperforms the top baseline's 10-shot species classification performance, while doubling the Adjusted Rand Index (ARI) in species clustering and substantially increasing the number of correctly identified species in metagenomics binning. The code, data, and pre-trained model are publicly available at https://github.com/Zhihan1996/DNABERT_S.
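The abstract reports clustering quality as the Adjusted Rand Index (ARI). For readers unfamiliar with the metric, the following is a minimal sketch of the standard ARI formula in plain Python. The function name and the toy labels are illustrative assumptions and are not taken from the DNABERT-S code base.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard ARI between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts: how many items share (true label, predicted label).
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # chance-level agreement
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (max_index - expected)
```

ARI is invariant to how clusters are named, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` scores a perfect match, while a labeling that splits every true cluster in half scores below zero.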

dna sequence, dnabert-s, sequence, (11 more...)

2402.08777

Country:

North America > United States > New York > Suffolk County > Stony Brook (0.04)
North America > United States > Illinois > Cook County > Evanston (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > California > Merced County > Merced (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Leeney, William, McConville, Ryan

An Investigation into Using Unsupervised Metrics to Optimise GNNs for Node Clustering

Graph Neural Networks (GNNs) can be trained to detect communities within a graph by learning from the duality of feature and connectivity information. Currently, the common approach for optimisation of GNNs is to use comparisons to ground-truth for hyperparameter tuning and model selection. In this work, we show that nodes can be clustered into communities with GNNs by solely optimising for modularity, without any comparison to ground-truth. Although modularity is a graph partitioning quality metric, we show that it can be used to optimise GNNs that also encode features without a drop in performance. We take it a step further and also study whether the unsupervised metric performance can predict ground-truth performance. To investigate why modularity can be used to optimise GNNs, we design synthetic experiments that show the limitations of this approach. The synthetic graphs are created to highlight current capabilities in distinct, random and zero information space partitions in attributed graphs. We conclude that modularity can be used for hyperparameter optimisation and model selection on real-world datasets, and that it is a suitable proxy for predicting ground-truth performance; however, GNNs fail to balance the information duality when the spaces contain conflicting signals.
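Modularity, the unsupervised metric discussed above, can be computed directly from an edge list and a node-to-community assignment. The sketch below uses the standard Newman formulation for an undirected, unweighted graph. The function name and input layout are assumptions made for illustration, and it is not the authors' evaluation code.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q = sum_c [ L_c / m - (d_c / 2m)^2 ].

    edges:      list of (u, v) pairs, each undirected edge listed once
    community:  dict mapping node -> community label
    L_c is the number of edges inside community c, d_c is the total degree
    of its nodes, and m is the number of edges.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")

    intra_edges = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            intra_edges[community[u]] += 1

    return sum(
        intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )
```

For two triangles joined by a single bridge edge, assigning each triangle to its own community gives Q = 5/14, whereas placing all nodes in one community gives Q = 0. No ground-truth labels are needed at any point, which is what makes the metric usable for model selection.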

algorithm, dataset, model selection, (14 more...)

2402.07845

Country:

North America > United States > Texas (0.05)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Nepal (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Kim, Minsang, Baek, Seungjun

Hierarchical Position Embedding of Graphs with Landmarks and Clustering for Link Prediction

Learning positional information of nodes in a graph is important for link prediction tasks. We propose a representation of positional information using representative nodes called landmarks. A small number of nodes with high degree centrality are selected as landmarks, which serve as reference points for the nodes' positions. We justify this selection strategy for well-known random graph models and derive closed-form bounds on the average path lengths involving landmarks. In a model for power-law graphs, we prove that landmarks provide asymptotically exact information on inter-node distances. We apply theoretical insights to practical networks and propose Hierarchical Position embedding with Landmarks and Clustering (HPLC). HPLC combines landmark selection and graph clustering, where the graph is partitioned into densely connected clusters in which nodes with the highest degree are selected as landmarks. HPLC leverages the positional information of nodes based on landmarks at various levels of hierarchy such as nodes' distances to landmarks, inter-landmark distances and hierarchical grouping of clusters. Experiments show that HPLC achieves state-of-the-art link prediction performance on various datasets in terms of HIT@K, MRR, and AUC. The code is available at https://github.com/kmswin1/HPLC.
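The basic landmark idea described above (pick high-degree nodes, then describe every node by its distances to them) can be sketched in a few lines. This is only the flat, single-level version: it omits the clustering and hierarchy that HPLC adds, and all function names are hypothetical rather than taken from the linked repository.

```python
from collections import deque


def select_landmarks(adj, k):
    """Return the k nodes with the highest degree (ties broken by node id)."""
    return sorted(adj, key=lambda node: (-len(adj[node]), node))[:k]


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def landmark_embedding(adj, k):
    """Map each node to its vector of hop distances to the k landmarks.

    adj is a dict: node -> list of neighbours (undirected graph).
    Unreachable landmarks are encoded as infinity.
    """
    landmarks = select_landmarks(adj, k)
    per_landmark = [bfs_distances(adj, lm) for lm in landmarks]
    embedding = {
        node: [d.get(node, float("inf")) for d in per_landmark]
        for node in adj
    }
    return landmarks, embedding
```

Each node costs only k numbers to describe, and the landmark distances give an upper bound on the distance between any two nodes via the triangle inequality.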

graph, landmark, node, (13 more...)

doi: 10.1145/3589334.3645372

2402.08174

Country:

Asia > Singapore (0.06)
Asia > South Korea > Seoul > Seoul (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Randomized Algorithms for Symmetric Nonnegative Matrix Factorization

Hayashi, Koby, Aksoy, Sinan G., Ballard, Grey, Park, Haesun

We propose the first randomized algorithms for Symmetric Nonnegative Matrix Factorization (SymNMF). Nonnegative Matrix Factorization (NMF) is an important method in data analysis with applications to data visualization, text mining, feature learning, information fusion and more [28, 25, 46, 22, 11]. SymNMF is a variant of NMF where the input matrix is symmetric and the output low-rank approximation is also constrained to be symmetric [25, 49]. Applications of SymNMF include (hyper)graph clustering, image segmentation, and information fusion [44, 10, 19, 5, 6]. Several randomized algorithms for nonsymmetric NMF have been previously proposed and shown to be effective for dense and small sparse problems [43, 41, 13], but as far as we are aware there is no prior work on randomized algorithms for SymNMF. Our contributions in this work include a randomized algorithm for SymNMF that we call "Low-rank Approximated Input SymNMF" (LAI-SymNMF); a randomized algorithm based on leverage score sampling for least squares problems that we call LvS-SymNMF; a novel theoretical analysis of leverage score sampling for the Nonnegative Least Squares problem; and a theoretical analysis of a hybrid sampling scheme for leverage score sampling. The rest of the paper is organized as follows. Section 2 discusses background material, including non-randomized SymNMF algorithms, and reviews existing randomized NMF methods and other related work, such as randomized methods for other low-rank matrix decompositions and tensor decompositions.
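For orientation, the SymNMF problem described above is commonly written as the optimization below. The symbols are assumptions, since the abstract does not fix notation: $A$ is the symmetric $n \times n$ input matrix, $H$ is the nonnegative $n \times k$ factor, and $k$ is the target rank.

```latex
\min_{H \in \mathbb{R}^{n \times k},\; H \ge 0} \; \left\| A - H H^{\top} \right\|_F^2,
\qquad A = A^{\top} \in \mathbb{R}^{n \times n}, \quad k \ll n .
```

In contrast, nonsymmetric NMF approximates a matrix as $W H^{\top}$ with two independent nonnegative factors; SymNMF ties them together ($W = H$), which is what constrains the low-rank approximation to be symmetric.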

algorithm, equation, leverage score, (15 more...)

2402.08134

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > China (0.04)
Africa > Senegal > Kolda Region > Kolda (0.04)

Genre: Research Report (1.00)

Industry:

Education (0.46)
Health & Medicine (0.45)
Government > Regional Government > North America Government > United States Government (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Label-Efficient Model Selection for Text Generation

Ashury-Tahan, Shir, Sznajder, Benjamin, Choshen, Leshem, Ein-Dor, Liat, Shnarch, Eyal, Gera, Ariel

Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models. DiffUse reduces the required amount of preference annotations, thus saving valuable time and resources in performing evaluation. DiffUse intelligently selects instances by clustering embeddings that represent the semantic differences between model outputs. Thus, it is able to identify a subset of examples that are more informative for preference decisions. Our method is model-agnostic, and can be applied to any text generation model. Moreover, we propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations -- by up to 75% -- while maintaining high evaluation reliability.

annotated example number, h-ward, naturalquestion, (14 more...)

2402.07891

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Polewczyk, Marek, Spinaci, Marco

ClusterTabNet: Supervised clustering method for table detection and table structure recognition

We present a novel deep-learning-based method to cluster words in documents which we apply to detect and recognize tables given the OCR output. We interpret table structure bottom-up as a graph of relations between pairs of words (belonging to the same row, column, header, as well as to the same table) and use a transformer encoder model to predict its adjacency matrix. We demonstrate the performance of our method on the PubTables-1M dataset as well as PubTabNet and FinTabNet datasets. Compared to the current state-of-the-art detection methods such as DETR and Faster R-CNN, our method achieves similar or better accuracy, while requiring a significantly smaller model.

dataset, recognition, table detection, (14 more...)

2402.07502

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.64)

arXiv.org Artificial Intelligence, Feb-10-2024

Modeling and predicting students' engagement behaviors using mixture Markov models

Maqsood, R., Ceravolo, P., Romero, C., Ventura, S.

Students' engagement reflects their level of involvement in an ongoing learning process, which can be estimated through their interactions with a computer-based learning or assessment system. A prerequisite for stimulating student engagement is an approximate representation model for comprehending students' varied (dis)engagement behaviors. In this paper, we use model-based clustering for this purpose, generating K mixture Markov models to group students' traces containing their (dis)engagement behavioral patterns. To prevent the Expectation-Maximization (EM) algorithm from getting stuck in a local maximum, we also introduce a K-means-based initialization method named K-EM. We performed experiments on two real datasets using three variants of the EM algorithm (the original EM, emEM, and K-EM) and non-mixture baseline models for both datasets. The proposed K-EM has shown very promising results and achieved significant performance difference in comparison with the other approaches particularly using the Dataset. Hence, we suggest performing further experiments using large dataset(s) to validate our method. Additionally, visualization of the resultant clusters through first-order Markov chains reveals very useful insights about (dis)engagement behaviors depicted by the students. We conclude the paper with a discussion on the usefulness of our approach, limitations and potential extensions of this work.
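Each component of a mixture Markov model is a first-order Markov chain. The sketch below shows only that building block: estimating transition probabilities from traces by counting, and scoring a trace under the fitted chain. The function names and the "on"/"off" state labels are illustrative assumptions; the EM loop, the K-EM initialization, and the initial-state distribution are deliberately left out.

```python
from collections import defaultdict
from math import log


def fit_transitions(traces):
    """Maximum-likelihood transition probabilities of a first-order chain.

    traces: list of state sequences, e.g. [["on", "on", "off"], ...]
    Returns a nested dict: probs[a][b] = P(next = b | current = a).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            counts[current][nxt] += 1

    probs = {}
    for current, successors in counts.items():
        total = sum(successors.values())
        probs[current] = {s: c / total for s, c in successors.items()}
    return probs


def log_likelihood(trace, probs):
    """Log-probability of the transitions in trace (initial state ignored)."""
    total = 0.0
    for current, nxt in zip(trace, trace[1:]):
        p = probs.get(current, {}).get(nxt, 0.0)
        if p == 0.0:
            return float("-inf")  # transition never seen in training
        total += log(p)
    return total
```

In a mixture setting, an E-step would compare such log-likelihoods across the K chains to decide how strongly each trace belongs to each cluster, and an M-step would refit each chain from the traces weighted by those memberships.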

algorithm, behavioral pattern, student, (13 more...)

2403.05556

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Italy > Lombardy > Milan (0.14)
Asia > Pakistan (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education > Educational Technology > Educational Software > Computer Based Training (1.00)
Education > Educational Setting > Online (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)

García-Ordás, María Teresa, Alaiz-Moretón, Héctor, Casteleiro-Roca, José-Luis, Jove, Esteban, Benítez-Andrades, José Alberto, García-Rodríguez, Isaías, Quintián, Héctor, Calvo-Rolle, José Luis

Clustering Techniques Selection for a Hybrid Regression Model: A Case Study Based on a Solar Thermal System

arXiv.org Artificial Intelligence, Feb-10-2024

This work compares the performance of four clustering techniques with the objective of achieving strong hybrid models in supervised learning tasks. A real dataset has been collected from a bio-climatic house named Sotavento, located on an experimental wind farm in Xermade (Lugo), Galicia (Spain). The authors have chosen the thermal solar generation system in order to study how well several clustering methods, each followed by a regression technique, predict the output temperature of the system. To assess the quality of each clustering method, two approaches have been implemented. The first is based on three unsupervised learning metrics (Silhouette, Calinski-Harabasz and Davies-Bouldin), while the second employs the most common error measurements for a regression algorithm such as Multi Layer Perceptron.

algorithm, engineering, university, (17 more...)

doi: 10.1080/01969722.2022.2030006

2402.06921

Country:

Europe > Spain > Castile and León > León Province > León (0.05)
Europe > Spain > Galicia > A Coruña Province > A Coruña (0.05)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Industry: Energy > Renewable > Solar (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Bateni, MohammadHossein, Cohen-Addad, Vincent, Epasto, Alessandro, Lattanzi, Silvio

A Scalable Algorithm for Individually Fair K-means Clustering

arXiv.org Artificial Intelligence, Feb-9-2024

We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al. Given $n$ points $P$ in a metric space, let $\delta(x)$ for $x\in P$ be the radius of the smallest ball around $x$ containing at least $n / k$ points. A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x\in P$. While good approximation algorithms are known for this problem, no efficient practical algorithms with good theoretical guarantees have been presented. We design the first fast local-search algorithm that runs in $\tilde{O}(nk^2)$ time and obtains a bicriteria $(O(1), 6)$ approximation. Then we show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
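The fairness definition above can be checked by brute force on small inputs. The sketch below computes $\delta(x)$ for every point and tests whether a given set of centers satisfies the condition. It assumes Euclidean distance and counts $x$ itself among the points in its ball; the `slack` parameter and both function names are hypothetical additions for illustration. This is a quadratic-time checker, not the local-search algorithm from the paper.

```python
from math import ceil, dist


def fair_radii(points, k):
    """delta(x): radius of the smallest ball around x holding >= n/k points.

    points: list of coordinate tuples. The point x itself is counted
    (an assumption), so delta(x) is the distance to its ceil(n/k)-th
    nearest point, including itself at distance 0.
    """
    need = ceil(len(points) / k)
    radii = []
    for x in points:
        distances = sorted(dist(x, y) for y in points)
        radii.append(distances[need - 1])
    return radii


def is_individually_fair(points, centers, k, slack=1.0):
    """True if every point has a center within slack * delta(x)."""
    radii = fair_radii(points, k)
    return all(
        min(dist(x, c) for c in centers) <= slack * r
        for x, r in zip(points, radii)
    )
```

With the four one-dimensional points 0, 1, 10, 11 and k = 2, every point has $\delta(x) = 1$, so centers at 0 and 10 are individually fair, while centers at 0 and 1 are not because the points near 10 have no center within distance 1.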

algorithm, dataset, lattanzi and sohler, (14 more...)

2402.0673

Country:

North America > United States > California > Los Angeles County > Long Beach (0.14)
North America > United States > Arizona > Maricopa County > Phoenix (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)