 Clustering


Combinatorial Approximations for Cluster Deletion: Simpler, Faster, and Better

arXiv.org Artificial Intelligence

Graph clustering is a fundamental task in graph mining where the goal is to partition nodes of a graph into disjoint clusters that have dense internal connections but are only sparsely connected to the rest of the graph. This has a wide variety of applications which include detecting communities in social networks [Fortunato, 2010], identifying related genes in biological networks based on gene expression profiles [Ben-Dor et al., 1999], and finding groups of pixels in an image that belong to the same object [Shi and Malik, 2000]. An idealized notion of a cluster in a graph is a set of nodes that is completely connected internally (i.e., a clique) while being completely disconnected from the rest of the graph. Cluster graph modification problems [Shamir et al., 2004] are a class of graph clustering objectives that seek to edit the edges in a graph as little as possible in order to achieve this idealized structure. One widely studied problem is correlation clustering [Bansal et al., 2004], which can be cast as adding or deleting a minimum number of edges to convert a graph into a disjoint union of cliques. This problem is also known as cluster editing.
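As a rough illustration of the objective described above, the following Python sketch counts how many edges would have to be deleted (between clusters) and added (inside clusters) to turn a graph into the disjoint union of cliques given by a candidate partition. The function name and the edge-list input format are assumptions made for this example; it only scores a given clustering and is not the approximation algorithm from the paper.

```python
from itertools import combinations


def cluster_editing_cost(edges, clusters):
    """Return (deletions, additions) needed so that `clusters` become disjoint cliques.

    edges    -- iterable of (u, v) pairs of an undirected simple graph
    clusters -- list of disjoint node sets covering every endpoint in `edges`
    """
    edge_set = {frozenset(e) for e in edges}
    # Map each node to the index of the cluster that contains it.
    label = {v: i for i, c in enumerate(clusters) for v in c}

    # An edge whose endpoints lie in different clusters must be deleted.
    deletions = sum(1 for e in edge_set if len({label[v] for v in e}) == 2)

    # A missing edge between two nodes of the same cluster must be added.
    additions = sum(
        1
        for c in clusters
        for u, v in combinations(sorted(c), 2)
        if frozenset((u, v)) not in edge_set
    )
    return deletions, additions


# A triangle 1-2-3 attached to the path 3-4-5.
example_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
print(cluster_editing_cost(example_edges, [{1, 2, 3}, {4, 5}]))
print(cluster_editing_cost(example_edges, [{1, 2, 3, 4}, {5}]))
```

Under this scoring, a clustering with zero additions uses deletions only, while cluster editing (correlation clustering) minimizes the sum of both counts.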


Beyond ESM2: Graph-Enhanced Protein Sequence Modeling with Efficient Clustering

arXiv.org Artificial Intelligence

Proteins are essential to life's processes, underpinning evolution and diversity. Advances in sequencing technology have revealed millions of proteins, underscoring the need for sophisticated pre-trained protein models for biological analysis and AI development. Facebook's ESM2, the most advanced protein language model to date, leverages a masked prediction task for unsupervised learning, crafting amino acid representations with notable biochemical accuracy. Yet it falls short in delivering functional protein insights, signaling an opportunity for enhancing representation quality. Our study addresses this gap by incorporating protein family classification into ESM2's training. This approach, augmented with a Community Propagation-Based Clustering Algorithm, improves global protein representations, while a contextual prediction task fine-tunes local amino acid accuracy. Significantly, our model achieved state-of-the-art results in several downstream experiments, demonstrating the power of combining global and local methodologies to substantially boost protein representation quality.


Enhancing Diagnosis through AI-driven Analysis of Reflectance Confocal Microscopy

arXiv.org Artificial Intelligence

Reflectance Confocal Microscopy (RCM) marks a paradigm shift in biomedical imaging, offering a sophisticated, non-invasive technique to acquire high-resolution images of the skin and superficial tissues. Its development [1] represents a milestone in medical imaging, transitioning from early exploratory stages to becoming a cornerstone in clinical dermatology. RCM's capability for in vivo imaging, capturing live tissue images without the need for biopsies or tissue excision, has made it an indispensable tool in modern medical diagnostics. The inception of RCM can be traced back to its early conceptualization, where the need for less invasive, more accurate diagnostic methods in dermatology was recognized. Over the years, the technology has undergone significant advancements, evolving in its design and functionality. This evolution has been marked by improvements in laser source quality, detector sensitivity, and image processing algorithms, resulting in enhanced image clarity and depth of tissue analysis. RCM's operation relies on a focused laser light to illuminate the target tissue. The tissue interaction with this light, primarily through backscattering and reflection, forms the basis of image creation.


SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning

arXiv.org Artificial Intelligence

The pre-trained Large Language Models (LLMs) can be adapted for many downstream tasks and tailored to align with human preferences through fine-tuning. Recent studies have discovered that LLMs can achieve desirable performance with only a small amount of high-quality data, suggesting that a large amount of the data in these extensive datasets is redundant or even harmful. Identifying high-quality data from vast datasets to curate small yet effective datasets has emerged as a critical challenge. In this paper, we introduce SHED, an automated dataset refinement framework based on Shapley value for instruction fine-tuning. SHED eliminates the need for human intervention or the use of commercial LLMs. Moreover, the datasets curated through SHED exhibit transferability, indicating they can be reused across different LLMs with consistently high performance. We conduct extensive experiments to evaluate the datasets curated by SHED. The results demonstrate SHED's superiority over state-of-the-art methods across various tasks and LLMs; notably, datasets comprising only 10% of the original data selected by SHED achieve performance comparable to or surpassing that of the full datasets.
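To make the underlying notion concrete, here is a minimal Python sketch of the exact Shapley value for a tiny set of "players" (for instance, groups of training samples). The abstract does not describe how SHED computes or approximates these values, so the function name, the `utility` callback, and the toy scores below are all assumptions for illustration; exact enumeration over permutations is only feasible for a handful of players.

```python
from itertools import permutations


def shapley_values(players, utility):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    players -- list of hashable items (hypothetical: ids of data groups)
    utility -- function mapping a frozenset of players to a numeric score
               (hypothetical: model quality after training on that subset)
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        previous = utility(coalition)
        for p in order:
            coalition = coalition | {p}
            current = utility(coalition)
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: total / len(orderings) for p, total in totals.items()}


# Toy additive utility: each group contributes a fixed amount to the score.
toy_scores = {"a": 1, "b": 2, "c": 3}
values = shapley_values(list(toy_scores), lambda s: sum(toy_scores[p] for p in s))
print(values)
```

For an additive utility like the toy one, each player's Shapley value equals its own contribution, which is a convenient sanity check before plugging in a more realistic utility.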


Clustering of timed sequences -- Application to the analysis of care pathways

arXiv.org Artificial Intelligence

Improving the future of healthcare starts by better understanding the current actual practices in hospitals. This motivates the objective of discovering typical care pathways from patient data. Revealing homogeneous groups of care pathways can be achieved through clustering. The difficulty in clustering care pathways, represented by sequences of timestamped events, lies in defining a semantically appropriate metric and clustering algorithms. In this article, we adapt two methods developed for time series to time sequences: the drop-DTW metric and the DBA approach for the construction of averaged time sequences. These methods are then applied in clustering algorithms to propose original and sound clustering algorithms for timed sequences. This approach is experimented with and evaluated on synthetic and real use cases.
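For readers unfamiliar with the family of metrics mentioned above, the following Python sketch implements classic dynamic time warping (DTW) between two numeric sequences. It is plain DTW, not the drop-DTW variant or the DBA averaging the article adapts, and it ignores timestamps and event types; the function name and the absolute-difference local cost are assumptions chosen for brevity.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences, using |x - y| as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched again with an earlier b
                cost[i][j - 1],      # b[j-1] is matched again with an earlier a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# The repeated 2 in the second sequence is absorbed by the warping.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([0, 0], [1, 1]))
```

A distance of this kind can then feed any clustering algorithm that works from pairwise dissimilarities.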


Revealing and Utilizing In-group Favoritism for Graph-based Collaborative Filtering

arXiv.org Artificial Intelligence

When it comes to a personalized item recommendation system, it is essential to extract users' preferences and purchasing patterns. Assuming that users in the real world form clusters and that there is common favoritism within each cluster, in this work we introduce Co-Clustering Wrapper (CCW). We compute co-clusters of users and items with co-clustering algorithms and add CF subnetworks for each cluster to extract the in-group favoritism. Combining the features from the networks, we obtain rich and unified information about users. We experimented on real-world datasets, considering two aspects: finding the number of groups divided according to in-group preference, and measuring the amount of performance improvement.


Variational Deep Survival Machines: Survival Regression with Censored Outcomes

arXiv.org Artificial Intelligence

Survival regression aims to predict the time when an event of interest will take place, typically a death or a failure. A fully parametric method [18] has been proposed to estimate the survival function as a mixture of individual parametric distributions in the presence of censoring. In this paper, we present a novel method to predict the survival time by better clustering the survival data and combining primitive distributions. We propose two variants of the variational auto-encoder (VAE), discrete and continuous, to generate the latent variables for clustering input covariates. The model is trained end to end by jointly optimizing the VAE loss and the regression loss. Thorough experiments on the SUPPORT and FLCHAIN datasets show that our method can effectively improve the clustering result and reach scores competitive with previous methods. We demonstrate the superior results of our model's long-term predictions. Our code is available at https://github.com/


Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have experienced widespread adoption across scientific and industrial domains due to their versatility and utility for diverse tasks. Nevertheless, deploying and serving these models at scale with optimal throughput and latency remains a significant challenge, primarily because of the high computational and memory demands associated with LLMs. To tackle this limitation, we introduce Expert Router, a system designed to orchestrate multiple expert models efficiently, thereby enhancing scalability. Expert Router is a parallel inference system with a central routing gateway that distributes incoming requests using a clustering method. This approach effectively partitions incoming requests among available LLMs, maximizing overall throughput. Our extensive evaluations encompassed up to 1,000 concurrent users, providing comprehensive insights into the system's behavior from user and infrastructure perspectives. The results demonstrate Expert Router's effectiveness in handling high-load scenarios and achieving higher throughput rates, particularly under many concurrent users.
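The routing idea can be sketched in a few lines of Python: assign each incoming request to the expert whose cluster centroid is closest to the request's embedding. The abstract does not specify the clustering method, the embedding model, or the gateway's interface, so the function name, the Euclidean distance, and the two-dimensional toy centroids here are all assumptions.

```python
import math


def route(embedding, centroids):
    """Return the index of the centroid (hypothetical expert) nearest to `embedding`."""
    return min(
        range(len(centroids)),
        key=lambda k: math.dist(embedding, centroids[k]),
    )


# Two hypothetical experts, each represented by a precomputed cluster centroid.
expert_centroids = [(0.0, 0.0), (10.0, 10.0)]
print(route((1.0, 2.0), expert_centroids))
print(route((9.0, 8.0), expert_centroids))
```

In a real gateway the centroids would come from clustering prompt embeddings offline, and the returned index would select which model instance serves the request.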


STROOBnet Optimization via GPU-Accelerated Proximal Recurrence Strategies

arXiv.org Artificial Intelligence

Spatiotemporal networks' observational capabilities are crucial for accurate data gathering and informed decisions across multiple sectors. This study focuses on the Spatiotemporal Ranged Observer-Observable Bipartite Network (STROOBnet), linking observational nodes (e.g., surveillance cameras) to events within defined geographical regions, enabling efficient monitoring. Using data from Real-Time Crime Camera (RTCC) systems and Calls for Service (CFS) in New Orleans, where RTCC combats rising crime amidst reduced police presence, we address the network's initial observational imbalances. Aiming for uniform observational efficacy, we propose the Proximal Recurrence approach. It outperformed traditional clustering methods like k-means and DBSCAN by offering holistic event frequency and spatial consideration, enhancing observational coverage.


Research on Robot Path Planning Based on Reinforcement Learning

arXiv.org Artificial Intelligence

This project studies robot path planning based on Visual SLAM. The main work is as follows: (1) Construction of a Visual SLAM system. The basic architecture of Visual SLAM is studied, and a Visual SLAM system capable of dense point cloud mapping is developed based on the ORB-SLAM3 system. (2) Map conversion to obtain a map suitable for two-dimensional path planning. This part converts the dense point cloud map produced by the Visual SLAM system into an octomap and then projects it onto a grid map. The conversion turns a dense point cloud map containing a large amount of redundant information into an extremely lightweight grid map suitable for path planning. (3) Research on path planning algorithms based on reinforcement learning. This project compares the Q-learning, DQN, and SARSA algorithms experimentally and finds that DQN converges fastest and performs best in high-dimensional complex environments. The Visual SLAM system is verified experimentally in a simulation environment. The results obtained on an open-source dataset and a self-made dataset demonstrate the feasibility and effectiveness of the designed Visual SLAM system. The three reinforcement learning algorithms are also compared under the same experimental conditions to identify the best algorithm under those conditions.
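As an illustration of the simplest of the three algorithms compared above, the following Python sketch runs tabular Q-learning on a small occupancy grid (0 = free, 1 = obstacle). The reward values, hyperparameters, action set, and function name are assumptions made for this example and are not taken from the project; DQN and SARSA are not shown.

```python
import random


def q_learning(grid, start, goal, episodes=500, alpha=0.5, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning on a 2D grid; returns a dict state -> list of 4 action values."""
    rng = random.Random(seed)
    rows, cols = len(grid), len(grid[0])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    q = {(r, c): [0.0] * 4 for r in range(rows) for c in range(cols)}

    def step(state, action):
        r, c = state[0] + moves[action][0], state[1] + moves[action][1]
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == 1:
            return state, -1.0      # blocked move: stay in place, assumed penalty
        if (r, c) == goal:
            return (r, c), 10.0     # assumed goal reward
        return (r, c), -0.1         # assumed small cost per step

    for _ in range(episodes):
        state = start
        for _ in range(200):        # cap on episode length
            if rng.random() < epsilon:
                action = rng.randrange(4)                         # explore
            else:
                action = max(range(4), key=lambda k: q[state][k])  # exploit
            nxt, reward = step(state, action)
            # Q-learning update: bootstrap from the best action in the next state.
            q[state][action] += alpha * (
                reward + gamma * max(q[nxt]) - q[state][action]
            )
            state = nxt
            if state == goal:
                break
    return q


# A 1x3 corridor: the learned greedy action should be "right" (index 3).
table = q_learning([[0, 0, 0]], start=(0, 0), goal=(0, 2))
print(max(range(4), key=lambda k: table[(0, 0)][k]))
```

A grid map produced by the map conversion step could be passed in as `grid`, with the greedy action at each cell read off the returned table to trace a path.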