Common Structure Discovery in Collections of Bipartite Networks: Application to Pollination Systems
Lacoste, Louis, Barbillon, Pierre, Donnet, Sophie
Bipartite networks are widely used to encode ecological interactions. Being able to compare the organization of bipartite networks is a first step toward a better understanding of how environmental factors shape community structure and resilience. Yet current methods for structure detection in bipartite networks overlook shared patterns across collections of networks. We introduce the \emph{colBiSBM}, a family of probabilistic models for collections of bipartite networks that extends the classical Latent Block Model (LBM). The proposed framework assumes that networks are independent realizations of a shared mesoscale structure, encoded through common inter-block connectivity parameters. We establish identifiability conditions for the different variants of \emph{colBiSBM} and develop a variational EM algorithm for parameter estimation, coupled with an adaptation of the Integrated Classification Likelihood (ICL) criterion for model selection. We demonstrate how our approach can be used to classify networks based on their topology or organization. Simulation studies highlight the ability of \emph{colBiSBM} to recover common structures, improve clustering performance, and enhance link prediction by borrowing strength across networks. An application to plant--pollinator networks highlights how the method uncovers shared ecological roles and partitions networks into sub-collections with similar connectivity patterns. These results illustrate the methodological and practical advantages of joint modeling over separate network analyses in the study of bipartite systems.
Layer Probing Improves Kinase Functional Prediction with Protein Language Models
Kumar, Ajit, Jha, IndraPrakash
Protein language models (PLMs) have transformed sequence-based protein analysis, yet most applications rely only on final-layer embeddings, which may overlook biologically meaningful information encoded in earlier layers. We systematically evaluate all 33 layers of ESM-2 for kinase functional prediction using both unsupervised clustering and supervised classification. We show that mid-to-late transformer layers (layers 20-33) outperform the final layer by 32 percent in unsupervised Adjusted Rand Index and improve homology-aware supervised accuracy to 75.7 percent. Domain-level extraction, calibrated probability estimates, and a reproducible benchmarking pipeline further strengthen reliability. Our results demonstrate that transformer depth contains functionally distinct biological signals and that principled layer selection significantly improves kinase function prediction.
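The unsupervised comparison above is scored with the Adjusted Rand Index (ARI). As a minimal sketch of how such a per-layer comparison could be computed, the block below implements the standard pair-counting ARI in plain Python and picks the best-scoring layer. The helper `best_layer_by_ari`, the layer keys, and the toy labels are hypothetical illustrations, not the authors' pipeline or data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the labelings trivially agree
    # Count item pairs that share a contingency cell, a true class, or a cluster.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)


def best_layer_by_ari(labels_true, clusterings_by_layer):
    """Hypothetical helper: return (layer, ari) for the best-agreeing clustering."""
    scores = {layer: adjusted_rand_index(labels_true, pred)
              for layer, pred in clusterings_by_layer.items()}
    layer = max(scores, key=scores.get)
    return layer, scores[layer]


# Toy example with made-up labels (not results from the paper).
families = ["A", "A", "B", "B"]
toy_clusterings = {"layer_a": [1, 1, 0, 0], "layer_b": [0, 1, 0, 1]}
print(best_layer_by_ari(families, toy_clusterings))
```

ARI is invariant to how cluster ids are named, which is why `layer_a` scores a perfect 1.0 even though its ids are swapped relative to the family labels.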
Multivariate Variational Autoencoder
Learning latent representations that are simultaneously expressive, geometrically well-structured, and reliably calibrated remains a central challenge for Variational Autoencoders (VAEs). Standard VAEs typically assume a diagonal Gaussian posterior, which simplifies optimization but rules out correlated uncertainty and often yields entangled or redundant latent dimensions. We introduce the Multivariate Variational Autoencoder (MVAE), a tractable full-covariance extension of the VAE that augments the encoder with sample-specific diagonal scales and a global coupling matrix. This induces a multivariate Gaussian posterior of the form $N(\mu_\phi(x), C \operatorname{diag}(\sigma_\phi^2(x)) C^\top)$, enabling correlated latent factors while preserving a closed-form KL divergence and a simple reparameterization path. Beyond likelihood, we propose a multi-criterion evaluation protocol that jointly assesses reconstruction quality (MSE, ELBO), downstream discrimination (linear probes), probabilistic calibration (NLL, Brier, ECE), and unsupervised structure (NMI, ARI). Across Larochelle-style MNIST variants, Fashion-MNIST, and CIFAR-10/100, MVAE consistently matches or outperforms diagonal-covariance VAEs of comparable capacity, with particularly notable gains in calibration and clustering metrics at both low and high latent dimensions. Qualitative analyses further show smoother, more semantically coherent latent traversals and sharper reconstructions. All code, dataset splits, and evaluation utilities are released to facilitate reproducible comparison and future extensions of multivariate posterior models.
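To make the "closed-form KL divergence" and "simple reparameterization path" concrete, here is a short worked derivation. It assumes a standard normal prior $N(0, I)$ and a square, invertible $d \times d$ coupling matrix $C$ with columns $c_j$; the abstract states neither, so both are assumptions made for illustration.

```latex
% Reparameterized sample from the posterior N(mu, Sigma), Sigma = C diag(sigma^2) C^T
z = \mu_\phi(x) + C \bigl( \sigma_\phi(x) \odot \varepsilon \bigr),
\qquad \varepsilon \sim N(0, I_d)

% KL divergence to the (assumed) standard normal prior
\mathrm{KL}\bigl( N(\mu, \Sigma) \,\|\, N(0, I_d) \bigr)
  = \tfrac{1}{2} \Bigl[ \operatorname{tr}(\Sigma) + \mu^\top \mu - d - \log\det\Sigma \Bigr]

% Both Sigma-dependent terms simplify under the factorization
\operatorname{tr}(\Sigma) = \sum_{j=1}^{d} \sigma_j^2 \, \lVert c_j \rVert^2,
\qquad
\log\det\Sigma = 2 \log\lvert\det C\rvert + \sum_{j=1}^{d} \log \sigma_j^2
```

Under these assumptions only $\sigma_\phi(x)$ and $\mu_\phi(x)$ vary per sample, so $\log\lvert\det C\rvert$ and the column norms $\lVert c_j \rVert^2$ can be computed once per batch. Setting $C = I$ recovers the usual diagonal-Gaussian KL term.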
A Hybrid Computational Intelligence Framework for scRNA-seq Imputation: Integrating scRecover and Random Forests
Anaissi, Ali, Liu, Deshao, Jia, Yuanzhe, Huang, Weidong, Alyassine, Widad, Akram, Junaid
Single-cell RNA sequencing (scRNA-seq) enables transcriptomic profiling at cellular resolution but suffers from pervasive dropout events that obscure biological signals. We present SCR-MF, a modular two-stage workflow that combines principled dropout detection using scRecover with robust non-parametric imputation via missForest. Across public and simulated datasets, SCR-MF achieves robust and interpretable performance comparable to or exceeding existing imputation methods in most cases, while preserving biological fidelity and transparency. Runtime analysis demonstrates that SCR-MF provides a competitive balance between accuracy and computational efficiency, making it suitable for mid-scale single-cell datasets.
EVINGCA: Adaptive Graph Clustering with Evolving Neighborhood Statistics
Clustering algorithms often rely on restrictive assumptions: K-Means and Gaussian Mixtures presuppose convex, Gaussian-like clusters, while DBSCAN and HDBSCAN capture non-convexity but can be highly sensitive. I introduce EVINGCA (Evolving Variance-Informed Nonparametric Graph Construction Algorithm), a density-variance based clustering algorithm that treats cluster formation as an adaptive, evolving process on a nearest-neighbor graph. EVINGCA expands rooted graphs via breadth-first search, guided by continuously updated local distance and shape statistics, replacing fixed density thresholds with local statistical feedback. With spatial indexing, EVINGCA features log-linear complexity in the average case and exhibits competitive performance against baselines across a variety of synthetic, real-world, low-dimensional, and high-dimensional datasets. Clustering is central to unsupervised learning, yet classical algorithms face significant structural and scalability limits. Centroid-based methods such as K-Means [19] assume convex, linearly separable clusters, while density-based approaches like DBSCAN [8] or HDBSCAN [4], [21] often struggle under heterogeneous densities and are highly sensitive in higher dimensionality. Graph-based and deep clustering methods offer stronger performance but often demand heavy tuning or incur prohibitive computational cost. I propose EVINGCA (Evolving Variance-Informed Nonparametric Graph Construction Algorithm), an alternative clustering paradigm that models cluster formation as an adaptive, evolving process on a nearest-neighbor graph.
GCAO: Group-driven Clustering via Gravitational Attraction and Optimization
Traditional clustering algorithms often struggle with high-dimensional and non-uniformly distributed data, where low-density boundary samples are easily disturbed by neighboring clusters, leading to unstable and distorted clustering results. To address this issue, we propose a Group-driven Clustering via Gravitational Attraction and Optimization (GCAO) algorithm. GCAO introduces a group-level optimization mechanism that aggregates low-density boundary points into collaboratively moving groups, replacing the traditional point-based contraction process. By combining local density estimation with neighborhood topology, GCAO constructs effective gravitational interactions between groups and their surroundings, enhancing boundary clarity and structural consistency. Using groups as basic motion units, a gravitational contraction strategy ensures globally stable and directionally consistent convergence. Experiments on multiple high-dimensional datasets demonstrate that GCAO outperforms 11 representative clustering methods, achieving average improvements of 37.13%, 52.08%, 44.98%, and 38.81% in NMI, ARI, Homogeneity, and ACC, respectively, while maintaining competitive efficiency and scalability. These results highlight GCAO's superiority in preserving cluster integrity, enhancing boundary separability, and ensuring robust performance on complex data distributions.
Unsupervised Document and Template Clustering using Multimodal Embeddings
Sampaio, Phillipe R., Maxcici, Helene
We study unsupervised clustering of documents at both the category and template levels using frozen multimodal encoders and classical clustering algorithms. We systematize a model-agnostic pipeline that (i) projects heterogeneous last-layer states from text-layout-vision encoders into token-type-aware document vectors and (ii) performs clustering with centroid- or density-based methods, including an HDBSCAN + $k$-NN assignment to eliminate unlabeled points. We evaluate eight encoders (text-only, layout-aware, vision-only, and vision-language) with $k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH on five corpora spanning clean synthetic invoices, their heavily degraded print-and-scan counterparts, scanned receipts, and real identity and certificate documents. The study reveals modality-specific failure modes and a robustness-accuracy trade-off, with vision features nearly solving template discovery on clean pages while text dominates under covariate shift, and fused encoders offering the best balance. We detail a reproducible, oracle-free tuning protocol and the curated evaluation settings to guide future work on unsupervised document organization.
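The "HDBSCAN + $k$-NN assignment" step mentioned above can be illustrated with a small post-processing routine. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes noise points are marked with the label -1 (the usual HDBSCAN convention), uses Euclidean distance, and resolves each noise point by majority vote among its k nearest labeled points. The function name `assign_noise_by_knn` and the toy coordinates are hypothetical.

```python
from collections import Counter
from math import dist


def assign_noise_by_knn(points, labels, k=3, noise_label=-1):
    """Give each noise point the majority label of its k nearest labeled points."""
    labeled = [(p, lbl) for p, lbl in zip(points, labels) if lbl != noise_label]
    result = list(labels)
    if not labeled:
        return result  # no labeled points to vote with; leave labels unchanged
    for i, (p, lbl) in enumerate(zip(points, labels)):
        if lbl != noise_label:
            continue
        # Brute-force nearest-neighbor search over the labeled points.
        nearest = sorted(labeled, key=lambda item: dist(p, item[0]))[:k]
        votes = Counter(neighbor_label for _, neighbor_label in nearest)
        # On a tie, Counter keeps insertion order, so the closest neighbor wins.
        result[i] = votes.most_common(1)[0][0]
    return result


# Toy 2-D "document vectors" with two clusters and two noise points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
              (0.5, 0.5), (10.5, 10.5)]
toy_labels = [0, 0, 0, 1, 1, 1, -1, -1]
print(assign_noise_by_knn(toy_points, toy_labels))
```

The brute-force search is quadratic in the worst case; for real corpora a spatial index or an approximate nearest-neighbor library would likely be preferable.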