Goto

Collaborating Authors

 Performance Analysis


Uncovering symmetric and asymmetric species associations from community and environmental data

arXiv.org Machine Learning

There is no much doubt that biotic interactions shape community assembly and ultimately the spatial co-variations between species. There is a hope that the signal of these biotic interactions can be observed and retrieved by investigating the spatial associations between species while accounting for the direct effects of the environment. By definition, biotic interactions can be both symmetric and asymmetric. Yet, most models that attempt to retrieve species associations from co-occurrence or co-abundance data internally assume symmetric relationships between species. Here, we propose and validate a machine-learning framework able to retrieve bidirectional associations by analyzing species community and environmental data. Our framework (1) models pairwise species associations as directed influences from a source to a target species, parameterized with two species-specific latent embeddings: the effect of the source species on the community, and the response of the target species to the community; and (2) jointly fits these associations within a multi-species conditional generative model with different modes of interactions between environmental drivers and biotic associations. Using both simulated and empirical data, we demonstrate the ability of our framework to recover known asymmetric and symmetric associations and highlight the properties of the learned association networks. By comparing our approach to other existing models such as joint species distribution models and probabilistic graphical models, we show its superior capacity at retrieving symmetric and asymmetric interactions. The framework is intuitive, modular and broadly applicable across various taxonomic groups.


Robust Spatiotemporal Epidemic Modeling with Integrated Adaptive Outlier Detection

arXiv.org Machine Learning

In epidemic modeling, outliers can distort parameter estimation and ultimately lead to misguided public health decisions. Although there are existing robust methods that can mitigate this distortion, the ability to simultaneously detect outliers is equally vital for identifying potential disease hotspots. In this work, we introduce a robust spatiotemporal generalized additive model (RST-GAM) to address this need. We accomplish this with a mean-shift parameter to quantify and adjust for the effects of outliers and rely on adaptive Lasso regularization to model the sparsity of outlying observations. We use univariate polynomial splines and bivariate penalized splines over triangulations to estimate the functional forms and a data-thinning approach for data-adaptive weight construction. We derive a scalable proximal algorithm to estimate model parameters by minimizing a convex negative log-quasi-likelihood function. Our algorithm uses adaptive step-sizes to ensure global convergence of the resulting iterate sequence. We establish error bounds and selection consistency for the estimated parameters and demonstrate our model's effectiveness through numerical studies under various outlier scenarios. Finally, we demonstrate the practical utility of RST-GAM by analyzing county-level COVID-19 infection data in the United States, highlighting its potential to inform public health decision-making.


Simulating Biases for Interpretable Fairness in Offline and Online Classifiers

arXiv.org Machine Learning

Predictive models often reinforce biases which were originally embedded in their training data, through skewed decisions. In such cases, mitigation methods are critical to ensure that, regardless of the prevailing disparities, model outcomes are adjusted to be fair. To assess this, datasets could be systematically generated with specific biases, to train machine learning classifiers. Then, predictive outcomes could aid in the understanding of this bias embedding process. Hence, an agent-based model (ABM), depicting a loan application process that represents various systemic biases across two demographic groups, was developed to produce synthetic datasets. Then, by applying classifiers trained on them to predict loan outcomes, we can assess how biased data leads to unfairness. This highlights a main contribution of this work: a framework for synthetic dataset generation with controllable bias injection. We also contribute with a novel explainability technique, which shows how mitigations affect the way classifiers leverage data features, via second-order Shapley values. In experiments, both offline and online learning approaches are employed. Mitigations are applied at different stages of the modelling pipeline, such as during pre-processing and in-processing.


Information Must Flow: Recursive Bootstrapping for Information Bottleneck in Optimal Transport

arXiv.org Machine Learning

We present the Context-Content Uncertainty Principle (CCUP), a unified framework that models cognition as the directed flow of information between high-entropy context and low-entropy content. Inference emerges as a cycle of bidirectional interactions, bottom-up contextual disambiguation paired with top-down content reconstruction, which resolves the Information Bottleneck in Optimal Transport (iBOT). Implemented via Rao-Blackwellized variational entropy minimization, CCUP steers representations toward minimal joint uncertainty while preserving inferential directionality. Local cycle completion underpins temporal bootstrapping, chaining simulations to refine memory, and spatial bootstrapping, enabling compositional hierarchical inference. We prove a Delta Convergence Theorem showing that recursive entropy minimization yields delta-like attractors in latent space, stabilizing perceptual schemas and motor plans. Temporal bootstrapping through perception-action loops and sleep-wake consolidation further transforms episodic traces into semantic knowledge. Extending CCUP, each hierarchical level performs delta-seeded inference: low-entropy content seeds diffuse outward along goal-constrained paths shaped by top-down priors and external context, confining inference to task-relevant manifolds and circumventing the curse of dimensionality. Building on this, we propose that language emerges as a symbolic transport system, externalizing latent content to synchronize inference cycles across individuals. Together, these results establish iBOT as a foundational principle of information flow in both individual cognition and collective intelligence, positioning recursive inference as the structured conduit through which minds adapt, align, and extend.


Towards High Supervised Learning Utility Training Data Generation: Data Pruning and Column Reordering

arXiv.org Machine Learning

Tabular data synthesis for supervised learning ('SL') model training is gaining popularity in industries such as healthcare, finance, and retail. Despite the progress made in tabular data generators, models trained with synthetic data often underperform compared to those trained with original data. This low SL utility of synthetic data stems from class imbalance exaggeration and SL data relationship overlooked by tabular generator. To address these challenges, we draw inspirations from techniques in emerging data-centric artificial intelligence and elucidate Pruning and ReOrdering ('PRRO'), a novel pipeline that integrates data-centric techniques into tabular data synthesis. PRRO incorporates data pruning to guide the table generator towards observations with high signal-to-noise ratio, ensuring that the class distribution of synthetic data closely matches that of the original data. Besides, PRRO employs a column reordering algorithm to align the data modeling structure of generators with that of SL models. These two modules enable PRRO to optimize SL utility of synthetic data. Empirical experiments on 22 public datasets show that synthetic data generated using PRRO enhances predictive performance compared to data generated without PRRO. Specifically, synthetic replacement of original data yields an average improvement of 26.74% and up to 871.46% improvement using PRRO, while synthetic appendant to original data results with PRRO-generated data results in an average improvement of 6.13% and up to 200.32%. Furthermore, experiments on six highly imbalanced datasets show that PRRO enables the generator to produce synthetic data with a class distribution that resembles the original data more closely, achieving a similarity improvement of 43%. Through PRRO, we foster a seamless integration of data synthesis to subsequent SL prediction, promoting quality and accessible data analysis.


Can You Detect the Difference?

arXiv.org Artificial Intelligence

The rapid advancement of large language models (LLMs) has raised concerns about reliably detecting AI-generated text. Stylometric metrics work well on autoregressive (AR) outputs, but their effectiveness on diffusion-based models is unknown. We present the first systematic comparison of diffusion-generated text (LLaDA) and AR-generated text (LLaMA) using 2 000 samples. Perplexity, burstiness, lexical diversity, readability, and BLEU/ROUGE scores show that LLaDA closely mimics human text in perplexity and burstiness, yielding high false-negative rates for AR-oriented detectors. LLaMA shows much lower perplexity but reduced lexical fidelity. Relying on any single metric fails to separate diffusion outputs from human writing. We highlight the need for diffusion-aware detectors and outline directions such as hybrid models, diffusion-specific stylometric signatures, and robust watermarking.


Privacy-Preserving Multi-Stage Fall Detection Framework with Semi-supervised Federated Learning and Robotic Vision Confirmation

arXiv.org Artificial Intelligence

The aging population is growing rapidly, and so is the danger of falls in older adults. A major cause of injury is falling, and detection in time can greatly save medical expenses and recovery time. However, to provide timely intervention and avoid unnecessary alarms, detection systems must be effective and reliable while addressing privacy concerns regarding the user. In this work, we propose a framework for detecting falls using several complementary systems: a semi-supervised federated learning-based fall detection system (SF2D), an indoor localization and navigation system, and a vision-based human fall recognition system. A wearable device and an edge device identify a fall scenario in the first system. On top of that, the second system uses an indoor localization technique first to localize the fall location and then navigate a robot to inspect the scenario. A vision-based detection system running on an edge device with a mounted camera on a robot is used to recognize fallen people. Each of the systems of this proposed framework achieves different accuracy rates. Specifically, the SF2D has a 0.81% failure rate equivalent to 99.19% accuracy, while the vision-based fallen people detection achieves 96.3% accuracy. However, when we combine the accuracy of these two systems with the accuracy of the navigation system (95% success rate), our proposed framework creates a highly reliable performance for fall detection, with an overall accuracy of 99.99%. Not only is the proposed framework safe for older adults, but it is also a privacy-preserving solution for detecting falls.


SentiDrop: A Multi Modal Machine Learning model for Predicting Dropout in Distance Learning

arXiv.org Artificial Intelligence

School dropout is a serious problem in distance learning, where early detection is crucial for effective intervention and student perseverance. Predicting student dropout using available educational data is a widely researched topic in learning analytics. Our partner's distance learning platform highlights the importance of integrating diverse data sources, including socio-demographic data, behavioral data, and sentiment analysis, to accurately predict dropout risks. In this paper, we introduce a novel model that combines sentiment analysis of student comments using the Bidirectional Encoder Representations from Transformers (BERT) model with socio-demographic and behavioral data analyzed through Extreme Gradient Boosting (XGBoost). We fine-tuned BERT on student comments to capture nuanced sentiments, which were then merged with key features selected using feature importance techniques in XGBoost. Our model was tested on unseen data from the next academic year, achieving an accuracy of 84%, compared to 82% for the baseline model. Additionally, the model demonstrated superior performance in other metrics, such as precision and F1-score. The proposed method could be a vital tool in developing personalized strategies to reduce dropout rates and encourage student perseverance.


Representation learning with a transformer by contrastive learning for money laundering detection

arXiv.org Artificial Intelligence

The present work tackles the money laundering detection problem. A new procedure is introduced which exploits structured time series of both qualitative and quantitative data by means of a transformer neural network. The first step of this procedure aims at learning representations of time series through contrastive learning (without any labels). The second step leverages these representations to generate a money laundering scoring of all observations. A two-thresholds approach is then introduced, which ensures a controlled false-positive rate by means of the Benjamini-Hochberg (BH) procedure. Experiments confirm that the transformer is able to produce general representations that succeed in exploiting money laundering patterns with minimal supervision from domain experts. It also illustrates the higher ability of the new procedure for detecting nonfraudsters as well as fraudsters, while keeping the false positive rate under control. This greatly contrasts with rule-based procedures or the ones based on LSTM architectures.


Bias-Aware Mislabeling Detection via Decoupled Confident Learning

arXiv.org Artificial Intelligence

Reliable data is a cornerstone of modern organizational systems. A notable data integrity challenge stems from label bias, which refers to systematic errors in a label, a covariate that is central to a quantitative analysis, such that its quality differs across social groups. This type of bias has been conceptually and empirically explored and is widely recognized as a pressing issue across critical domains. However, effective methodologies for addressing it remain scarce. In this work, we propose Decoupled Confident Learning (DeCoLe), a principled machine learning based framework specifically designed to detect mislabeled instances in datasets affected by label bias, enabling bias aware mislabelling detection and facilitating data quality improvement. We theoretically justify the effectiveness of DeCoLe and evaluate its performance in the impactful context of hate speech detection, a domain where label bias is a well documented challenge. Empirical results demonstrate that DeCoLe excels at bias aware mislabeling detection, consistently outperforming alternative approaches for label error detection. Our work identifies and addresses the challenge of bias aware mislabeling detection and offers guidance on how DeCoLe can be integrated into organizational data management practices as a powerful tool to enhance data reliability.