Goto

Collaborating Authors

 South America


Improving statistical learning methods via features selection without replacement sampling and random projection

arXiv.org Machine Learning

Cancer is fundamentally a genetic disease characterized by genetic and epigenetic alterations that disrupt normal gene expression, leading to uncontrolled cell growth and metastasis. High-dimensional microarray datasets pose challenges for classification models due to the "small n, large p" problem, resulting in overfitting. This study makes three different key contributions: 1) we propose a machine learning-based approach integrating the Feature Selection Without Re-placement (FSWOR) technique and a projection method to improve classification accuracy. 2) We apply the Kendall statistical test to identify the most significant genes from the brain cancer mi-croarray dataset (GSE50161), reducing the feature space from 54,675 to 20,890 genes.3) we apply machine learning models using k-fold cross validation techniques in which our model incorpo-rates ensemble classifiers with LDA projection and Naïve Bayes, achieving a test score of 96%, outperforming existing methods by 9.09%. The results demonstrate the effectiveness of our ap-proach in high-dimensional gene expression analysis, improving classification accuracy while mitigating overfitting. This study contributes to cancer biomarker discovery, offering a robust computational method for analyzing microarray data.


To Each Metric Its Decoding: Post-Hoc Optimal Decision Rules of Probabilistic Hierarchical Classifiers

arXiv.org Machine Learning

Hierarchical classification offers an approach to incorporate the concept of mistake severity by leveraging a structured, labeled hierarchy. However, decoding in such settings frequently relies on heuristic decision rules, which may not align with task-specific evaluation metrics. In this work, we propose a framework for the optimal decoding of an output probability distribution with respect to a target metric. We derive optimal decision rules for increasingly complex prediction settings, providing universal algorithms when candidates are limited to the set of nodes. In the most general case of predicting a subset of nodes, we focus on rules dedicated to the hierarchical $hF_β$ scores, tailored to hierarchical settings. To demonstrate the practical utility of our approach, we conduct extensive empirical evaluations, showcasing the superiority of our proposed optimal strategies, particularly in underdetermined scenarios. These results highlight the potential of our methods to enhance the performance and reliability of hierarchical classifiers in real-world applications. The code is available at https://github.com/RomanPlaud/hierarchical_decision_rules


Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model

arXiv.org Machine Learning

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, when viewed as `loss + regularization,' the standard RLHF objective lacks theoretical justification and incentivizes degenerate, deterministic solutions, an issue that variants such as Direct Policy Optimization (DPO) also inherit. In this paper, we rethink alignment by framing it as \emph{distribution learning} from pairwise preference feedback by explicitly modeling how information about the target language model bleeds through the preference data. This explicit modeling leads us to propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We theoretically show that all three approaches enjoy strong non-asymptotic $O(1/n)$ convergence to the target language model, naturally avoiding degeneracy and reward overfitting. Finally, we empirically demonstrate that our distribution learning framework, especially preference distillation, consistently outperforms or matches the performances of RLHF and DPO across various tasks and models.


"Who experiences large model decay and why?" A Hierarchical Framework for Diagnosing Heterogeneous Performance Drift

arXiv.org Machine Learning

Machine learning (ML) models frequently experience performance degradation when deployed in new contexts. Such degradation is rarely uniform: some subgroups may suffer large performance decay while others may not. Understanding where and how large differences in performance arise is critical for designing targeted corrective actions that mitigate decay for the most affected subgroups while minimizing any unintended effects. Current approaches do not provide such detailed insight, as they either (i) explain how average performance shifts arise or (ii) identify adversely affected subgroups without insight into how this occurred. To this end, we introduce a Subgroup-scanning Hierarchical Inference Framework for performance drifT (SHIFT). SHIFT first asks "Is there any subgroup with unacceptably large performance decay due to covariate/outcome shifts?" (Where?) and, if so, dives deeper to ask "Can we explain this using more detailed variable(subset)-specific shifts?" (How?). In real-world experiments, we find that SHIFT identifies interpretable subgroups affected by performance decay, and suggests targeted actions that effectively mitigate the decay.


Reasoning-Table: Exploring Reinforcement Learning for Table Reasoning

arXiv.org Artificial Intelligence

Table reasoning, encompassing tasks such as table question answering, fact verification, and text-to-SQL, requires precise understanding of structured tabular data, coupled with numerical computation and code manipulation for effective inference. Supervised fine-tuning (SFT) approaches have achieved notable success but often struggle with generalization and robustness due to biases inherent in imitative learning. We introduce Reasoning-Table, the first application of reinforcement learning (RL) to table reasoning, achieving state-of-the-art performance. Through rigorous data preprocessing, reward design, and tailored training strategies, our method leverages simple rule-based outcome rewards to outperform SFT across multiple benchmarks. Unified training across diverse tasks enables Reasoning-Table to emerge as a robust table reasoning large language model, surpassing larger proprietary models like Claude-3.7-Sonnet by 4.0% on table reasoning benchmarks. The approach also achieves excellent performance on text-to-SQL tasks, reaching 68.3% performance on the BIRD dev dataset with a 7B model. Further experiments demonstrate that Reasoning-Table enhances the model's generalization capabilities and robustness.


Model-agnostic Mitigation Strategies of Data Imbalance for Regression

arXiv.org Artificial Intelligence

Data imbalance persists as a pervasive challenge in regression tasks, introducing bias in model performance and undermining predictive reliability. This is particularly detrimental in applications aimed at predicting rare events that fall outside the domain of the bulk of the training data. In this study, we review the current state-of-the-art regarding sampling-based methods and cost-sensitive learning. Additionally, we propose novel approaches to mitigate model bias. To better asses the importance of data, we introduce the density-distance and density-ratio relevance functions, which effectively integrate empirical frequency of data with domain-specific preferences, offering enhanced interpretability for end-users. Furthermore, we present advanced mitigation techniques (cSMOGN and crbSMOGN), which build upon and improve existing sampling methods. In a comprehensive quantitative evaluation, we benchmark state-of-the-art methods on 10 synthetic and 42 real-world datasets, using neural networks, XGBoosting trees and Random Forest models. Our analysis reveals that while most strategies improve performance on rare samples, they often degrade it on frequent ones. We demonstrate that constructing an ensemble of models -- one trained with imbalance mitigation and another without -- can significantly reduce these negative effects. The key findings underscore the superior performance of our novel crbSMOGN sampling technique with the density-ratio relevance function for neural networks, outperforming state-of-the-art methods.


Detection of Endangered Deer Species Using UAV Imagery: A Comparative Study Between Efficient Deep Learning Approaches

arXiv.org Artificial Intelligence

Personal use of this material is permitted. Abstract -- This study compares the performance of state-of-the-art neural networks including variants of the YOLOv11 and RT -DETR models for detecting marsh deer in UA V imagery, in scenarios where specimens occupy a very small portion of the image and are occluded by vegetation. We extend previous analysis adding precise segmentation masks for our datasets enabling a fine-grained training of a YOLO model with a segmentation head included. Experimental results show the effectiveness of incorporating the segmentation head achieving superior detection performance. This work contributes valuable insights for improving UA V-based wildlife monitoring and conservation strategies through scalable and accurate AI-driven detection systems. Monitoring wildlife is crucial for understanding and preserving ecosystems. Traditional observation methods, however, are often labor-intensive, costly, and constrained in coverage [1], [2].


On the Parallels Between Evolutionary Theory and the State of AI

arXiv.org Artificial Intelligence

This article critically examines the foundational principles of contemporary AI methods, exploring the limitations that hinder its potential. We draw parallels between the modern AI landscape and the 20th-century Modern Synthesis in evolutionary biology, and highlight how advancements in evolutionary theory that augmented the Modern Synthesis, particularly those of Evolutionary Developmental Biology, offer insights that can inform a new design paradigm for AI. By synthesizing findings across AI and evolutionary theory, we propose a pathway to overcome existing limitations, enabling AI to achieve its aspirational goals.


Investigating the effectiveness of multimodal data in forecasting SARS-COV-2 case surges

arXiv.org Machine Learning

The COVID-19 pandemic response relied heavily on statistical and machine learning models to predict key outcomes such as case prevalence and fatality rates. These predictions were instrumental in enabling timely public health interventions that helped break transmission cycles. While most existing models are grounded in traditional epidemiological data, the potential of alternative datasets, such as those derived from genomic information and human behavior, remains underexplored. In the current study, we investigated the usefulness of diverse modalities of feature sets in predicting case surges. Our results highlight the relative effectiveness of biological (e.g., mutations), public health (e.g., case counts, policy interventions) and human behavioral features (e.g., mobility and social media conversations) in predicting country-level case surges. Importantly, we uncover considerable heterogeneity in predictive performance across countries and feature modalities, suggesting that surge prediction models may need to be tailored to specific national contexts and pandemic phases. Overall, our work highlights the value of integrating alternative data sources into existing disease surveillance frameworks to enhance the prediction of pandemic dynamics.


CNN-LSTM Hybrid Model for AI-Driven Prediction of COVID-19 Severity from Spike Sequences and Clinical Data

arXiv.org Artificial Intelligence

The COVID-19 pandemic, caused by SARS-CoV-2, highlighted the critical need for accurate prediction of disease severity to optimize healthcare resource allocation and patient management. The spike protein, which facilitates viral entry into host cells, exhibits high mutation rates, particularly in the receptor-binding domain, influencing viral pathogenicity. Artificial intelligence approaches, such as deep learning, offer promising solutions for leveraging genomic and clinical data to predict disease outcomes. Objective: This study aimed to develop a hybrid CNN-LSTM deep learning model to predict COVID-19 severity using spike protein sequences and associated clinical metadata from South American patients. Methods: We retrieved 9,570 spike protein sequences from the GISAID database, of which 3,467 met inclusion criteria after standardization. The dataset included 2,313 severe and 1,154 mild cases. A feature engineering pipeline extracted features from sequences, while demographic and clinical variables were one-hot encoded. A hybrid CNN-LSTM architecture was trained, combining CNN layers for local pattern extraction and an LSTM layer for long-term dependency modeling. Results: The model achieved an F1 score of 82.92%, ROC-AUC of 0.9084, precision of 83.56%, and recall of 82.85%, demonstrating robust classification performance. Training stabilized at 85% accuracy with minimal overfitting. The most prevalent lineages (P.1, AY.99.2) and clades (GR, GK) aligned with regional epidemiological trends, suggesting potential associations between viral genetics and clinical outcomes. Conclusion: The CNN-LSTM hybrid model effectively predicted COVID-19 severity using spike protein sequences and clinical data, highlighting the utility of AI in genomic surveillance and precision public health. Despite limitations, this approach provides a framework for early severity prediction in future outbreaks.