Goto

Collaborating Authors

 Ensemble Learning


Design of an Edge-based Portable EHR System for Anemia Screening in Remote Health Applications

arXiv.org Artificial Intelligence

The design of medical systems for remote, resource-limited environments faces persistent challenges due to poor interoperability, lack of offline support, and dependency on costly infrastructure. Many existing digital health solutions neglect these constraints, limiting their effectiveness for frontline health workers in underserved regions. This paper presents a portable, edge-enabled Electronic Health Record platform optimized for offline-first operation, secure patient data management, and modular diagnostic integration. Running on small-form factor embedded devices, it provides AES-256 encrypted local storage with optional cloud synchronization for interoperability. As a use case, we integrated a non-invasive anemia screening module leveraging fingernail pallor analysis. Trained on 250 patient cases (27\% anemia prevalence) with KDE-balanced data, the Random Forest model achieved a test RMSE of 1.969 g/dL and MAE of 1.490 g/dL. A severity-based model reached 79.2\% sensitivity. To optimize performance, a YOLOv8n-based nail bed detector was quantized to INT8, reducing inference latency from 46.96 ms to 21.50 ms while maintaining mAP@0.5 at 0.995. The system emphasizes low-cost deployment, modularity, and data privacy compliance (HIPAA/GDPR), addressing critical barriers to digital health adoption in disconnected settings. Our work demonstrates a scalable approach to enhance portable health information systems and support frontline healthcare in underserved regions.


Automated Vigilance State Classification in Rodents Using Machine Learning and Feature Engineering

arXiv.org Artificial Intelligence

Preclinical sleep research remains constrained by labor intensive, manual vigilance state classification and inter rater variability, limiting throughput and reproducibility. This study presents an automated framework developed by Team Neural Prognosticators to classify electroencephalogram (EEG) recordings of small rodents into three critical vigilance states paradoxical sleep (REM), slow wave sleep (SWS), and wakefulness. The system integrates advanced signal processing with machine learning, leveraging engineered features from both time and frequency domains, including spectral power across canonical EEG bands (delta to gamma), temporal dynamics via Maximum-Minimum Distance, and cross-frequency coupling metrics. These features capture distinct neurophysiological signatures such as high frequency desynchronization during wakefulness, delta oscillations in SWS, and REM specific bursts. Validated during the 2024 Big Data Health Science Case Competition (University of South Carolina Big Data Health Science Center, 2024), our XGBoost model achieved 91.5% overall accuracy, 86.8% precision, 81.2% recall, and an F1 score of 83.5%, outperforming all baseline methods. Our approach represents a critical advancement in automated sleep state classification and a valuable tool for accelerating discoveries in sleep science and the development of targeted interventions for chronic sleep disorders. As a publicly available code (BDHSC) resource is set to contribute significantly to advancements.


Honesty in Causal Forests: When It Helps and When It Hurts

arXiv.org Machine Learning

Causal forests have become a popular tool for estimating how treatment effects vary across individuals (Wager and Athey, 2018). They are used in a growing number of domains--including marketing, operations, economics, and public policy--to personalize interventions and inform targeting strategies. Since 2019, dozens of papers in INFORMS journals alone have applied causal forests to experimental or observational data (see Appendix C), often with the goal of estimating individual-level treatment effects. The method builds on a familiar idea: instead of estimating a single average effect for the whole population, we split the population into subgroups based on observed features and estimate effects within each group. This is conceptually similar to how random forests estimate outcomes, except now the goal is to estimate causal effects. But there is a crucial modeling difference: unlike random forests, which typically use the full training data for both splitting and estimation, causal forests often divide the training data in two--using one part to decide how to form the subgroups, and the other to estimate effects within them. This practice, known as honest estimation, is meant to prevent overfitting and selection bias (Athey and Imbens, 2016). It is the default in widely used software packages such as grf (Athey et al., 2019) and EconML (Battocchi et al., 2019), and is commonly recommended in applied research. But is this default always a good idea? 1


Predicting Delayed Trajectories Using Network Features: A Study on the Dutch Railway Network

arXiv.org Artificial Intelligence

The Dutch railway network is one of the busiest in the world, with delays being a prominent concern for the principal passenger railway operator NS. This research addresses a gap in delay prediction studies within the Dutch railway network by employing an XGBoost Classifier with a focus on topological features. Current research predominantly emphasizes short-term predictions and neglects the broader network-wide patterns essential for mitigating ripple effects. This research implements and improves an existing methodology, originally designed to forecast the evolution of the fast-changing US air network, to predict delays in the Dutch Railways. By integrating Node Centrality Measures and comparing multiple classifiers like RandomForest, DecisionTree, GradientBoosting, AdaBoost, and LogisticRegression, the goal is to predict delayed trajectories. However, the results reveal limited performance, especially in non-simultaneous testing scenarios, suggesting the necessity for more context-specific adaptations. Regardless, this research contributes to the understanding of transportation network evaluation and proposes future directions for developing more robust predictive models for delays.


Robust Semi-Supervised CT Radiomics for Lung Cancer Prognosis: Cost-Effective Learning with Limited Labels and SHAP Interpretation

arXiv.org Artificial Intelligence

Background: CT imaging is vital for lung cancer management, offering detailed visualization for AI-based prognosis. However, supervised learning SL models require large labeled datasets, limiting their real-world application in settings with scarce annotations. Methods: We analyzed CT scans from 977 patients across 12 datasets extracting 1218 radiomics features using Laplacian of Gaussian and wavelet filters via PyRadiomics Dimensionality reduction was applied with 56 feature selection and extraction algorithms and 27 classifiers were benchmarked A semi supervised learning SSL framework with pseudo labeling utilized 478 unlabeled and 499 labeled cases Model sensitivity was tested in three scenarios varying labeled data in SL increasing unlabeled data in SSL and scaling both from 10 percent to 100 percent SHAP analysis was used to interpret predictions Cross validation and external testing in two cohorts were performed. Results: SSL outperformed SL, improving overall survival prediction by up to 17 percent. The top SSL model, Random Forest plus XGBoost classifier, achieved 0.90 accuracy in cross-validation and 0.88 externally. SHAP analysis revealed enhanced feature discriminability in both SSL and SL, especially for Class 1 survival greater than 4 years. SSL showed strong performance with only 10 percent labeled data, with more stable results compared to SL and lower variance across external testing, highlighting SSL's robustness and cost effectiveness. Conclusion: We introduced a cost-effective, stable, and interpretable SSL framework for CT-based survival prediction in lung cancer, improving performance, generalizability, and clinical readiness by integrating SHAP explainability and leveraging unlabeled data.


SentiDrop: A Multi Modal Machine Learning model for Predicting Dropout in Distance Learning

arXiv.org Artificial Intelligence

School dropout is a serious problem in distance learning, where early detection is crucial for effective intervention and student perseverance. Predicting student dropout using available educational data is a widely researched topic in learning analytics. Our partner's distance learning platform highlights the importance of integrating diverse data sources, including socio-demographic data, behavioral data, and sentiment analysis, to accurately predict dropout risks. In this paper, we introduce a novel model that combines sentiment analysis of student comments using the Bidirectional Encoder Representations from Transformers (BERT) model with socio-demographic and behavioral data analyzed through Extreme Gradient Boosting (XGBoost). We fine-tuned BERT on student comments to capture nuanced sentiments, which were then merged with key features selected using feature importance techniques in XGBoost. Our model was tested on unseen data from the next academic year, achieving an accuracy of 84%, compared to 82% for the baseline model. Additionally, the model demonstrated superior performance in other metrics, such as precision and F1-score. The proposed method could be a vital tool in developing personalized strategies to reduce dropout rates and encourage student perseverance.


Cryptogenic stroke and migraine: using probabilistic independence and machine learning to uncover latent sources of disease from the electronic health record

arXiv.org Artificial Intelligence

Migraine is a common but complex neurological disorder that doubles the lifetime risk of cryptogenic stroke (CS). However, this relationship remains poorly characterized, and few clinical guidelines exist to reduce this associated risk. We therefore propose a data-driven approach to extract probabilistically-independent sources from electronic health record (EHR) data and create a 10-year risk-predictive model for CS in migraine patients. These sources represent external latent variables acting on the causal graph constructed from the EHR data and approximate root causes of CS in our population. A random forest model trained on patient expressions of these sources demonstrated good accuracy (ROC 0.771) and identified the top 10 most predictive sources of CS in migraine patients. These sources revealed that pharmacologic interventions were the most important factor in minimizing CS risk in our population and identified a factor related to allergic rhinitis as a potential causative source of CS in migraine patients.


Stress Monitoring in Healthcare: An Ensemble Machine Learning Framework Using Wearable Sensor Data

arXiv.org Artificial Intelligence

Healthcare professionals, particularly nurses, face elevated occupational stress, a concern amplified during the COVID-19 pandemic. While wearable sensors offer promising avenues for real-time stress monitoring, existing studies often lack comprehensive datasets and robust analytical frameworks. This study addresses these gaps by introducing a multimodal dataset comprising physiological signals, electrodermal activity, heart rate and skin temperature. A systematic literature review identified limitations in prior stress-detection methodologies, particularly in handling class imbalance and optimizing model generalizability. To overcome these challenges, the dataset underwent preprocessing with the Synthetic Minority Over sampling Technique (SMOTE), ensuring balanced representation of stress states. Advanced machine learning models including Random Forest, XGBoost and a Multi-Layer Perceptron (MLP) were evaluated and combined into a Stacking Classifier to leverage their collective predictive strengths. By using a publicly accessible dataset and a reproducible analytical pipeline, this work advances the development of deployable stress-monitoring systems, offering practical implications for safeguarding healthcare workers' mental health. Future research directions include expanding demographic diversity and exploring edge-computing implementations for low latency stress alerts.


A Machine Learning Framework for Breast Cancer Treatment Classification Using a Novel Dataset

arXiv.org Artificial Intelligence

Breast cancer (BC) remains a significant global health challenge, with personalized treatment selection complicated by the disease's molecular and clinical heterogeneity. BC treatment decisions rely on various patient-specific clinical factors, and machine learning (ML) offers a powerful approach to predicting treatment outcomes. This study utilizes The Cancer Genome Atlas (TCGA) breast cancer clinical dataset to develop ML models for predicting the likelihood of undergoing chemotherapy or hormonal therapy. The models are trained using five-fold cross-validation and evaluated through performance metrics, including accuracy, precision, recall, specificity, sensitivity, F1-score, and area under the receiver operating characteristic curve (AUROC). Model uncertainty is assessed using bootstrap techniques, while SHAP values enhance interpretability by identifying key predictors. Among the tested models, the Gradient Boosting Machine (GBM) achieves the highest stable performance (accuracy = 0.7718, AUROC = 0.8252), followed by Extreme Gradient Boosting (XGBoost) (accuracy = 0.7557, AUROC = 0.8044) and Adaptive Boosting (AdaBoost) (accuracy = 0.7552, AUROC = 0.8016). These findings underscore the potential of ML in supporting personalized breast cancer treatment decisions through data-driven insights.


The cost of ensembling: is it always worth combining?

arXiv.org Artificial Intelligence

Given the continuous increase in dataset sizes and the complexity of forecasting models, the trade-off between forecast accuracy and computational cost is emerging as an extremely relevant topic, especially in the context of ensemble learning for time series forecasting. To asses it, we evaluated ten base models and eight ensemble configurations across two large-scale retail datasets (M5 and VN1), considering both point and probabilistic accuracy under varying retraining frequencies. We showed that ensembles consistently improve forecasting performance, particularly in probabilistic settings. However, these gains come at a substantial computational cost, especially for larger, accuracy-driven ensembles. We found that reducing retraining frequency significantly lowers costs, with minimal impact on accuracy, particularly for point forecasts. Moreover, efficiency-driven ensembles offer a strong balance, achieving competitive accuracy with considerably lower costs compared to accuracy-optimized combinations. Most importantly, small ensembles of two or three models are often sufficient to achieve near-optimal results. These findings provide practical guidelines for deploying scalable and cost-efficient forecasting systems, supporting the broader goals of sustainable AI in forecasting. Overall, this work shows that careful ensemble design and retraining strategy selection can yield accurate, robust, and cost-effective forecasts suitable for real-world applications.