Goto

Collaborating Authors

 Performance Analysis


PCA-Guided Quantile Sampling: Preserving Data Structure in Large-Scale Subsampling

arXiv.org Machine Learning

We introduce Principal Component Analysis guided Quantile Sampling (PCA QS), a novel sampling framework designed to preserve both the statistical and geometric structure of large scale datasets. Unlike conventional PCA, which reduces dimensionality at the cost of interpretability, PCA QS retains the original feature space while using leading principal components solely to guide a quantile based stratification scheme. This principled design ensures that sampling remains representative without distorting the underlying data semantics. We establish rigorous theoretical guarantees, deriving convergence rates for empirical quantiles, Kullback Leibler divergence, and Wasserstein distance, thus quantifying the distributional fidelity of PCA QS samples. Practical guidelines for selecting the number of principal components, quantile bins, and sampling rates are provided based on these results. Extensive empirical studies on both synthetic and real-world datasets show that PCA QS consistently outperforms simple random sampling, yielding better structure preservation and improved downstream model performance. Together, these contributions position PCA QS as a scalable, interpretable, and theoretically grounded solution for efficient data summarization in modern machine learning workflows.


Neural Total Variation Distance Estimators for Changepoint Detection in News Data

arXiv.org Artificial Intelligence

Detecting when public discourse shifts in response to major events is crucial for understanding societal dynamics. Real-world data is high-dimensional, sparse, and noisy, making changepoint detection in this domain a challenging endeavor. In this paper, we leverage neural networks for changepoint detection in news data, introducing a method based on the so-called learning-by-confusion scheme, which was originally developed for detecting phase transitions in physical systems. We train classifiers to distinguish between articles from different time periods. The resulting classification accuracy is used to estimate the total variation distance between underlying content distributions, where significant distances highlight changepoints. We demonstrate the effectiveness of this method on both synthetic datasets and real-world data from The Guardian newspaper, successfully identifying major historical events including 9/11, the COVID-19 pandemic, and presidential elections. Our approach requires minimal domain knowledge, can autonomously discover significant shifts in public discourse, and yields a quantitative measure of change in content, making it valuable for journalism, policy analysis, and crisis monitoring.


HIDE and Seek: Detecting Hallucinations in Language Models via Decoupled Representations

arXiv.org Artificial Intelligence

Contemporary Language Models (LMs), while impressively fluent, often generate content that is factually incorrect or unfaithful to the input context - a critical issue commonly referred to as 'hallucination'. This tendency of LMs to generate hallucinated content undermines their reliability, especially because these fabrications are often highly convincing and therefore difficult to detect. While several existing methods attempt to detect hallucinations, most rely on analyzing multiple generations per input, leading to increased computational cost and latency. To address this, we propose a single-pass, training-free approach for effective Hallucination detectIon via Decoupled rEpresentations (HIDE). Our approach leverages the hypothesis that hallucinations result from a statistical decoupling between an LM's internal representations of input context and its generated output. We quantify this decoupling using the Hilbert-Schmidt Independence Criterion (HSIC) applied to hidden-state representations extracted while generating the output sequence. We conduct extensive experiments on four diverse question answering datasets, evaluating both faithfulness and factuality hallucinations across six open-source LMs of varying scales and properties. Our results demonstrate that HIDE outperforms other single-pass methods in almost all settings, achieving an average relative improvement of ~29% in AUC-ROC over the best-performing single-pass strategy across various models and datasets. Additionally, HIDE shows competitive and often superior performance with multi-pass state-of-the-art methods, obtaining an average relative improvement of ~3% in AUC-ROC while consuming ~51% less computation time. Our findings highlight the effectiveness of exploiting internal representation decoupling in LMs for efficient and practical hallucination detection.


CLOUD: A Scalable and Physics-Informed Foundation Model for Crystal Representation Learning

arXiv.org Artificial Intelligence

The prediction of crystal properties is essential for understanding structure-property relationships and accelerating the discovery of functional materials. However, conventional approaches relying on experimental measurements or density functional theory (DFT) calculations are often resource-intensive, limiting their scalability. Machine learning (ML) models offer a promising alternative by learning complex structure-property relationships from data, enabling faster predictions. Yet, existing ML models often rely on labeled data, adopt representations that poorly capture essential structural characteristics, and lack integration with physical principles--factors that limit their generalizability and interpretability. Here, we introduce CLOUD (Crystal Language mOdel for Unified and Differentiable materials modeling), a transformer-based framework trained on a novel Symmetry-Consistent Ordered Parameter Encoding (SCOPE) that encodes crystal symmetry, Wyckoff positions, and composition in a compact, coordinate-free string representation. Pre-trained on over six million crystal structures, CLOUD is fine-tuned on multiple downstream tasks and achieves competitive performance in predicting a wide range of material properties, demonstrating strong scaling performance. Furthermore, as proof of concept of differentiable materials modeling, CLOUD is applied to predict the phonon internal energy and heat capacity, which integrates the Debye model to preserve thermodynamic consistency. The CLOUD-DEBYE framework enforces thermodynamic consistency and enables temperature-dependent property prediction without requiring additional data. These results demonstrate the potential of CLOUD as a scalable and physics-informed foundation model for crystalline materials, unifying symmetry-consistent representations with physically grounded learning for property prediction and materials discovery.


On the Performance of Cyber-Biomedical Features for Intrusion Detection in Healthcare 5.0

arXiv.org Artificial Intelligence

Healthcare 5.0 integrates Artificial Intelligence (AI), the Internet of Things (IoT), real-time monitoring, and human-centered design toward personalized medicine and predictive diagnostics. However, the increasing reliance on interconnected medical technologies exposes them to cyber threats. Meanwhile, current AI-driven cybersecurity models often neglect biomedical data, limiting their effectiveness and interpretability. This study addresses this gap by applying eXplainable AI (XAI) to a Healthcare 5.0 dataset that integrates network traffic and biomedical sensor data. Classification outputs indicate that XGBoost achieved 99% F1-score for benign and data alteration, and 81% for spoofing. Explainability findings reveal that network data play a dominant role in intrusion detection whereas biomedical features contributed to spoofing detection, with temperature reaching a Shapley values magnitude of 0.37.


Efficient Malware Detection with Optimized Learning on High-Dimensional Features

arXiv.org Artificial Intelligence

Malware detection using machine learning requires feature extraction from binary files, as models cannot process raw binaries directly. A common approach involves using LIEF for raw feature extraction and the EMBER vectorizer to generate 2381-dimensional feature vectors. However, the high dimensionality of these features introduces significant computational challenges. This study addresses these challenges by applying two dimensionality reduction techniques: XGBoost-based feature selection and Principal Component Analysis (PCA). We evaluate three reduced feature dimensions (128, 256, and 384), which correspond to approximately 5.4%, 10.8%, and 16.1% of the original 2381 features, across four models-XGBoost, LightGBM, Extra Trees, and Random Forest-using a unified training, validation, and testing split formed from the EMBER-2018, ERMDS, and BODMAS datasets. This approach ensures generalization and avoids dataset bias. Experimental results show that LightGBM trained on the 384-dimensional feature set after XGBoost feature selection achieves the highest accuracy of 97.52% on the unified dataset, providing an optimal balance between computational efficiency and detection performance. The best model, trained in 61 minutes using 30 GB of RAM and 19.5 GB of disk space, generalizes effectively to completely unseen datasets, maintaining 95.31% accuracy on TRITIUM and 93.98% accuracy on INFERNO. These findings present a scalable, compute-efficient approach for malware detection without compromising accuracy.


CF-VLM:CounterFactual Vision-Language Fine-tuning

arXiv.org Artificial Intelligence

Recent advances in vision-language models (VLMs) have greatly improved cross-modal semantic understanding, yet significant limitations remain in fine-grained discrimination and deep causal reasoning tasks. Existing VLMs often rely on superficial statistical correlations, lacking the ability to capture the underlying causal logic between visual and textual content. To address this, we propose CounterFactual Vision-Language Fine-tuning (CF-VLM), a novel framework that enhances the causal reasoning capabilities of VLMs through the targeted use of counterfactual samples. CF-VLM introduces three complementary training objectives: maintaining foundational cross-modal alignment, reinforcing the uniqueness and stability of factual scene representations against coherent counterfactuals, and sharpening the model's sensitivity to minimal but critical causal edits. Extensive experiments demonstrate that CF-VLM consistently outperforms strong baselines and state-of-the-art methods on compositional reasoning and generalization benchmarks. Furthermore, it shows promise in mitigating visual hallucinations, indicating improved factual consistency. Our CF-VLM provides a robust foundation for deploying VLMs in high-stakes, real-world scenarios requiring reliable reasoning and interpretability.


Improving Prediction Certainty Estimation for Reliable Early Exiting via Null Space Projection

arXiv.org Artificial Intelligence

Early exiting has demonstrated great potential in accelerating the inference of pre-trained language models (PLMs) by enabling easy samples to exit at shallow layers, eliminating the need for executing deeper layers. However, existing early exiting methods primarily rely on class-relevant logits to formulate their exiting signals for estimating prediction certainty, neglecting the detrimental influence of class-irrelevant information in the features on prediction certainty. This leads to an overestimation of prediction certainty, causing premature exiting of samples with incorrect early predictions. To remedy this, we define an NSP score to estimate prediction certainty by considering the proportion of class-irrelevant information in the features. On this basis, we propose a novel early exiting method based on the Certainty-Aware Probability (CAP) score, which integrates insights from both logits and the NSP score to enhance prediction certainty estimation, thus enabling more reliable exiting decisions. The experimental results on the GLUE benchmark show that our method can achieve an average speed-up ratio of 2.19x across all tasks with negligible performance degradation, surpassing the state-of-the-art (SOTA) ConsistentEE by 28%, yielding a better trade-off between task performance and inference efficiency. The code is available at https://github.com/He-Jianing/NSP.git.


TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking

arXiv.org Artificial Intelligence

By effectively merging these two retrieval paradigms, the system would be capable of assembling a more comprehensive evidence base, thereby reducing the likelihood of missing pertinent details. Additionally, incorporating incremental graph update techniques would enable TrumorGPT to seamlessly integrate new medical studies and real-time health news without the need for extensive re-indexing or system downtime. This continuous update process is particularly crucial in the dynamic field of health, where the rapid emergence of new data can significantly impact the accuracy of fact-checking outcomes. In addition to efficient updating, implementing a dual-level retrieval strategy can further enhance contextual reasoning. Under this strategy, an initial coarse-grained retrieval would rapidly identify broad thematic and relational contexts, while a subsequent fine-grained search would extract specific factual details. This layered retrieval approach not only ensures that both high-level and granular information is captured but also supports more robust multi-hop reasoning by effectively bridging the gap between abstract concepts and concrete facts. Thus, these enhancements would bolster the fact-checking framework of TrumorGPT, striking an optimal balance between precision, efficiency, and comprehensive reasoning.


Novel Multicolumn Kernel Extreme Learning Machine for Food Detection via Optimal Features from CNN

arXiv.org Artificial Intelligence

Automatic food detection is an emerging topic of interest due to its wide array of applications ranging from detecting food images on social media platforms to filtering non-food photos from the users in dietary assessment apps. Recently, during the COVID-19 pandemic, it has facilitated enforcing an eating ban by automatically detecting eating activities from cameras in public places. Therefore, to tackle the challenge of recognizing food images with high accuracy, we proposed the idea of a hybrid framework for extracting and selecting optimal features from an efficient neural network. There on, a nonlinear classifier is employed to discriminate between linearly inseparable feature vectors with great precision. In line with this idea, our method extracts features from MobileNetV3, selects an optimal subset of attributes by using Shapley Additive ex-Planations (SHAP) values, and exploits kernel extreme learning machine (KELM) due to its nonlinear decision boundary and good generalization ability. However, KELM suffers from the'curse of dimensionality problem' for large datasets due to the complex computation of kernel matrix with large numbers of hidden nodes. We solved this problem by proposing a novel multicolumn kernel extreme learning machine (MCK-ELM) which exploited the k-d tree algorithm to divide data into N subsets and trains separate KELM on each subset of data. Then, the method incorporates KELM classifiers into parallel structures and selects the top k nearest subsets during testing by using the k-d tree search for classifying input instead of the whole network. Experimental results showed the superiority of our method on an integrated set of measures while solving the problem of'curse of dimensionality in KELM for large datasets. Keywords: Multicolumn, Kernel Extreme Learning Machine, MobileNet, Food Detection, Explainable AI, SHAP 1. Introduction Automatic detection of food images applications includes visual-based dietary assessment and detecting eating activities from wearable camera photos. The visual-based dietary assessment method reduces the burden of manually collecting the food data by helping users in refreshing their memory using the food pictures of the previous dietary intake. Filtering of non-food images from users is an essential step in these mHealth apps to ensure relevant images are analyzed.