Performance Analysis
Towards Long Context Hallucination Detection
Liu, Siyi, Halder, Kishaloy, Qi, Zheng, Xiao, Wei, Pappas, Nikolaos, Htut, Phu Mon, John, Neha Anna, Benajiba, Yassine, Roth, Dan
Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, they are prone to contextual hallucination, generating information that is either unsubstantiated or contradictory to the given context. Although many studies have investigated contextual hallucinations in LLMs, addressing them in long-context inputs remains an open problem. In this work, we take an initial step toward solving this problem by constructing a dataset specifically designed for long-context hallucination detection. Furthermore, we propose a novel architecture that enables pre-trained encoder models, such as BERT, to process long contexts and effectively detect contextual hallucinations through a decomposition and aggregation mechanism. Our experimental results show that the proposed architecture significantly outperforms previous models of similar size as well as LLM-based models across various metrics, while providing substantially faster inference.
GLaMoR: Consistency Checking of OWL Ontologies using Graph Language Models
Mรผcke, Justin, Scherp, Ansgar
--Semantic reasoning aims to infer new knowledge from existing knowledge, with OWL ontologies serving as a standardized framework for organizing information. A key challenge in semantic reasoning is verifying ontology consistency. However, state-of-the-art reasoners are computationally expensive, and their efficiency decreases as ontology sizes grow. While classical machine learning models have been explored for consistency checking, they struggle to capture complex relationships within ontologies. Large language models (LLMs) have shown promising results for simple reasoning tasks but perform poorly on structured reasoning. The recently introduced Graph Language Model (GLM) offers a way to simultaneously process graph-structured data and text. This paper proposes GLaMoR (Graph Language Model for Reasoning), a reasoning pipeline that transforms OWL ontologies into graph-structured data and adapts the GLM architecture for consistency checking. We evaluate GLaMoR on ontologies from the NCBO BioPortal repository, converting them into triples suitable for model input. Our results show that the GLM outperforms all baseline models, achieving 95% accuracy while being 20 times faster than classical reasoners. With the increasing complexity of knowledge representation and reasoning systems, ontologies play a vital role in structuring domain knowledge across various fields, e. g., biomedical expert knowledge. OWL provides a stable foundation for diverse tasks based on ontologies. OWL 2 [1] is based on the SROIQ [2] description logic, which supports complex reasoning while maintaining logical consistency. To derive additional knowledge from these ontologies, semantic reasoners are employed to infer new facts through logical entailment. These reasoners are critical in supporting key tasks such as classification, query answering, and consistency checking by leveraging formal logic systems for precise and reliable inference. A prominent example is HermiT [3], an OWL 2-compliant reasoner that uses hyper-tableau calculus to perform reasoning tasks efficiently.
Can Third-parties Read Our Emotions?
Li, Jiayi, Zhou, Yingfan, Venkit, Pranav Narayanan, Islam, Halima Binte, Arya, Sneha, Wilson, Shomir, Rajtmajer, Sarah
Natural Language Processing tasks that aim to infer an author's private states, e.g., emotions and opinions, from their written text, typically rely on datasets annotated by third-party annotators. However, the assumption that third-party annotators can accurately capture authors' private states remains largely unexamined. In this study, we present human subjects experiments on emotion recognition tasks that directly compare third-party annotations with first-party (author-provided) emotion labels. Our findings reveal significant limitations in third-party annotations-whether provided by human annotators or large language models (LLMs)-in faithfully representing authors' private states. However, LLMs outperform human annotators nearly across the board. We further explore methods to improve third-party annotation quality. We find that demographic similarity between first-party authors and third-party human annotators enhances annotation performance. While incorporating first-party demographic information into prompts leads to a marginal but statistically significant improvement in LLMs' performance. We introduce a framework for evaluating the limitations of third-party annotations and call for refined annotation practices to accurately represent and model authors' private states.
Unsupervised outlier detection to improve bird audio dataset labels
The Xeno -Canto bird audio repository is an invaluable resource for those interested in vocalizations and other sounds made by birds around the world. This is particularly the case for machine learning researchers attempting to improve on the bird species r ecognition accuracy of classification models. However, the task of extracting labeled datasets from th e recordings found in this crowd -sourced repository faces several challenges. One challenge of particular significance to machine learning practitioners i s that one bird species label is applied to each audio recording, but frequently other sounds are also captured including other bird species, other animal sounds, anthropogenic and other ambient sounds . These non -target bird species sounds can result in dataset labeling discrepanc ies referred to as label noise . In this work we present a cleaning process consisting of audio preprocessing followed by dimensionality reduction and unsupervised outlier detection (UOD) to reduce the label noise in a dataset derived from Xeno -Canto recordings . We investigate three neural network dimensionality reduction techniques: two flavors of convolutional autoencoder s and variational deep embedding (VaDE (Jiang, 2017)) . While both methods show some degree of effectiveness at detecting outliers for most bird species datasets, we f ound significant variation in the performance of the methods from one species to the next. We believe that the results of this investigation demonstrate that the application of our cleaning process can meaningfully reduce the label noise of bird species datasets derived from Xeno-Canto audio repository but results vary across species.
A Gradient-Optimized TSK Fuzzy Framework for Explainable Phishing Detection
Pentapalli, Lohith Srikanth, Salisbury, Jon, Riep, Josette, Cohen, Kelly
Phishing attacks represent an increasingly sophisticated and pervasive threat to individuals and organizations, causing significant financial losses, identity theft, and severe damage to institutional reputations. Existing phishing detection methods often struggle to simultaneously achieve high accuracy and explainability, either failing to detect novel attacks or operating as opaque black-box models. To address this critical gap, we propose a novel phishing URL detection system based on a first-order Takagi-Sugeno-Kang (TSK) fuzzy inference model optimized through gradient-based techniques. Our approach intelligently combines the interpretability and human-like reasoning capabilities of fuzzy logic with the precision and adaptability provided by gradient optimization methods, specifically leveraging the Adam optimizer for efficient parameter tuning. Experiments conducted using a comprehensive dataset of over 235,000 URLs demonstrate rapid convergence, exceptional predictive performance (accuracy averaging 99.95% across 5 cross-validation folds, with a perfect AUC i.e. 1.00). Furthermore, optimized fuzzy rules and membership functions improve interoperability, clearly indicating how the model makes decisions - an essential feature for cybersecurity applications. This high-performance, transparent, and interpretable phishing detection framework significantly advances current cybersecurity defenses, providing practitioners with accurate and explainable decision-making tools.
Explainable Deep-Learning Based Potentially Hazardous Asteroids Classification Using Graph Neural Networks
--Classifying potentially hazardous asteroids (PHAs) is crucial for planetary defense and deep space navigation, yet traditional methods often overlook the dynamical relationships among asteroids. We introduce a Graph Neural Network (GNN) approach that models asteroids as nodes with orbital and physical features, connected by edges representing their similarities, using a NASA dataset of 958,524 records. Despite an extreme class imbalance with only 0.22% of the dataset with hazardous label, our model achieves an overall accuracy of 99% and an AUC of 0.99, with a recall of 78% and an F1-score of 37% for hazardous asteroids after applying Synthetic Minority Oversampling T echnique. Feature importance analysis highlights albedo, perihelion distance, and semi-major axis as main predictors. This framework supports planetary defense missions and confirm AI's potential in enabling autonomous navigation for future missions such as NASA's NEO Surveyor and ESA's Ramses, offering an interpretable and scalable solution for asteroid hazard assessment. However, a small subset known as potentially hazardous asteroids (PHAs) follow orbits that bring them perilously close to our planet, raising the specter of catastrophic collisions. Historical events, such as the 1908 Tunguska explosion [1], which devastated over 2,000 square kilometers of Siberian forest, and the 2013 Chelyabinsk meteor [2], which injured over 1,000 people and caused widespread property damage, show the destructive potential of these celestial bodies.
A Hybrid Framework for Real-Time Data Drift and Anomaly Identification Using Hierarchical Temporal Memory and Statistical Tests
Bandyopadhyay, Subhadip, Bose, Joy, Chowdhury, Sujoy Roy
Data Drift is the phenomenon where the generating model behind the data changes over time. Due to data drift, any model built on the past training data becomes less relevant and inaccurate over time. Thus, detecting and controlling for data drift is critical in machine learning models. Hierarchical Temporal Memory (HTM) is a machine learning model developed by Jeff Hawkins, inspired by how the human brain processes information. It is a biologically inspired model of memory that is similar in structure to the neocortex, and whose performance is claimed to be comparable to state of the art models in detecting anomalies in time series data. Another unique benefit of HTMs is its independence from training and testing cycle; all the learning takes place online with streaming data and no separate training and testing cycle is required. In sequential learning paradigm, Sequential Probability Ratio Test (SPRT) offers some unique benefit for online learning and inference. This paper proposes a novel hybrid framework combining HTM and SPRT for real-time data drift detection and anomaly identification. Unlike existing data drift methods, our approach eliminates frequent retraining and ensures low false positive rates. HTMs currently work with one dimensional or univariate data. In a second study, we also propose an application of HTM in multidimensional supervised scenario for anomaly detection by combining the outputs of multiple HTM columns, one for each dimension of the data, through a neural network. Experimental evaluations demonstrate that the proposed method outperforms conventional drift detection techniques like the Kolmogorov-Smirnov (KS) test, Wasserstein distance, and Population Stability Index (PSI) in terms of accuracy, adaptability, and computational efficiency. Our experiments also provide insights into optimizing hyperparameters for real-time deployment in domains such as Telecom.
Severity Classification of Chronic Obstructive Pulmonary Disease in Intensive Care Units: A Semi-Supervised Approach Using MIMIC-III Dataset
Shojaei, Akram, Delrobaei, Mehdi
Chronic obstructive pulmonary disease (COPD) is a major global health concern, with accurate severity assessment crucial for effective management, especially in intensive care units (ICUs). This study presents a novel approach to COPD sever - ity classification using machine learning algorithms applied to the MIMIC - III dataset. Our work presents a new application of the MIMIC - III dataset and con - tributes to the growing field of artificial intelligence in critical care medicine. We developed a model to classify COPD severity based on available ICU parameters, including blood gas measurements and vital signs. Our methodology incorpo - rated semi - supervised learning techniques to leverage unlabeled data, enhancing model robustness. A random forest classifier demonstrated superior performance, achieving 92.51% accuracy and 0.98 ROC AUC distinguishing between mild - to - moderate and severe COPD cases. This approach offers a practical, accurate, and accessible tool for rapid COPD severity assessment in ICU settings, poten - tially improving clinical decision - making and patient outcomes. Future research should focus on external validation and integration into clinical decision support systems to enhance COPD management in the ICUs.
Feature Selection via GANs (GANFS): Enhancing Machine Learning Models for DDoS Mitigation
Distributed Denial of Service (DDoS) attacks represent a persistent and evolving threat to modern networked systems, capable of causing large-scale service disruptions. The complexity of such attacks, often hidden within high-dimensional and redundant network traffic data, necessitates robust and intelligent feature selection techniques for effective detection. Traditional methods such as filter-based, wrapper-based, and embedded approaches, each offer strengths but struggle with scalability or adaptability in complex attack environments. In this study, we explore these existing techniques through a detailed comparative analysis and highlight their limitations when applied to large-scale DDoS detection tasks. Building upon these insights, we introduce a novel Generative Adversarial Network-based Feature Selection (GANFS) method that leverages adversarial learning dynamics to identify the most informative features. By training a GAN exclusively on attack traffic and employing a perturbation-based sensitivity analysis on the Discriminator, GANFS effectively ranks feature importance without relying on full supervision. Experimental evaluations using the CIC-DDoS2019 dataset demonstrate that GANFS not only improves the accuracy of downstream classifiers but also enhances computational efficiency by significantly reducing feature dimensionality. These results point to the potential of integrating generative learning models into cybersecurity pipelines to build more adaptive and scalable detection systems.
Deep Learning with Pretrained 'Internal World' Layers: A Gemma 3-Based Modular Architecture for Wildfire Prediction
Jadouli, Ayoub, Amrani, Chaker El
Deep learning models, especially large Transformers, carry substantial "memory" in their intermediate layers -- an \emph{internal world} that encodes a wealth of relational and contextual knowledge. This work harnesses that internal world for wildfire occurrence prediction by introducing a modular architecture built upon Gemma 3, a state-of-the-art multimodal model. Rather than relying on Gemma 3's original embedding and positional encoding stacks, we develop a custom feed-forward module that transforms tabular wildfire features into the hidden dimension required by Gemma 3's mid-layer Transformer blocks. We freeze these Gemma 3 sub-layers -- thus preserving their pretrained representation power -- while training only the smaller input and output networks. This approach minimizes the number of trainable parameters and reduces the risk of overfitting on limited wildfire data, yet retains the benefits of Gemma 3's broad knowledge. Evaluations on a Moroccan wildfire dataset demonstrate improved predictive accuracy and robustness compared to standard feed-forward and convolutional baselines. Ablation studies confirm that the frozen Transformer layers consistently contribute to better representations, underscoring the feasibility of reusing large-model mid-layers as a learned internal world. Our findings suggest that strategic modular reuse of pretrained Transformers can enable more data-efficient and interpretable solutions for critical environmental applications such as wildfire risk management.