Performance Analysis
Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting
Fillies, Jan, Hoffmann, Michael Peter, Reichel, Rebecca, Salzwedel, Roman, Bodemer, Sven, Paschke, Adrian
A lack of demographic context in existing toxic speech datasets limits our understanding of how different age groups communicate online. In collaboration with funk, a German public service content network, this research introduces the first large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates. The dataset includes 3,024 human-annotated and 30,024 LLM-annotated anonymized comments from Instagram, TikTok, and YouTube. To ensure relevance, comments were consolidated using predefined toxic keywords, resulting in 16.7\% labeled as problematic. The annotation pipeline combined human expertise with state-of-the-art language models, identifying key categories such as insults, disinformation, and criticism of broadcasting fees. The dataset reveals age-based differences in toxic speech patterns, with younger users favoring expressive language and older users more often engaging in disinformation and devaluation. This resource provides new opportunities for studying linguistic variation across demographics and supports the development of more equitable and age-aware content moderation systems.
Normalisation of SWIFT Message Counterparties with Feature Extraction and Clustering
Schoinas, Thanasis, Guinard, Benjamin, Esbati, Diba, Chalk, Richard
Short text clustering is a known use case in the text analytics community. When the structure and content falls in the natural language domain e.g. Twitter posts or instant messages, then natural language techniques can be used, provided texts are of sufficient length to allow for use of (pre)trained models to extract meaningful information, such as part-of-speech or topic annotations. However, natural language models are not suitable for clustering transaction counterparties, as they are found in bank payment messaging systems, such as SWIFT. The manually typed tags are typically physical or legal entity details, which lack sentence structure, while containing all the variations and noise that manual entry introduces. This leaves a gap in an investigator or counter-fraud professional's toolset when looking to augment their knowledge of payment flow originator and beneficiary entities and trace funds and assets. A gap that vendors traditionally try to close with fuzzy matching tools. With these considerations in mind, we are proposing a hybrid string similarity, topic modelling, hierarchical clustering and rule-based pipeline to facilitate clustering of transaction counterparties, also catering for unknown number of expected clusters. We are also devising metrics to supplement the evaluation of the approach, based on the well-known measures of precision and recall. Testing on a real-life labelled dataset demonstrates significantly improved performance over a baseline rule-based ('keyword') approach. The approach retains most of the interpretability found in rule-based systems, as the former adds an additional level of cluster refinement to the latter. The resulting workflow reduces the need for manual review. When only a subset of the population needs to be investigated, such as in sanctions investigations, the approach allows for better control of the risks of missing entity variations.
Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS$^2$-based Proteomics
Xu, Hao, Wang, Zhichao, Sang, Shengqi, Wajanasara, Pisit, Bandeira, Nuno
Proteins perform nearly all cellular functions and constitute most drug targets, making their analysis fundamental to understanding human biology in health and disease. Tandem mass spectrometry (MS$^2$) is the major analytical technique in proteomics that identifies peptides by ionizing them, fragmenting them, and using the resulting mass spectra to identify and quantify proteins in biological samples. In MS$^2$ analysis, peptide fragment ion probability prediction plays a critical role, enhancing the accuracy of peptide identification from mass spectra as a complement to the intensity information. Current approaches rely on global statistics of fragmentation, which assumes that a fragment's probability is uniform across all peptides. Nevertheless, this assumption is oversimplified from a biochemical principle point of view and limits accurate prediction. To address this gap, we present Pep2Prob, the first comprehensive dataset and benchmark designed for peptide-specific fragment ion probability prediction. The proposed dataset contains fragment ion probability statistics for 608,780 unique precursors (each precursor is a pair of peptide sequence and charge state), summarized from more than 183 million high-quality, high-resolution, HCD MS$^2$ spectra with validated peptide assignments and fragmentation annotations. We establish baseline performance using simple statistical rules and learning-based methods, and find that models leveraging peptide-specific information significantly outperform previous methods using only global fragmentation statistics. Furthermore, performance across benchmark models with increasing capacities suggests that the peptide-fragmentation relationship exhibits complex nonlinearities requiring sophisticated machine learning approaches.
Improving Hospital Risk Prediction with Knowledge-Augmented Multimodal EHR Modeling
Datta, Rituparna, Cui, Jiaming, Guan, Zihan, Reddy, Vishal G., Eby, Joshua C., Madden, Gregory, Silwal, Rupesh, Vullikanti, Anil
Accurate prediction of clinical outcomes using Electronic Health Records (EHRs) is critical for early intervention, efficient resource allocation, and improved patient care. EHRs contain multimodal data, including both structured data and unstructured clinical notes that provide rich, context-specific information. In this work, we introduce a unified framework that seamlessly integrates these diverse modalities, leveraging all relevant available information through a two-stage architecture for clinical risk prediction. In the first stage, a fine-tuned Large Language Model (LLM) extracts crucial, task-relevant information from clinical notes, which is enhanced by graph-based retrieval of external domain knowledge from sources such as a medical corpus like PubMed, grounding the LLM's understanding. The second stage combines both unstructured representations and features derived from the structured data to generate the final predictions. This approach supports a wide range of clinical tasks. Here, we demonstrate its effectiveness on 30-day readmission and in-hospital mortality prediction. Experimental results show that our framework achieves strong performance, with AUC scores of $0.84$ and $0.92$, respectively, despite these tasks involving severely imbalanced datasets, with positive rates ranging from approximately $4\%$ to $13\%$. Moreover, it outperforms all existing baselines and clinical practices, including established risk scoring systems. To the best of our knowledge, this is one of the first frameworks for healthcare prediction which enhances the power of an LLM-based graph-guided knowledge retrieval method by combining it with structured data for improved clinical outcome prediction.
Encoding Tactile Stimuli for Organoid Intelligence in Braille Recognition
Liu, Tianyi, Philamore, Hemma, Ward-Cherrier, Benjamin
This study proposes a generalizable encoding strategy that maps tactile sensor data to electrical stimulation patterns, enabling neural organoids to perform an open-loop artificial tactile Braille classification task. Human forebrain organoids cultured on a low-density microelectrode array (MEA) are systematically stimulated to characterize the relationship between electrical stimulation parameters (number of pulse, phase amplitude, phase duration, and trigger delay) and organoid responses, measured as spike activity and spatial displacement of the center of activity. Implemented on event-based tactile inputs recorded from the Evetac sensor, our system achieved an average Braille letter classification accuracy of 61 percent with a single organoid, which increased significantly to 83 percent when responses from a three-organoid ensemble were combined. Additionally, the multi-organoid configuration demonstrated enhanced robustness against various types of artificially introduced noise. This research demonstrates the potential of organoids as low-power, adaptive bio-hybrid computational elements and provides a foundational encoding framework for future scalable bio-hybrid computing architectures.
JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring
Chu, Junjie, Li, Mingjie, Yang, Ziqing, Leng, Ye, Lin, Chenhao, Shen, Chao, Backes, Michael, Shen, Yun, Zhang, Yang
Accurately determining whether a jailbreak attempt has succeeded is a fundamental yet unresolved challenge. Existing evaluation methods rely on misaligned proxy indicators or naive holistic judgments. They frequently misinterpret model responses, leading to inconsistent and subjective assessments that misalign with human perception. To address this gap, we introduce JADES (Jailbreak Assessment via Decompositional Scoring), a universal jailbreak evaluation framework. Its key mechanism is to automatically decompose an input harmful question into a set of weighted sub-questions, score each sub-answer, and weight-aggregate the sub-scores into a final decision. JADES also incorporates an optional fact-checking module to strengthen the detection of hallucinations in jailbreak responses. We validate JADES on JailbreakQR, a newly introduced benchmark proposed in this work, consisting of 400 pairs of jailbreak prompts and responses, each meticulously annotated by humans. In a binary setting (success/failure), JADES achieves 98.5% agreement with human evaluators, outperforming strong baselines by over 9%. Re-evaluating five popular attacks on four LLMs reveals substantial overestimation (e.g., LAA's attack success rate on GPT-3.5-Turbo drops from 93% to 69%). Our results show that JADES could deliver accurate, consistent, and interpretable evaluations, providing a reliable basis for measuring future jailbreak attacks.
Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML
Paxton, Kuniko, Aslansefat, Koorosh, Akagiฤ, Amila, Thakker, Dhavalkumar, Papadopoulos, Yiannis
Recent advancements in skin lesion classification models have significantly improved accuracy, with some models even surpassing dermatologists' diagnostic performance. However, in medical practice, distrust in AI models remains a challenge. Beyond high accuracy, trustworthy, explainable diagnoses are essential. Existing explainability methods have reliability issues, with LIME-based methods suffering from inconsistency, while CAM-based methods failing to consider all classes. To address these limitations, we propose Global Class Activation Probabilistic Map Evaluation, a method that analyses all classes' activation probability maps probabilistically and at a pixel level. By visualizing the diagnostic process in a unified manner, it helps reduce the risk of misdiagnosis. Furthermore, the application of SafeML enhances the detection of false diagnoses and issues warnings to doctors and patients as needed, improving diagnostic reliability and ultimately patient safety. We evaluated our method using the ISIC datasets with MobileNetV2 and Vision Transformers.
Signs of Struggle: Spotting Cognitive Distortions across Language and Register
Kuber, Abhishek, Liscio, Enrico, Zhang, Ruixuan, Figueroa, Caroline, Murukannaiah, Pradeep K.
Rising mental health issues among youth have increased interest in automated approaches for detecting early signs of psychological distress in digital text. One key focus is the identification of cognitive distortions, irrational thought patterns that have a role in aggravating mental distress. Early detection of these distortions may enable timely, low-cost interventions. While prior work has focused on English clinical data, we present the first in-depth study of cross-lingual and cross-register generalization of cognitive distortion detection, analyzing forum posts written by Dutch adolescents. Our findings show that while changes in language and writing style can significantly affect model performance, domain adaptation methods show the most promise.
Evaluating the Usefulness of Non-Diagnostic Speech Data for Developing Parkinson's Disease Classifiers
Zhong, Terry Yi, Janse, Esther, Tejedor-Garcia, Cristian, Bosch, Louis ten, Larson, Martha
Speech-based Parkinson's disease (PD) detection has gained attention for its automated, cost-effective, and non-intrusive nature. As research studies usually rely on data from diagnostic-oriented speech tasks, this work explores the feasibility of diagnosing PD on the basis of speech data not originally intended for diagnostic purposes, using the Turn-Taking (TT) dataset. Our findings indicate that TT can be as useful as diagnostic-oriented PD datasets like PC-GIT A. We also investigate which specific dataset characteristics impact PD classification performance. The results show that concatenating audio recordings and balancing participants' gender and status distributions can be beneficial. Cross-dataset evaluation reveals that models trained on PC-GIT A generalize poorly to TT, whereas models trained on TT perform better on PC-GIT A. Furthermore, we provide insights into the high variability across folds, which is mainly due to large differences in individual speaker performance.
A Novel Framework for Automated Explain Vision Model Using Vision-Language Models
Nguyen, Phu-Vinh, Pham, Tan-Hanh, Ngo, Chris, Hy, Truong Son
The development of many vision models mainly focuses on improving their performance using metrics such as accuracy, IoU, and mAP, with less attention to explainability due to the complexity of applying xAI methods to provide a meaningful explanation of trained models. Although many existing xAI methods aim to explain vision models sample-by-sample, methods explaining the general behavior of vision models, which can only be captured after running on a large dataset, are still underexplored. Furthermore, understanding the behavior of vision models on general images can be very important to prevent biased judgments and help identify the model's trends and patterns. With the application of Vision-Language Models, this paper proposes a pipeline to explain vision models at both the sample and dataset levels. The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort, thereby integrating vision model development with xAI analysis to advance image analysis.