severity level
A.1 Conjugate Derivations. Cross-Entropy Loss: $L(h, y) = -\sum_{i=1}^{c} \cdots$ ... $\sum_{i=1}^{c} y_i = 1$ is satisfied, otherwise $f^{*}(y) = \infty$ by duality. A.2 Experiments on Binary Classification with Exponential Loss. Here we present the results on a binary classification task over a synthetic dataset of 100-dimensional Gaussian clusters. For $\Sigma$, similar to [23], we sample a diagonal matrix $D$, where each entry is sampled uniformly from a specified range, and a rotation matrix $U$ from a Haar distribution, giving $\Sigma = U D U^{\top}$. For the source data, we sample $\mu^{s}_{-1}, \mu^{s}_{+1}, \Sigma^{s}_{-1}, \Sigma^{s}_{+1}$ as specified above with $k = 0$. To create distribution-shifted data of various severity, we sample $\mu^{t}_{-1}, \mu^{t}_{+1}, \Sigma^{t}_{-1}, \Sigma^{t}_{+1}$ as specified above with $k = 1$, which are then used to sample the shifted data as follows: ... Exponential Loss for Binary Classification. Let $z$ be the classification score $h_{\theta}(x)$. For logistic training loss, the conjugate adaptation loss would default to entropy with sigmoid probability.
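The covariance construction described in A.2 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, the eigenvalue range (0.5 to 2.0), and the standard-normal placeholder mean are assumptions, since the excerpt does not state them. The QR-based sampler draws from the Haar measure on orthogonal matrices, which may include reflections; that does not affect $\Sigma = U D U^{\top}$.

```python
import numpy as np


def sample_haar_orthogonal(dim, rng):
    """Draw an orthogonal matrix from the Haar distribution via QR."""
    gaussian = rng.standard_normal((dim, dim))
    q, r = np.linalg.qr(gaussian)
    # Resolve the sign ambiguity of QR so that q is Haar-distributed.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs  # scales column j of q by signs[j]


def sample_covariance(dim, low, high, rng):
    """Build Sigma = U D U^T with uniform diagonal D and Haar-distributed U."""
    eigenvalues = rng.uniform(low, high, size=dim)
    u = sample_haar_orthogonal(dim, rng)
    sigma = (u * eigenvalues) @ u.T  # (U D) U^T
    return (sigma + sigma.T) / 2.0   # remove floating-point asymmetry


rng = np.random.default_rng(0)
dim = 100
# The range [0.5, 2.0] is a hypothetical choice for illustration only.
sigma = sample_covariance(dim, low=0.5, high=2.0, rng=rng)
# Placeholder mean: the excerpt does not say how the means are sampled.
mu = rng.standard_normal(dim)
samples = rng.multivariate_normal(mu, sigma, size=500)
```

Because $U$ is orthogonal, the eigenvalues of the resulting $\Sigma$ are exactly the sampled diagonal entries, so the chosen range directly controls the conditioning of each cluster.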
YouTubePD: A Multimodal Benchmark for Parkinson's Disease Analysis Supplementary Material
We include all our annotations and extracted landmarks. This ensures that we uphold the highest standards of ethical data usage. In Table A1, we summarize the severity label distribution in YouTubePD. We also summarize the demographic distribution in YouTubePD, split between PD-positive subjects and healthy control (HC), or PD-negative, subjects. This decision is based on the clinician's suggestion, since an accurate UPDRS facial expression rating would require more ... This strategy also allows for a finer classification.
Supplement: Single Model Uncertainty Estimation via Stochastic Data Centering APPENDIX
For demonstration, let us consider the 1D regression example shown in Figure 1 and train UQ models under different training sample sizes (5, 10, 50, and 200, respectively). The figure illustrates the predicted function and the associated uncertainty estimates (shaded region around the predictions). ''' model: network trained with anchoring; anchors: set of randomly chosen anchors (ideally from train dist.) For each case, we show the negative log-likelihood for the test data obtained using each of the methods. Note that all metrics were computed as an average over 20 random trials of a 0.8/0.2 train-test split.
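The docstring fragment above suggests an inference routine that takes an anchored model and a set of anchors. A minimal sketch of what such a routine might look like, together with a Gaussian negative log-likelihood, is given below. The input encoding (anchor concatenated with the residual `x - c`), the function names, and the toy model are assumptions made for illustration; the excerpt itself does not specify them.

```python
import numpy as np


def anchored_predict(model, x, anchors):
    """Average predictions over anchors; the spread serves as the uncertainty.

    model:   callable on arrays of shape (n, 2 * d); a hypothetical stand-in
             for a network trained with anchoring
    anchors: array of shape (k, d), ideally drawn from the training data
    """
    preds = []
    for c in anchors:
        # Assumed encoding: the anchor concatenated with the residual x - c.
        inputs = np.concatenate([np.broadcast_to(c, x.shape), x - c], axis=1)
        preds.append(model(inputs))
    preds = np.stack(preds)  # shape (k, n)
    return preds.mean(axis=0), preds.std(axis=0)


def gaussian_nll(y, mean, std, eps=1e-6):
    """Mean negative log-likelihood of y under N(mean, std^2)."""
    var = std ** 2 + eps  # eps guards against zero predicted variance
    return float(np.mean(0.5 * np.log(2 * np.pi * var) + (y - mean) ** 2 / (2 * var)))


# Toy stand-in model: adds anchor and residual, so it recovers x exactly.
toy_model = lambda inputs: inputs[:, 0] + inputs[:, 1]

rng = np.random.default_rng(0)
x_test = rng.uniform(-1.0, 1.0, size=(8, 1))
anchors = rng.uniform(-1.0, 1.0, size=(10, 1))
mean, std = anchored_predict(toy_model, x_test, anchors)
```

With this toy model the predictions do not depend on the anchor, so the estimated spread is essentially zero; a trained network would generally disagree across anchors, and that disagreement is what the shaded regions would visualize.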
Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches
Phukon, Bornali, Zheng, Xiuwen, Hasegawa-Johnson, Mark
Traditional ASR metrics like WER and CER fail to capture intelligibility, especially for dysarthric and dysphonic speech, where semantic alignment matters more than exact word matches. ASR systems struggle with these speech types, often producing errors like phoneme repetitions and imprecise consonants, yet the meaning remains clear to human listeners. We identify two key challenges: (1) Existing metrics do not adequately reflect intelligibility, and (2) while LLMs can refine ASR output, their effectiveness in correcting ASR transcripts of dysarthric speech remains underexplored. To address this, we propose a novel metric integrating Natural Language Inference (NLI) scores, semantic similarity, and phonetic similarity. Our ASR evaluation metric achieves a 0.890 correlation with human judgments on Speech Accessibility Project data, surpassing traditional methods and emphasizing the need to prioritize intelligibility over error-based measures.
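To make the contrast concrete, the sketch below computes WER with a standard word-level edit distance and, separately, a weighted mean of three similarity scores. The abstract does not say how the NLI, semantic, and phonetic scores are integrated, so `combined_score` and its equal weights are purely hypothetical placeholders rather than the proposed metric.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def combined_score(nli, semantic, phonetic, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted mean of three scores, each assumed to lie in [0, 1]."""
    total = sum(weights)
    return (weights[0] * nli + weights[1] * semantic + weights[2] * phonetic) / total


# A repeated word and a near-miss ending: the meaning is intact, yet WER is high.
wer = word_error_rate("please turn on the light",
                      "please turn turn on the lights")
score = combined_score(nli=0.9, semantic=0.8, phonetic=0.7)
```

In this example one insertion and one substitution against a five-word reference give a WER of 0.4, even though a listener would understand the request.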
VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Palaskar, Shruti, Gatys, Leon, Abdelrahman, Mona, Jacobo, Mar, Lindsey, Larry, Moharir, Rutika, Lund, Gunnar, Xu, Yang, Shiee, Navid, Bigham, Jeffrey, Maalouf, Charles, Cheng, Joseph Yitan
Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
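The trade-off between over-blocking and under-refusal can be illustrated with a small helper that computes refusal rates per severity label. The record fields (`severity`, `refused`), the label names, and the toy data are assumptions for illustration; they are not taken from the benchmark.

```python
def refusal_rate(records, severity):
    """Fraction of items with the given severity label that the model refused."""
    subset = [r for r in records if r["severity"] == severity]
    if not subset:
        raise ValueError(f"no records with severity {severity!r}")
    return sum(r["refused"] for r in subset) / len(subset)


# Toy records with hypothetical field names and labels.
records = [
    {"severity": "borderline", "refused": True},
    {"severity": "borderline", "refused": False},
    {"severity": "borderline", "refused": False},
    {"severity": "borderline", "refused": False},
    {"severity": "unsafe", "refused": True},
    {"severity": "unsafe", "refused": True},
    {"severity": "unsafe", "refused": True},
    {"severity": "unsafe", "refused": False},
]

# Over-blocking: refusing borderline content that deserves engagement.
over_blocking = refusal_rate(records, "borderline")
# Under-refusal is the complement of the refusal rate on unsafe content.
unsafe_refusal = refusal_rate(records, "unsafe")
under_refusal = 1.0 - unsafe_refusal
```

Under this reading, a prompt change that lowers `over_blocking` is only an improvement if `unsafe_refusal` does not fall with it.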
Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances
Singh, Rishu Kumar, Shreya, Navneet, Das, Sarmistha, Singh, Apoorva, Saha, Sriparna
Existing approaches to complaint analysis largely rely on unimodal, short-form content such as tweets or product reviews. This work advances the field by leveraging multimodal, multi-turn customer support dialogues, where users often share both textual complaints and visual evidence (e.g., screenshots, product photos) to enable fine-grained classification of complaint aspects and severity. We introduce VALOR, a Validation-Aware Learner with Expert Routing, tailored for this multimodal setting. It employs a multi-expert reasoning setup using large-scale generative models with Chain-of-Thought (CoT) prompting for nuanced decision-making. To ensure coherence between modalities, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy. In alignment with the United Nations Sustainable Development Goals (UN SDGs), the proposed framework supports SDG 9 (Industry, Innovation and Infrastructure) by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure. Further, by enabling structured analysis of complaint narratives and visual context, it contributes to SDG 12 (Responsible Consumption and Production) by promoting more responsive product design and improved accountability in consumer services. We evaluate VALOR on a curated multimodal complaint dataset annotated with fine-grained aspect and severity labels, showing that it consistently outperforms baseline models, especially in complex complaint scenarios where information is distributed across text and images. This study underscores the value of multimodal interaction and expert validation in practical complaint understanding systems. Resources related to data and codes are available here: https://github.com/sarmistha-D/VALOR
Test Time Adaptation Using Adaptive Quantile Recalibration
Mehrbod, Paria, Vianna, Pedro, Nanfack, Geraldin, Wolf, Guy, Belilovsky, Eugene
Domain adaptation is a key strategy for enhancing the generalizability of deep learning models in real-world scenarios, where test distributions often diverge significantly from the training domain. However, conventional approaches typically rely on prior knowledge of the target domain or require model retraining, limiting their practicality in dynamic or resource-constrained environments. Recent test-time adaptation methods based on batch normalization statistic updates allow for unsupervised adaptation, but they often fail to capture complex activation distributions and are constrained to specific normalization layers. We propose Adaptive Quantile Recalibration (AQR), a test-time adaptation technique that modifies pre-activation distributions by aligning quantiles on a channel-wise basis. AQR captures the full shape of activation distributions and generalizes across architectures employing BatchNorm, GroupNorm, or LayerNorm. To address the challenge of estimating distribution tails under varying batch sizes, AQR incorporates a robust tail calibration strategy that improves stability and precision. Our method leverages source-domain statistics computed at training time, enabling unsupervised adaptation without retraining models. Experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C across multiple architectures demonstrate that AQR achieves robust adaptation across diverse settings, outperforming existing test-time adaptation baselines. These results highlight AQR's potential for deployment in real-world scenarios with dynamic and unpredictable data distributions.
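Channel-wise quantile alignment can be illustrated with generic quantile mapping in NumPy. This is a sketch of the general idea only, not the AQR implementation: the function name, array shapes, and probability grid are assumptions, and the tail calibration strategy mentioned in the abstract is omitted.

```python
import numpy as np


def quantile_align(x, source_quantiles, probs):
    """Map each channel of x onto stored source-domain quantiles.

    x:                test-time pre-activations, shape (n, channels)
    source_quantiles: source quantile values, shape (channels, k)
    probs:            the k increasing probability levels in [0, 1]
    """
    aligned = np.empty_like(x, dtype=float)
    for c in range(x.shape[1]):
        target_quantiles = np.quantile(x[:, c], probs)
        # Piecewise-linear map: test quantile -> matching source quantile.
        aligned[:, c] = np.interp(x[:, c], target_quantiles, source_quantiles[c])
    return aligned


rng = np.random.default_rng(0)
probs = np.linspace(0.0, 1.0, 101)

# Source statistics, computed once (for example at training time).
source = rng.standard_normal((10_000, 4))
source_quantiles = np.quantile(source, probs, axis=0).T  # shape (4, 101)

# A shifted and rescaled test batch, mapped back toward the source shape.
test_batch = 2.0 + 3.0 * rng.standard_normal((512, 4))
aligned = quantile_align(test_batch, source_quantiles, probs)
```

Unlike updating only a mean and variance, the mapping uses the whole set of stored quantiles, so skewed or heavy-tailed channels are matched in shape as well as in location and scale. With small batches the extreme quantiles are estimated from very few points, which is the situation the abstract's tail calibration is meant to address.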