Performance Analysis
Improved Vehicle Maneuver Prediction using Game Theoretic Priors
Conventional maneuver prediction methods apply a classification model to temporal trajectory data to predict the behavior of agents over a set time horizon. Even with the best precision and recall, these models cannot predict a lane change accurately unless they incorporate information about the entire scene. Level-k game theory uses human-like hierarchical reasoning to find the most rational decision each agent in a group can make. It can therefore model the interactions between nearby vehicles and compute the most rational decision for each agent. The result of the game-theoretic evaluation can be used as a "prior" or combined with a traditional motion-based classification model to achieve more accurate predictions. The proposed approach assumes that the states of the vehicles around the target lead vehicle are known. The module outputs the most rational maneuver prediction for the target vehicle based on an online optimization solution. These predictions are instrumental in decision-making systems such as Adaptive Cruise Control (ACC) or Traxen's iQ-Cruise, further improving the resulting fuel savings.
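One way to read the "prior" idea above is as Bayesian fusion: the game-theoretic module supplies a probability for each maneuver, and the motion-based classifier supplies a likelihood-style score. The sketch below is a minimal illustration under that assumption. The abstract does not state how the two are combined, and the function name, maneuver labels, and numbers are all hypothetical.

```python
def fuse_prior_with_classifier(prior, classifier_scores):
    """Combine a prior over maneuvers with classifier scores via Bayes' rule.

    Both arguments are dicts keyed by maneuver name (hypothetical labels).
    Returns a normalized posterior over the same maneuvers.
    """
    unnormalized = {m: prior[m] * classifier_scores[m] for m in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("prior and classifier scores have no overlapping support")
    return {m: value / total for m, value in unnormalized.items()}


# Illustrative numbers only: the game-theoretic prior favors a left lane change
# because of the surrounding traffic, while the trajectory classifier alone
# still leans slightly toward lane keeping.
game_theoretic_prior = {"keep_lane": 0.2, "change_left": 0.7, "change_right": 0.1}
trajectory_scores = {"keep_lane": 0.5, "change_left": 0.4, "change_right": 0.1}

posterior = fuse_prior_with_classifier(game_theoretic_prior, trajectory_scores)
most_likely = max(posterior, key=posterior.get)
```

In this toy case the fused prediction switches to the lane change even though the classifier alone would not, which is the kind of scene-level correction the abstract describes.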
Towards Minimal Causal Representations for Human Multimodal Language Understanding
Jiang, Menghua, Jiang, Yuncheng, Hu, Haifeng, Mai, Sijie
Human Multimodal Language Understanding (MLU) aims to infer human intentions by integrating related cues from heterogeneous modalities. Existing works predominantly follow a "learning to attend" paradigm, which maximizes mutual information between data and labels to enhance predictive performance. However, such methods are vulnerable to unintended dataset biases, causing models to conflate statistical shortcuts with genuine causal features and resulting in degraded out-of-distribution (OOD) generalization. To alleviate this issue, we introduce a Causal Multimodal Information Bottleneck (CaMIB) model that leverages causal principles rather than traditional likelihood. Concretely, we first apply the information bottleneck to filter unimodal inputs, removing task-irrelevant noise. A parameterized mask generator then disentangles the fused multimodal representation into causal and shortcut subrepresentations. To ensure global consistency of causal features, we incorporate an instrumental variable constraint, and further adopt backdoor adjustment by randomly recombining causal and shortcut features to stabilize causal estimation. Extensive experiments on multimodal sentiment analysis, humor detection, and sarcasm detection, along with OOD test sets, demonstrate the effectiveness of CaMIB. Theoretical and empirical analyses further highlight its interpretability and soundness.
Exploring the Relationships Between Physiological Signals During Automated Fatigue Detection
Kakhi, Kourosh, Khosravi, Abbas, Alizadehsani, Roohallah, Acharya, U. Rajendra
Background: Fatigue detection through physiological signals has gained growing relevance across safety-critical domains such as transportation, healthcare, and human performance monitoring. While many studies focus on individual modalities (e.g., EEG or ECG), limited attention has been given to investigating statistical relationships between signal pairs as a means to enhance classification robustness. This study explores how inter-signal statistical features (correlation, cross-correlation, and covariance) computed across multiple physiological signals can support fatigue state prediction. Methodology: Using the DROZY dataset, we extracted pairwise statistical features from four physiological signals: ECG, EMG, EOG, and EEG. Fifteen distinct signal combinations were evaluated, covering uni-modal to multi-modal configurations. Feature extraction emphasized statistical relationships between signals rather than raw amplitude characteristics. The extracted features were fed into four supervised machine learning classifiers: Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and XGBoost (XGB). Performance was assessed using accuracy, precision, recall, and area under the curve (AUC). Additionally, SHAP (SHapley Additive exPlanations) values were computed to evaluate feature importance and interpret model behavior. Results: Among all classifiers and signal combinations, XGBoost applied to the EMG|EEG combination achieved the highest classification performance, with an accuracy of 0.888 and an AUC of 0.975. SHAP-based ranking revealed that the correlation between ECG and EOG-H was the most influential feature across models. Feature interaction plots indicated non-linear relationships between statistical measures and fatigue levels. The multi-signal approach consistently outperformed single-signal models, with combinations involving EEG and EMG contributing most significantly to predictive power.
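The three pairwise features named in the abstract (correlation, cross-correlation, and covariance) can be sketched for a single pair of signals as follows. This is a minimal illustration in plain Python, not the authors' extraction pipeline: the function names, the lag window, and the synthetic signals are assumptions, and no DROZY data is used.

```python
import math


def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson correlation coefficient at zero lag."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


def peak_cross_correlation(x, y, max_lag):
    """Largest-magnitude Pearson correlation over lags in [-max_lag, max_lag]."""
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[lag:], y[:len(y) - lag]
        else:
            a, b = x[:len(x) + lag], y[-lag:]
        r = correlation(a, b)
        if abs(r) > abs(best):
            best = r
    return best


def pair_features(x, y, max_lag=10):
    """Hypothetical feature vector for one signal pair."""
    return {
        "covariance": covariance(x, y),
        "correlation": correlation(x, y),
        "peak_cross_correlation": peak_cross_correlation(x, y, max_lag),
    }


# Synthetic stand-ins for two physiological channels: the second is the first
# delayed by five samples, so only the lagged feature reveals the full coupling.
signal_a = [math.sin(0.3 * t) for t in range(200)]
signal_b = [math.sin(0.3 * (t - 5)) for t in range(200)]
features = pair_features(signal_a, signal_b)
```

The delayed pair has a low zero-lag correlation but a peak cross-correlation close to 1, which suggests why lag-aware features can carry information that plain correlation misses.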
Semantic F1 Scores: Fair Evaluation Under Fuzzy Class Boundaries
Chochlakis, Georgios, Trager, Jackson, Jhaveri, Vedant, Ravichandran, Nikhil, Potamianos, Alexandros, Narayanan, Shrikanth
We propose Semantic F1 Scores, novel evaluation metrics for subjective or fuzzy multi-label classification that quantify semantic relatedness between predicted and gold labels. Unlike the conventional F1 metrics that treat semantically related predictions as complete failures, Semantic F1 incorporates a label similarity matrix to compute soft precision-like and recall-like scores, from which the Semantic F1 scores are derived. Unlike existing similarity-based metrics, our novel two-step precision-recall formulation enables the comparison of label sets of arbitrary sizes without discarding labels or forcing matches between dissimilar labels. By granting partial credit for semantically related but nonidentical labels, Semantic F1 better reflects the realities of domains marked by human disagreement or fuzzy category boundaries. In this way, it provides fairer evaluations: it recognizes that categories overlap, that annotators disagree, and that downstream decisions based on similar predictions lead to similar outcomes. Through theoretical justification and extensive empirical validation on synthetic and real data, we show that Semantic F1 demonstrates greater interpretability and ecological validity. Because it requires only a domain-appropriate similarity matrix, which is robust to misspecification, and not a rigid ontology, it is applicable across tasks and modalities.
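To make the idea of partial credit concrete, the sketch below computes a soft precision and a soft recall from a label similarity matrix by crediting each label with its best match on the other side. This is a simplified best-match variant written for illustration. It is not the paper's two-step formulation, whose details the abstract does not give, and the labels and similarity values are made up.

```python
def soft_f1(predicted, gold, similarity):
    """Illustrative soft F1 using best-match similarity (assumed variant).

    predicted, gold: lists of label names.
    similarity: nested dict, similarity[a][b] in [0, 1], with 1 on the diagonal.
    """
    if not predicted and not gold:
        return 1.0  # nothing to predict and nothing predicted
    if not predicted or not gold:
        return 0.0
    # Precision-like score: how well each predicted label is supported by gold.
    precision = sum(max(similarity[p][g] for g in gold) for p in predicted) / len(predicted)
    # Recall-like score: how well each gold label is covered by the predictions.
    recall = sum(max(similarity[p][g] for p in predicted) for g in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical emotion labels with a made-up similarity matrix.
similarity = {
    "anger": {"anger": 1.0, "annoyance": 0.8, "joy": 0.0},
    "annoyance": {"anger": 0.8, "annoyance": 1.0, "joy": 0.0},
    "joy": {"anger": 0.0, "annoyance": 0.0, "joy": 1.0},
}

related_score = soft_f1(["annoyance"], ["anger"], similarity)
unrelated_score = soft_f1(["joy"], ["anger"], similarity)
exact_score = soft_f1(["anger", "joy"], ["anger", "joy"], similarity)
```

A conventional F1 would score both the "annoyance" and the "joy" prediction as complete failures against the gold label "anger"; the soft variant separates the near miss from the unrelated prediction.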
AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit
Zhu, Yi, Guimarães, Heitor R., Pimentel, Arthur, Falk, Tiago
With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, most models are evaluated on a narrow set of datasets, leaving their generalization to real-world conditions uncertain. In this paper, we systematically review 28 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across these 28 datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable differences across conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gap relative to practical deployment scenarios.
C-QUERI: Congressional Questions, Exchanges, and Responses in Institutions Dataset
Rudra, Manjari, Magleby, Daniel, Sikdar, Sujoy
Questions in political interviews and hearings serve strategic purposes beyond information gathering, including advancing partisan narratives and shaping public perceptions. However, these strategic aspects remain understudied due to the lack of large-scale datasets for studying such discourse. Congressional hearings provide an especially rich and tractable site for studying political questioning: interactions are structured by formal rules, witnesses are obliged to respond, and members with different political affiliations are guaranteed opportunities to ask questions, enabling comparisons of behaviors across the political spectrum. We develop a pipeline to extract question-answer pairs from unstructured hearing transcripts and construct a novel dataset of committee hearings from the 108th to 117th Congress. Our analysis reveals systematic differences in questioning strategies across parties by showing that the party affiliation of questioners can be predicted from their questions alone. Our dataset and methods not only advance the study of congressional politics, but also provide a general framework for analyzing question-answering across interview-like settings.
Uncertainty-Aware Knowledge Tracing Models
Mitton, Joshua, Bhattacharyya, Prarthana, Abboud, Ralph, Woodhead, Simon
Research on Knowledge Tracing (KT) models focuses mainly on model development aimed at improving predictive accuracy. Most of these models make the most incorrect predictions when students choose a distractor, so student errors go undetected. We present an approach that adds new capabilities to KT models by capturing predictive uncertainty, and we demonstrate that larger predictive uncertainty aligns with incorrect model predictions. We show that uncertainty in KT models is informative and that this signal would be pedagogically useful in an educational learning platform deployed in a resource-limited setting where understanding student ability is necessary.
VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
Waheed, Abdul, Wu, Zhen, Alharthi, Dareen, Kim, Seungone, Raj, Bhiksha
Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the fineness of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, a 3B and 7B-sized MLLM judge specialized to evaluate outputs from video understanding models (i.e., text responses conditioned on videos). To train VideoJudge, our recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses not matching the evaluator's rating are discarded. Across three out of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B). Notably, we find that LLM judges (Qwen3) perform worse than MLLM judges (Qwen2.5-VL) and that long chain-of-thought reasoning does not improve performance, indicating that providing video inputs is crucial for evaluating video understanding tasks.
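The generator-evaluator filtering step described above can be sketched as a simple loop: generate a response for each target rating, score it with the evaluator, and keep it only if the two ratings agree. This is a minimal sketch under stated assumptions. The function names, the integer rating scale, and the toy generator and evaluator are hypothetical stand-ins for the actual models, which are not called here.

```python
from typing import Callable, List, Tuple


def bootstrap_training_pairs(
    generate: Callable[[int], str],
    evaluate: Callable[[str], int],
    target_ratings: List[int],
) -> List[Tuple[str, int]]:
    """Keep only responses whose evaluator rating equals the target rating."""
    kept = []
    for target in target_ratings:
        response = generate(target)   # response conditioned on the target rating
        rating = evaluate(response)   # independent rating from the evaluator
        if rating == target:
            kept.append((response, target))
        # Otherwise the response is discarded, as the abstract describes.
    return kept


# Toy stand-ins (assumptions): the generator embeds the target in its text, and
# the evaluator reads it back but disagrees on top-rated responses.
def toy_generate(target: int) -> str:
    return f"response written to deserve rating {target}"


def toy_evaluate(response: str) -> int:
    claimed = int(response.rsplit(" ", 1)[-1])
    return 3 if claimed == 5 else claimed


pairs = bootstrap_training_pairs(toy_generate, toy_evaluate, [1, 2, 3, 4, 5])
kept_ratings = [rating for _, rating in pairs]
```

Here the response generated for a target rating of 5 is dropped because the toy evaluator rates it 3, leaving four consistent pairs for training.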