Performance Analysis
RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning
Ji, Deyi, Yang, Yuekui, Wu, Haiyang, Ma, Shaoping, Chen, Tianrun, Zhu, Lanyun
Advertisement (Ad) video violation detection is critical for ensuring platform compliance, but existing methods struggle with precise temporal grounding, noisy annotations, and limited generalization. We propose RAVEN, a novel framework that integrates curriculum reinforcement learning with multimodal large language models (MLLMs) to enhance reasoning and cognitive capabilities for violation detection. RAVEN employs a progressive training strategy, combining precisely and coarsely annotated data, and leverages Group Relative Policy Optimization (GRPO) to develop emergent reasoning abilities without explicit reasoning annotations. Multiple hierarchical sophisticated reward mechanism ensures precise temporal grounding and consistent category prediction. Experiments on industrial datasets and public benchmarks show that RAVEN achieves superior performances in violation category accuracy and temporal interval localization. We also design a pipeline to deploy the RAVEN on the online Ad services, and online A/B testing further validates its practical applicability, with significant improvements in precision and recall. RAVEN also demonstrates strong generalization, mitigating the catastrophic forgetting issue associated with supervised fine-tuning.
Learning to Watermark: A Selective Watermarking Framework for Large Language Models via Multi-Objective Optimization
Wang, Chenrui, Shu, Junyi, Chiu, Billy, Li, Yu, Alharbi, Saleh, Zhang, Min, Li, Jing
The rapid development of LLMs has raised concerns about their potential misuse, leading to various watermarking schemes that typically offer high detectability. However, existing watermarking techniques often face trade-off between watermark detectability and generated text quality. In this paper, we introduce Learning to Watermark (LTW), a novel selective watermarking framework that leverages multi-objective optimization to effectively balance these competing goals. LTW features a lightweight network that adaptively decides when to apply the watermark by analyzing sentence embeddings, token entropy, and current watermarking ratio. Training of the network involves two specifically constructed loss functions that guide the model toward Pareto-optimal solutions, thereby harmonizing watermark detectability and text quality. By integrating LTW with two baseline watermarking methods, our experimental evaluations demonstrate that LTW significantly enhances text quality without compromising detectability. Our selective watermarking approach offers a new perspective for designing watermarks for LLMs and a way to preserve high text quality for watermarks. The code is publicly available at: https://github.com/fattyray/learning-to-watermark
Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models
Abdaljalil, Samir, Serpedin, Erchin, Qaraqe, Khalid, Kurban, Hasan
Large language models (LLMs) often generate reasoning traces that appear coherent but rest on unsupported assumptions, leading to hallucinated conclusions. Prior work mainly addresses factual hallucinations or relies on post-hoc verification, leaving reasoning-induced hallucinations largely unaddressed. We propose Audit-of-Understanding (AoU), a framework that constrains inference to validated premises through three phases: (1) decomposing a query into candidate assumptions, (2) auditing their support, and (3) conditioning inference only on the validated subset. Formally, AoU is \emph{posterior-constrained inference}, connecting to selective prediction and rejection learning. Our contributions are threefold: (i) theoretical guarantees under perfect validation, (ii) excess-risk bounds under imperfect audits, and (iii) tractability analysis. Empirically, AoU improves both accuracy and faithfulness on GSM8K, MultiArith, and SVAMP, achieving up to +30% gains on GSM8K, +45% on MultiArith, and consistent +20--28% improvements on SVAMP over Chain-of-Thought, Self-Consistency, and CoT-Decoding. Code is available at https://anonymous.4open.science/r/audit-of-understanding-E28B.
Robust Pan-Cancer Mitotic Figure Detection with YOLOv12
Bourgade, Raphaรซl, Balezo, Guillaume, Feki, Hana, Monier, Lily, Blons, Matthieu, Blondel, Alice, Loussouarn, Delphine, Vincent-Salomon, Anne, Walter, Thomas
Detecting mitotic figures (MFs) in histopathology images remains a challenging task. Their quantification traditionally relies on the manual identification of "hot spots" by pathologists, followed by visual counting--an approach that is inherently subjective and may not reliably reflect the true prolifer-ative activity of a tumor. With the rise of digital pathology and artificial intelligence, numerous efforts have been made to automate mitosis detection in order to enhance accuracy, reproducibility, and scalability. Among these, the MItosis DOmain Generalization (MIDOG) challenges have emerged as a key benchmark for evaluating the generalizability of detection algorithms under realistic domain shifts. The 2021 edition (1) addressed scanner-induced variability using breast cancer WSIs, while the 2022 edition (2) extended the scope to include multiple tissue types and species, introducing further biological diversity. The 2025 MIDOG challenge (3) builds on these foundations with the most comprehensive mitosis-annotated dataset to date, and introduces two tasks: (1) detecting mitotic figures in arbitrary tumor tissue, and (2) determining whether a mitotic figure is atypical or normal. These tasks represent a significant step toward developing robust mitosis detection systems that generalize across diverse and complex histological conditions. In this work, we present a high-performance detection pipeline based on the YOLOv12 object detection architecture.
Robust Anomaly Detection through Multi-Modal Autoencoder Fusion for Small Vehicle Damage Detection
Khan, Sara, Yรผksel, Mehmed, Kirchner, Frank
Wear and tear detection in fleet and shared vehicle systems is a critical challenge, particularly in rental and car-sharing services, where minor damage, such as dents, scratches, and underbody impacts, often goes unnoticed or is detected too late. Currently, manual inspection methods are the default approach, but are labour-intensive and prone to human error. In contrast, state-of-the-art image-based methods are less reliable when the vehicle is moving, and they cannot effectively capture underbody damage due to limited visual access and spatial coverage. This work introduces a novel multi-modal architecture based on anomaly detection to address these issues. Sensors such as Inertial Measurement Units (IMUs) and microphones are integrated into a compact device mounted on the vehicle's windshield. This approach supports real-time damage detection while avoiding the need for highly resource-intensive sensors. We developed multiple variants of multi-modal autoencoder-based architectures and evaluated them against unimodal and state-of-the-art methods. Our multi-modal ensemble model with pooling achieved the highest performance, with a Receiver Operating Characteristic-Area Under Curve (ROC-AUC) of 92%, demonstrating its effectiveness in real-world applications. This approach can also be extended to other applications, such as improving automotive safety. It can integrate with airbag systems for efficient deployment and help autonomous vehicles by complementing other sensors in collision detection.
"Mirror" Language AI Models of Depression are Criterion-Contaminated
Li, Tong, Hussain, Rasiq, Gupta, Mehak, Oltmanns, Joshua R.
Recent studies show near-perfect language-based predictions of depression scores (R2 = .70), but these "Mirror" models rely on language responses directly from depression assessments to predict depression assessment scores. These methods suffer from criterion contamination that inflate prediction estimates. We compare "Mirror" models to "Non-Mirror" models, which use other external language to predict depression scores. 110 participants completed both structured diagnostic (Mirror condition) and life history (Non-Mirror condition) interviews. LLMs were prompted to predict diagnostic depression scores. As expected, Mirror models were near-perfect. However, Non-Mirror models also displayed prediction sizes considered large in psychology. Further, both Mirror and Non-Mirror predictions correlated with other questionnaire-based depression symptoms at similar sizes, suggesting bias in Mirror models. Topic modeling revealed different theme structures across model types. As language models for depression continue to evolve, incorporating Non-Mirror approaches may support more valid and clinically useful language-based AI applications in psychological assessment.
AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents
Luo, Hanjun, Dai, Shenyu, Ni, Chiming, Li, Xinfeng, Zhang, Guibin, Wang, Kun, Liu, Tongliang, Salam, Hanan
Despite the rapid advancement of LLM-based agents, the reliable evaluation of their safety and security remains a significant challenge. Existing rule-based or LLM-based evaluators often miss dangers in agents' step-by-step actions, overlook subtle meanings, fail to see how small issues compound, and get confused by unclear safety or security rules. To overcome this evaluation crisis, we introduce AgentAuditor, a universal, training-free, memory-augmented reasoning framework that empowers LLM evaluators to emulate human expert evaluators. AgentAuditor constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain-of-thought reasoning traces for past interactions. A multi-stage, context-aware retrieval-augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator's assessment of new cases. Moreover, we developed ASSEBench, the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats. ASSEBench comprises 2293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A key feature of ASSEBench is its nuanced approach to ambiguous risk situations, employing "Strict" and "Lenient" judgment standards. Experiments demonstrate that AgentAuditor not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly accessible at https://github.com/Astarojth/AgentAuditor.
Exploiting Meta-Learning-based Poisoning Attacks for Graph Link Prediction
Li, Mingchen, Zhuang, Di, Chen, Keyu, Samaraweera, Dumindu, Chang, Morris
Abstract--Link prediction in graph data uses various algorithms and Graph Nerual Network (GNN) models to predict potential relationships between graph nodes. These techniques have found widespread use in numerous real-world applications, including recommendation systems, community/social networks, and biological structures. However, recent research has highlighted the vulnerability of GNN models to adversarial attacks, such as poisoning and evasion attacks. Addressing the vulnerability of GNN models is crucial to ensure stable and robust performance in GNN applications. Although many works have focused on enhancing the robustness of node classification on GNN models, the robustness of link prediction has received less attention. T o bridge this gap, this article introduces an unweighted graph poisoning attack that leverages meta-learning with weighted scheme strategies to degrade the link prediction performance of GNNs. We conducted comprehensive experiments on diverse datasets across multiple link prediction applications to evaluate the proposed method and its parameters, comparing it with existing approaches under similar conditions. Our results demonstrate that our approach significantly reduces link prediction performance and consistently outperforms other state-of-the-art baselines. Many real-world datasets such as social networks [1], traffic networks [2], biological structures [3], and e-commerce networks [4] can be represented as graphs containing nodes, links, and associated features. The structure of these graphs is dynamic, constantly changing as relationships evolve. Capturing these changes is fundamental to the success of various real-world scenarios, such as community detection and recommendation systems [5].
Handling Extreme Class Imbalance: Using GANs in Data Augmentation for Suicide Prediction
Visweswaraiah, Vaishnavi, Banerjee, Tanvi, Romine, William
Suicide prediction is the key for prevention, but real data with sufficient positive samples is rare and causes extreme class imbalance. We utilized machine learning (ML) to build the model and deep learning (DL) techniques, like Generative Adversarial Networks (GAN), to generate synthetic data samples to enhance the dataset. The initial dataset contained 656 samples, with only four positive cases, prompting the need for data augmentation. A variety of machine learning models, ranging from interpretable data models to black box algorithmic models, were used. On real test data, Logistic Regression (LR) achieved a weighted precision of 0.99, a weighted recall of 0.85, and a weighted F1 score of 0.91; Random Forest (RF) showed 0.98, 0.99, and 0.99, respectively; and Support Vector Machine (SVM) achieved 0.99, 0.76, and 0.86. LR and SVM correctly identified one suicide attempt case (sensitivity:1.0) and misclassified LR(20) and SVM (31) non-attempts as attempts (specificity: 0.85 & 0.76, respectively). RF identified 0 suicide attempt cases (sensitivity: 0.0) with 0 false positives (specificity: 1.0). These results highlight the models' effectiveness, with GAN playing a key role in generating synthetic data to support suicide prevention modeling efforts.
Formally Exploring Time-Series Anomaly Detection Evaluation Metrics
Wagner, Dennis, Nair, Arjun, Franks, Billy Joe, Arweiler, Justus, Muraleedharan, Aparna, Jungjohann, Indra, Hartung, Fabian, Ahuja, Mayank C., Balinskyy, Andriy, Varshneya, Saurabh, Syed, Nabeel Hussain, Nagda, Mayank, Liznerski, Phillip, Reithermann, Steffen, Rudolph, Maja, Vollmer, Sebastian, Schulz, Ralf, Katz, Torsten, Mandt, Stephan, Bortz, Michael, Leitte, Heike, Neider, Daniel, Burger, Jakob, Jirasek, Fabian, Hasse, Hans, Fellenz, Sophie, Kloft, Marius
Undetected anomalies in time series can trigger catastrophic failures in safety-critical systems, such as chemical plant explosions or power grid outages. Although many detection methods have been proposed, their performance remains unclear because current metrics capture only narrow aspects of the task and often yield misleading results. We address this issue by introducing verifiable properties that formalize essential requirements for evaluating time-series anomaly detection. These properties enable a theoretical framework that supports principled evaluations and reliable comparisons. Analyzing 37 widely used metrics, we show that most satisfy only a few properties, and none satisfy all, explaining persistent inconsistencies in prior results. To close this gap, we propose LARM, a flexible metric that provably satisfies all properties, and extend it to ALARM, an advanced variant meeting stricter requirements.