Performance Analysis
Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems
Agentic artificial intelligence (AI) -- multi-agent systems that combine large language models with external tools and autonomous planning -- are rapidly transitioning from research laboratories into high-stakes domains. Our earlier "Basic" paper introduced a five-axis framework and proposed preliminary metrics such as goal drift and harm reduction but did not provide an algorithmic instantiation or empirical evidence. This "Advanced" sequel fills that gap. First, we revisit recent benchmarks and industrial deployments to show that technical metrics still dominate evaluations: a systematic review of 84 papers from 2023--2025 found that 83% report capability metrics while only 30% consider human-centred or economic axes [2]. Second, we formalise an Adaptive Multi-Dimensional Monitoring (AMDM) algorithm that normalises heterogeneous metrics, applies per-axis exponentially weighted moving-average thresholds and performs joint anomaly detection via the Mahalanobis distance [7]. Third, we conduct simulations and real-world experiments. AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces false-positive rates from 4.5% to 0.9% compared with static thresholds. We present a comparison table and ROC/PR curves, and we reanalyse case studies to surface missing metrics. Code, data and a reproducibility checklist accompany this paper to facilitate replication. The code supporting this work is available at https://github.com/Manishms18/Adaptive-Multi-Dimensional-Monitoring.
SigmaScheduling: Uncertainty-Informed Scheduling of Decision Points for Intelligent Mobile Health Interventions
Gazi, Asim H., Gullapalli, Bhanu Teja, Gao, Daiqi, Marlin, Benjamin M., Shetty, Vivek, Murphy, Susan A.
Timely decision making is critical to the effectiveness of mobile health (mHealth) interventions. At predefined timepoints called "decision points," intelligent mHealth systems such as just-in-time adaptive interventions (JITAIs) estimate an individual's biobehavioral context from sensor or survey data and determine whether and how to intervene. For interventions targeting habitual behavior (e.g., oral hygiene), effectiveness often hinges on delivering support shortly before the target behavior is likely to occur. Current practice schedules decision points at a fixed interval (e.g., one hour) before user-provided behavior times, and the fixed interval is kept the same for all individuals. However, this one-size-fits-all approach performs poorly for individuals with irregular routines, often scheduling decision points after the target behavior has already occurred, rendering interventions ineffective. In this paper, we propose SigmaScheduling, a method to dynamically schedule decision points based on uncertainty in predicted behavior times. When behavior timing is more predictable, SigmaScheduling schedules decision points closer to the predicted behavior time; when timing is less certain, SigmaScheduling schedules decision points earlier, increasing the likelihood of timely intervention. We evaluated SigmaScheduling using real-world data from 68 participants in a 10-week trial of Oralytics, a JITAI designed to improve daily toothbrushing. SigmaScheduling increased the likelihood that decision points preceded brushing events in at least 70% of cases, preserving opportunities to intervene and impact behavior. Our results indicate that SigmaScheduling can advance precision mHealth, particularly for JITAIs targeting time-sensitive, habitual behaviors such as oral hygiene or dietary habits.
Towards Edge-Based Idle State Detection in Construction Machinery Using Surveillance Cameras
Kรผpers, Xander, Brinke, Jeroen Klein, Bemthuis, Rob, Incel, Ozlem Durmaz
The construction industry faces significant challenges in optimizing equipment utilization, as underused machinery leads to increased operational costs and project delays. Accurate and timely monitoring of equipment activity is therefore key to identifying idle periods and improving overall efficiency. This paper presents the Edge-IMI framework for detecting idle construction machinery, specifically designed for integration with surveillance camera systems. The proposed solution consists of three components: object detection, tracking, and idle state identification, which are tailored for execution on resource-constrained, CPU-based edge computing devices. The performance of Edge-IMI is evaluated using a combined dataset derived from the ACID and MOCS benchmarks. Experimental results confirm that the object detector achieves an F1 score of 71.75%, indicating robust real-world detection capabilities. The logistic regression-based idle identification module reliably distinguishes between active and idle machinery with minimal false positives. Integrating all three modules, Edge-IMI enables efficient on-site inference, reducing reliance on high-bandwidth cloud services and costly hardware accelerators. We also evaluate the performance of object detection models on Raspberry Pi 5 and an Intel NUC platforms, as example edge computing platforms. We assess the feasibility of real-time processing and the impact of model optimization techniques.
Multilingual Collaborative Defense for Large Language Models
Li, Hongliang, Xu, Jinan, Cui, Gengping, Guan, Changhao, Mo, Fengran, Huang, Kaiyu
The robustness and security of large language models (LLMs) has become a prominent research area. One notable vulnerability is the ability to bypass LLM safeguards by translating harmful queries into rare or underrepresented languages, a simple yet effective method of "jailbreaking" these models. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to enhance multilingual safety. In this work, we investigate the correlation between various attack features across different languages and propose Multilingual Collaborative Defense (MCD), a novel learning method that optimizes a continuous, soft safety prompt automatically to facilitate multilingual safeguarding of LLMs. The MCD approach offers three advantages: First, it effectively improves safeguarding performance across multiple languages. Second, MCD maintains strong generalization capabilities while minimizing false refusal rates. Third, MCD mitigates the language safety misalignment caused by imbalances in LLM training corpora. To evaluate the effectiveness of MCD, we manually construct multilingual versions of commonly used jailbreak benchmarks, such as MaliciousInstruct and AdvBench, to assess various safeguarding methods. Additionally, we introduce these datasets in underrepresented (zero-shot) languages to verify the language transferability of MCD. The results demonstrate that MCD outperforms existing approaches in safeguarding against multilingual jailbreak attempts while also exhibiting strong language transfer capabilities. Our code is available at https://github.com/HLiang-Lee/MCD.
Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms
Khattak, Gul Rukh, Patlatzoglou, Konstantinos, Barker, Joseph, Pastika, Libor, Zeidaabadi, Boroumand, El-Medany, Ahmed, Aggour, Hesham, Liang, Yixiu, Ribeiro, Antonio H., Annis, Jeffrey, Ribeiro, Antonio Luiz Pinho, Ge, Junbo, Kramer, Daniel B., Waks, Jonathan W., Brittain, Evan, Peters, Nicholas, Ng, Fu Siong, Sau, Arunashis
Department of Cardiology, Imperial College Healthcare NHS Trust, London, United Kingdom Disclosures: JWW and DBK were previously on the advisory board for Heartcor solutions LLC, forwhom they remain independent consultants. JWW reports research funding fromAnumana and is a consultant for HeartBeam Inc. FSN reports speaker fees from GEhealthcare and is on the advisory board for Astra Zeneca. The remaining authorshave no conflicts to declare. Heart and Lung Institute, Imperial College LondonHammersmith Campus Du Cane RoadLondon W12 0NN Abstract Contrastive learning is a widely adopted self-supervised pretraining strategy, yet itsdependence on cohort composition remains underexplored. We systematically assess how cohort demographics,health status, and population diversity influence the downstream performance forprediction tasks also including two additional cohorts from another continent (Europe).We find that downstream performance depends on the distributional properties of thepretraining cohort, including demographics and health status. Moreover, whilepretraining with a multi-centre, demographically diverse cohort improves in-distributionaccuracy, it reduces out-of-distribution (OOD) generalisation of our contrastiveapproach by encoding cohort-specific artifacts.
Uncertainty-Aware Tabular Prediction: Evaluating VBLL-Enhanced TabPFN in Safety-Critical Medical Data
Predictive models are being increasingly used across a wide range of domains, including safety-critical applications such as medical diagnosis and criminal justice. Reliable uncertainty estimation is a crucial task in such settings. Tabular Prior-data Fitted Network (TabPFN) is a recently proposed machine learning foundation model for tabular dataset, which uses a generative transformer architecture. Variational Bayesian Last Layers (VBLL) is a state-of-the-art lightweight variational formulation that effectively improves uncertainty estimation with minimal computational overhead. In this work we aim to evaluate the performance of VBLL integrated with the recently proposed TabPFN in uncertainty calibration. Our experiments, conducted on three benchmark medical tabular datasets, compare the performance of the original TabPFN and the VBLL-integrated version. Contrary to expectations, we observed that original TabPFN consistently outperforms VBLL integrated TabPFN in uncertainty calibration across all datasets.
Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees
Zeighami, Sepanta, Shankar, Shreya, Parameswaran, Aditya
Large Language Models (LLMs) are being increasingly used as a building block in data systems to process large text datasets. To do so, LLM model providers offer multiple LLMs with different sizes, spanning various cost-quality trade-offs when processing text at scale. Top-of-the-line LLMs (e.g., GPT-4o, Claude Sonnet) operate with high accuracy but are prohibitively expensive when processing many records. To avoid high costs, more affordable but lower quality LLMs (e.g., GPT-4o-mini, Claude Haiku) can be used to process records, but we need to ensure that the overall accuracy does not deviate substantially from that of the top-of-the-line LLMs. The model cascade framework provides a blueprint to manage this trade-off, by using the confidence of LLMs in their output (e.g., log-probabilities) to decide on which records to use the affordable LLM. However, existing solutions following this framework provide only marginal cost savings and weak theoretical guarantees because of poor estimation of the quality of the affordable LLM's outputs. We present BARGAIN, a method that judiciously uses affordable LLMs in data processing to significantly reduce cost while providing strong theoretical guarantees on the solution quality. BARGAIN employs a novel adaptive sampling strategy and statistical estimation procedure that uses data and task characteristics and builds on recent statistical tools to make accurate estimations with tight theoretical guarantees. Variants of BARGAIN can support guarantees on accuracy, precision, or recall of the output. Experimental results across 8 real-world datasets show that BARGAIN reduces cost, on average, by up to 86% more than state-of-the-art, while providing stronger theoretical guarantees on accuracy of output, with similar gains when guaranteeing a desired level of precision or recall.
A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design
Jones, Haydn Thomas, Maus, Natalie, Ludan, Josh Magnus, Huan, Maggie Ziyu, Liang, Jiaming, Torres, Marcelo Der Torossian, Liang, Jiatao, Ives, Zachary, Barash, Yoseph, de la Fuente-Nunez, Cesar, Gardner, Jacob R., Yatskar, Mark
AI-driven discovery can greatly reduce design time and enhance new therapeutics' effectiveness. Models using simulators explore broad design spaces but risk violating implicit constraints due to a lack of experimental priors. For example, in a new analysis we performed on a diverse set of models on the GuacaMol benchmark using supervised classifiers, over 60\% of molecules proposed had high probability of being mutagenic. In this work, we introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines for discovering therapeutic entities in relevant paragraphs and summarizing information in concise fair-use facts. Medex consists of 32.3 million pairs of natural language facts, and appropriate entity representations (i.e. SMILES or refseq IDs). To demonstrate the potential of the data, we train LLM, CLIP, and LLava architectures to reason jointly about text and design targets and evaluate on tasks from the Therapeutic Data Commons (TDC). Medex is highly effective for creating models with strong priors: in supervised prediction problems that use our data as pretraining, our best models with 15M learnable parameters outperform larger 2B TxGemma on both regression and classification TDC tasks, and perform comparably to 9B models on average. Models built with Medex can be used as constraints while optimizing for novel molecules in GuacaMol, resulting in proposals that are safer and nearly as effective. We release our dataset at https://huggingface.co/datasets/medexanon/Medex, and will provide expanded versions as available literature grows.
The Precautionary Principle and the Innovation Principle: Incompatible Guides for AI Innovation Governance?
In policy debates concerning the governance and regulation of Artificial Intelligence (AI), both the Precautionary Principle (PP) and the Innovation Principle (IP) are advocated by their respective interest groups. Do these principles offer wholly incompatible and contradictory guidance? Does one necessarily negate the other? I argue here that provided attention is restricted to weak-form PP and IP, the answer to both of these questions is "No." The essence of these weak formulations is the requirement to fully account for type-I error costs arising from erroneously preventing the innovation's diffusion through society (i.e. mistaken regulatory red-lighting) as well as the type-II error costs arising from erroneously allowing the innovation to diffuse through society (i.e. mistaken regulatory green-lighting). Within the Signal Detection Theory (SDT) model developed here, weak-PP red-light (weak-IP green-light) determinations are optimal for sufficiently small (large) ratios of expected type-I to type-II error costs. For intermediate expected cost ratios, an amber-light 'wait-and-monitor' policy is optimal. Regulatory sandbox instruments allow AI testing and experimentation to take place within a structured environment of limited duration and societal scale, whereby the expected cost ratio falls within the 'wait-and-monitor' range. Through sandboxing regulators and innovating firms learn more about the expected cost ratio, and what respective adaptations -- of regulation, of technical solution, of business model, or combination thereof, if any -- are needed to keep the ratio out of the weak-PP red-light zone. Nevertheless AI foundation models are ill-suited for regulatory sandboxing as their general-purpose nature precludes credible identification of misclassification costs.
Self-supervised Learning Of Visual Pose Estimation Without Pose Labels By Classifying LED States
Carlotti, Nicholas, Nava, Mirko, Giusti, Alessandro
We introduce a model for monocular RGB relative pose estimation of a ground robot that trains from scratch without pose labels nor prior knowledge about the robot's shape or appearance. At training time, we assume: (i) a robot fitted with multiple LEDs, whose states are independent and known at each frame; (ii) knowledge of the approximate viewing direction of each LED; and (iii) availability of a calibration image with a known target distance, to address the ambiguity of monocular depth estimation. Training data is collected by a pair of robots moving randomly without needing external infrastructure or human supervision. Our model trains on the task of predicting from an image the state of each LED on the robot. In doing so, it learns to predict the position of the robot in the image, its distance, and its relative bearing. At inference time, the state of the LEDs is unknown, can be arbitrary, and does not affect the pose estimation performance. Quantitative experiments indicate that our approach: is competitive with SoA approaches that require supervision from pose labels or a CAD model of the robot; generalizes to different domains; and handles multi-robot pose estimation.