Performance Analysis
Optical Computation-in-Communication enables low-latency, high-fidelity perception in telesurgery
Yang, Rui, Hu, Jiaming, Zheng, Jian-Qing, Lu, Yue-Zhen, Cui, Jian-Wei, Ren, Qun, Yu, Yi-Jie, Wu, John Edward, Wang, Zhao-Yu, Lin, Xiao-Li, Zhang, Dandan, Tang, Mingchu, Masouros, Christos, Liu, Huiyun, Liu, Chin-Pang
Artificial intelligence (AI) holds significant promise for enhancing intraoperative perception and decision-making in telesurgery, where physical separation impairs sensory feedback and control. Despite advances in medical AI and surgical robotics, conventional electronic AI architectures remain fundamentally constrained by the compounded latency from serial processing of inference and communication. This limitation is especially critical in latency-sensitive procedures such as endovascular interventions, where delays over 200 ms can compromise real-time AI reliability and patient safety. Here, we introduce an Optical Computation-in-Communication (OCiC) framework that reduces end-to-end latency significantly by performing AI inference concurrently with optical communication. OCiC integrates Optical Remote Computing Units (ORCUs) directly into the optical communication pathway, with each ORCU experimentally achieving up to 69 tera-operations per second per channel through spectrally efficient two-dimensional photonic convolution. The system maintains ultrahigh inference fidelity within 0.1% of CPU/GPU baselines on classification and coronary angiography segmentation, while intrinsically mitigating cumulative error propagation, a longstanding barrier to deep optical network scalability. We validated the robustness of OCiC through outdoor dark fibre deployments, confirming consistent and stable performance across varying environmental conditions. When scaled globally, OCiC transforms long-haul fibre infrastructure into a distributed photonic AI fabric with exascale potential, enabling reliable, low-latency telesurgery across distances up to 10,000 km and opening a new optical frontier for distributed medical intelligence.
Signature in Code Backdoor Detection, how far are we?
Le, Quoc Hung, Le-Cong, Thanh, Le, Bach, Xu, Bowen
As Large Language Models (LLMs) become increasingly integrated into software development workflows, they also become prime targets for adversarial attacks. Among these, backdoor attacks are a significant threat, allowing attackers to manipulate model outputs through hidden triggers embedded in training data. Detecting such backdoors remains a challenge, and one promising approach is the use of Spectral Signature defense methods that identify poisoned data by analyzing feature representations through eigenvectors. While some prior works have explored Spectral Signatures for backdoor detection in neural networks, recent studies suggest that these methods may not be optimally effective for code models. In this paper, we revisit the applicability of Spectral Signature-based defenses in the context of backdoor attacks on code models. We systematically evaluate their effectiveness under various attack scenarios and defense configurations, analyzing their strengths and limitations. We found that the widely used setting of Spectral Signature in code backdoor detection is often suboptimal. Hence, we explored the impact of different settings of the key factors. We discovered a new proxy metric that can more accurately estimate the actual performance of Spectral Signature without model retraining after the defense.
An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation
Worae, Daniel Adu, Mastorakis, Spyridon
Internet of Things (IoT) networks generate diverse and high-volume traffic that reflects both normal activity and potential threats. Deriving meaningful insight from such telemetry requires cross-layer interpretation of behaviors, protocols, and context rather than isolated detection. This work presents an LLM-powered AI agent framework that converts raw packet captures into structured and semantically enriched representations for interactive analysis. The framework integrates feature extraction, transformer-based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval-augmented question answering. An AI agent guided by a large language model performs reasoning over the indexed traffic artifacts, assembling evidence to produce accurate and human-readable interpretations. Experimental evaluation on multiple IoT captures and six open models shows that hybrid retrieval, which combines lexical and semantic search with reranking, substantially improves BLEU, ROUGE, METEOR, and BERTScore results compared with dense-only retrieval. System profiling further indicates low CPU, GPU, and memory overhead, demonstrating that the framework achieves holistic and efficient interpretation of IoT network traffic.
Dual-attention ResNet outperforms transformers in HER2 prediction on DCE-MRI
Fridman, Naomi, Goldstein, Anat
Breast cancer is the most diagnosed cancer in women, with HER2 status critically guiding treatment decisions. Noninvasive prediction of HER2 status from dynamic contrast-enhanced MRI (DCE-MRI) could streamline diagnostics and reduce reliance on biopsy. However, preprocessing high-dynamic-range DCE-MRI into standardized 8-bit RGB format for pretrained neural networks is nontrivial, and normalization strategy significantly affects model performance. We benchmarked intensity normalization strategies using a Triple-Head Dual-Attention ResNet that processes RGB-fused temporal sequences from three DCE phases. Trained on a multicenter cohort (n=1,149) from the I-SPY trials and externally validated on BreastDCEDL_AMBL (n=43 lesions), our model outperformed transformer-based architectures, achieving 0.75 accuracy and 0.74 AUC on I-SPY test data. N4 bias field correction slightly degraded performance. Without fine-tuning, external validation yielded 0.66 AUC, demonstrating cross-institutional generalizability. These findings highlight the effectiveness of dual-attention mechanisms in capturing transferable spatiotemporal features for HER2 stratification, advancing reproducible deep learning biomarkers in breast cancer imaging.
SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models
Bhattacharjya, Debarun, Ganesan, Balaji, Lee, Junkyu, Marinescu, Radu, Mirylenka, Katsiaryna, Glass, Michael, Shou, Xiao
When does a large language model (LLM) know what it does not know? Uncertainty quantification (UQ) provides measures of uncertainty, such as an estimate of the confidence in an LLM's generated output, and is therefore increasingly recognized as a crucial component of trusted AI systems. Black-box UQ methods do not require access to internal model information from the generating LLM and therefore have numerous real-world advantages, such as robustness to system changes, adaptability to choice of LLM, reduced costs, and computational tractability. In this paper, we investigate the effectiveness of UQ techniques that are primarily but not necessarily entirely black-box, where the consistency between a generated output and other sampled generations is used as a proxy for confidence in its correctness. We propose a high-level non-verbalized similarity-based aggregation framework that subsumes a broad swath of UQ approaches suitable for complex generative tasks, as well as introduce specific novel techniques from the framework that train confidence estimation models using small training sets. Through an empirical study with datasets spanning the diverse tasks of question answering, summarization, and text-to-SQL, we demonstrate that our proposed similarity-based methods can yield better calibrated confidences than baselines.
A Linguistics-Aware LLM Watermarking via Syntactic Predictability
Park, Shinwoo, Park, Hyejin, Ahn, Hyeseon, Han, Yo-Sub
As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthen it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages-analytic English, isolating Chinese, and agglutinative Korean-we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.
Role-Aware Multi-modal federated learning system for detecting phishing webpages
Wang, Bo, Khan, Imran, White, Martin, Beloff, Natalia
We present a federated, multi-modal phishing website detector that supports URL, HTML, and IMAGE inputs without binding clients to a fixed modality at inference: any client can invoke any modality head trained elsewhere. Methodologically, we propose role-aware bucket aggregation on top of FedProx, inspired by Mixture-of-Experts and FedMM. We drop learnable routing and use hard gating (selecting the IMAGE/HTML/URL expert by sample modality), enabling separate aggregation of modality-specific parameters to isolate cross-embedding conflicts and stabilize convergence. On TR-OP, the Fusion head reaches Acc 97.5% with FPR 2.4% across two data types; on the image subset (ablation) it attains Acc 95.5% with FPR 5.9%. For text, we use GraphCodeBERT for URLs and an early three-way embedding for raw, noisy HTML. On WebPhish (HTML) we obtain Acc 96.5% / FPR 1.8%; on TR-OP (raw HTML) we obtain Acc 95.1% / FPR 4.6%. Results indicate that bucket aggregation with hard-gated experts enables stable federated training under strict privacy, while improving the usability and flexibility of multi-modal phishing detection.
Regularizing Extrapolation in Causal Inference
Arbour, David, Parikh, Harsh, Niknam, Bijan, Stuart, Elizabeth, Rudolph, Kara, Feller, Avi
Many common estimators in machine learning and causal inference are linear smoothers, where the prediction is a weighted average of the training outcomes. Some estimators, such as ordinary least squares and kernel ridge regression, allow for arbitrarily negative weights, which improve feature imbalance but often at the cost of increased dependence on parametric modeling assumptions and higher variance. By contrast, estimators like importance weighting and random forests (sometimes implicitly) restrict weights to be non-negative, reducing dependence on parametric modeling and variance at the cost of worse imbalance. In this paper, we propose a unified framework that directly penalizes the level of extrapolation, replacing the current practice of a hard non-negativity constraint with a soft constraint and corresponding hyperparameter. We derive a worst-case extrapolation error bound and introduce a novel "bias-bias-variance" tradeoff, encompassing biases due to feature imbalance, model misspecification, and estimator variance; this tradeoff is especially pronounced in high dimensions, particularly when positivity is poor. We then develop an optimization procedure that regularizes this bound while minimizing imbalance and outline how to use this approach as a sensitivity analysis for dependence on parametric modeling assumptions. We demonstrate the effectiveness of our approach through synthetic experiments and a real-world application, involving the generalization of randomized controlled trial estimates to a target population of interest.
A Multi-dimensional Semantic Surprise Framework Based on Low-Entropy Semantic Manifolds for Fine-Grained Out-of-Distribution Detection
Peng, Ningkang, Mao, Yuzhe, Zhang, Yuhao, Qian, Linjin, Yu, Qianfeng, Gu, Yanhui, Chen, Yi, Kong, Li
Out-of-Distribution (OOD) detection is a cornerstone for the safe deployment of AI systems in the open world. However, existing methods treat OOD detection as a binary classification problem, a cognitive flattening that fails to distinguish between semantically close (Near-OOD) and distant (Far-OOD) unknown risks. This limitation poses a significant safety bottleneck in applications requiring fine-grained risk stratification. To address this, we propose a paradigm shift from a conventional probabilistic view to a principled information-theoretic framework. We formalize the core task as quantifying the Semantic Surprise of a new sample and introduce a novel ternary classification challenge: In-Distribution (ID) vs. Near-OOD vs. Far-OOD. The theoretical foundation of our work is the concept of Low-Entropy Semantic Manifolds, which are explicitly structured to reflect the data's intrinsic semantic hierarchy. To construct these manifolds, we design a Hierarchical Prototypical Network. We then introduce the Semantic Surprise Vector (SSV), a universal probe that decomposes a sample's total surprise into three complementary and interpretable dimensions: conformity, novelty, and ambiguity. To evaluate performance on this new task, we propose the Normalized Semantic Risk (nSR), a cost-sensitive metric. Experiments demonstrate that our framework not only establishes a new state-of-the-art (sota) on the challenging ternary task, but its robust representations also achieve top results on conventional binary benchmarks, reducing the False Positive Rate by over 60% on datasets like LSUN.
Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity
Tang, MingZe, Jacob, Jubal Chandy
Recent Vision-Language Models (VLMs) enable zero-shot classification by aligning images and text in a shared space, a promising approach for data-scarce conditions. However, the influence of prompt design on recognizing visually similar categories, such as human postures, is not well understood. This study investigates how prompt specificity affects the zero-shot classification of sitting, standing, and walking/running on a small, 285-image COCO-derived dataset. A suite of modern VLMs, including OpenCLIP, MetaCLIP 2, and SigLip, were evaluated using a three-tiered prompt design that systematically increases linguistic detail. Our findings reveal a compelling, counter-intuitive trend: for the highest-performing models (MetaCLIP 2 and OpenCLIP), the simplest, most basic prompts consistently achieve the best results. Adding descriptive detail significantly degrades performance for instance, MetaCLIP 2's multi-class accuracy drops from 68.8\% to 55.1\% a phenomenon we term "prompt overfitting". Conversely, the lower-performing SigLip model shows improved classification on ambiguous classes when given more descriptive, body-cue-based prompts.