Performance Analysis
Thai Semantic End-of-Turn Detection for Real-Time Voice Agents
Popit, Thanapol, Rungseesiripak, Natthapath, Charattrakool, Monthol, Ruangtanusak, Saksorn
Fluid voice-to-voice interaction requires reliable and low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and fail under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text-only end-of-turn (EOT) detection for real-time agents. We compare zero-shot and few-shot prompting of compact LLMs to supervised fine-tuning of lightweight transformers. Using transcribed subtitles from the YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final particles), we formulate EOT as a binary decision over token boundaries. We report a clear accuracy-latency tradeoff and provide a public-ready implementation plan. This work establishes a Thai baseline and demonstrates that small, fine-tuned models can deliver near-instant EOT decisions suitable for on-device agents.
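One way to read "a binary decision over token boundaries" is sketched below. The abstract does not say how training labels are derived, so this assumes each subtitle segment is one complete turn and that tokenization has already been done. The function name `eot_examples` and the sample utterance are hypothetical.

```python
from typing import List, Tuple


def eot_examples(tokens: List[str]) -> List[Tuple[str, int]]:
    """Turn one complete utterance into (prefix, label) pairs, one per token boundary.

    Assumption: the utterance is a full turn, so only the last boundary is EOT (1).
    """
    examples = []
    for i in range(1, len(tokens) + 1):
        # Thai is written without spaces between words, so tokens are joined directly.
        prefix = "".join(tokens[:i])
        label = 1 if i == len(tokens) else 0
        examples.append((prefix, label))
    return examples


# Hypothetical pre-tokenized utterance ending in the polite sentence-final particle "ครับ".
tokens = ["ขอ", "กาแฟ", "หนึ่ง", "แก้ว", "ครับ"]
for prefix, label in eot_examples(tokens):
    print(label, prefix)
```

A classifier trained on such pairs would be queried after each newly transcribed token, and the agent would respond once the predicted label is 1.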
Optimizing Resources for On-the-Fly Label Estimation with Multiple Unknown Medical Experts
Bary, Tim, Godelaine, Tiffanie, Abels, Axel, Macq, Benoît
Accurate ground truth estimation in medical screening programs often relies on coalitions of experts and peer second opinions. Algorithms that efficiently aggregate noisy annotations can enhance screening workflows, particularly when data arrive continuously and expert proficiency is initially unknown. However, existing algorithms do not meet the requirements for seamless integration into screening pipelines. We therefore propose an adaptive approach for real-time annotation that (I) supports on-the-fly labeling of incoming data, (II) operates without prior knowledge of medical experts or pre-labeled data, and (III) dynamically queries additional experts based on the latent difficulty of each instance. The method incrementally gathers expert opinions until a confidence threshold is met, providing accurate labels with reduced annotation overhead. We evaluate our approach on three multi-annotator classification datasets across different modalities. Results show that our adaptive querying strategy reduces the number of expert queries by up to 50% while achieving accuracy comparable to a non-adaptive baseline. Our code is available at https://github.com/tbary/MEDICS
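The idea of gathering opinions until a confidence threshold is met can be illustrated with a minimal sketch. This is not the MEDICS algorithm: it assumes binary labels and a single fixed, assumed expert accuracy (`assumed_acc`), whereas the abstract describes expert proficiency as initially unknown. All names here are hypothetical.

```python
from typing import Iterable, Tuple


def adaptive_label(
    votes: Iterable[int],
    assumed_acc: float = 0.8,
    threshold: float = 0.95,
) -> Tuple[int, float, int]:
    """Consume binary expert votes one at a time until the posterior is confident.

    Returns (label, confidence, number_of_queries).
    """
    p1 = 0.5  # uniform prior on P(label = 1)
    n = 0
    for v in votes:
        n += 1
        # Symmetric noise model: an expert is right with probability assumed_acc.
        like1 = assumed_acc if v == 1 else 1 - assumed_acc
        like0 = 1 - assumed_acc if v == 1 else assumed_acc
        p1 = p1 * like1 / (p1 * like1 + (1 - p1) * like0)
        if max(p1, 1 - p1) >= threshold:
            break  # confident enough: stop querying further experts
    label = 1 if p1 >= 0.5 else 0
    return label, max(p1, 1 - p1), n


print(adaptive_label([1, 1, 1, 1, 1]))  # experts agree: stops early
print(adaptive_label([1, 0, 1, 0, 1]))  # experts disagree: uses every vote
```

Under these assumptions, instances on which experts agree stop after a few queries, while contested instances consume more opinions, which is the source of the savings in annotation effort.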
On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection
He, Weiqing, Li, Xiang, Shang, Tianqi, Shen, Li, Su, Weijie, Long, Qi
Large language models (LLMs) raise concerns about content authenticity and integrity because they can generate human-like text at scale. Text watermarks, which embed detectable statistical signals into generated text, offer a provable way to verify content origin. Many detection methods rely on pivotal statistics that are i.i.d. under human-written text, making goodness-of-fit (GoF) tests a natural tool for watermark detection. However, GoF tests remain largely underexplored in this setting. In this paper, we systematically evaluate eight GoF tests across three popular watermarking schemes, using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods. We find that general GoF tests can improve both the detection power and robustness of watermark detectors. Notably, we observe that text repetition, common in low-temperature settings, gives GoF tests a unique advantage not exploited by existing methods. Our results highlight that classic GoF tests are a simple yet powerful and underused tool for watermark detection in LLMs.
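As an illustration of a GoF test used as a watermark detector, the sketch below applies a one-sample Kolmogorov-Smirnov test. It assumes, for illustration only, that the pivotal statistics follow Uniform(0, 1) under human-written text; the abstract does not name the eight tests it evaluates. The skewed "watermarked" sample is simulated and hypothetical, and the p-value uses the asymptotic Kolmogorov distribution, which is only an approximation for finite samples.

```python
import math
import random
from typing import Sequence


def ks_uniform(pivots: Sequence[float]) -> float:
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0, 1)."""
    xs = sorted(pivots)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x; the Uniform(0, 1) CDF is x.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


def ks_pvalue(d: float, n: int, terms: int = 100) -> float:
    """Asymptotic p-value from the Kolmogorov distribution (approximate for finite n)."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here and the true value is ~1
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * s))


random.seed(0)
human = [random.random() for _ in range(200)]               # null: uniform pivots
watermarked = [random.random() ** 0.5 for _ in range(200)]  # hypothetical skew toward 1
for name, sample in (("human", human), ("watermarked", watermarked)):
    d = ks_uniform(sample)
    print(name, round(d, 3), ks_pvalue(d, len(sample)))
```

A small p-value rejects the hypothesis that the text is human-written; the detector needs only the pivotal statistics, not the language model itself.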
WAFFLE: A Wearable Approach to Bite Timing Estimation in Robot-Assisted Feeding
Padmanabha, Akhil, Yuan, Jessie, Mehta, Tanisha, Jenamani, Rajat Kumar, Hu, Eric, de León, Victoria, Wertz, Anthony, Gupta, Janavi, Dodson, Ben, Yan, Yunting, Majidi, Carmel, Bhattacharjee, Tapomayukh, Erickson, Zackory
Millions of people around the world need assistance with feeding. Robotic feeding systems offer the potential to enhance autonomy and quality of life for individuals with impairments and reduce caregiver workload. However, their widespread adoption has been limited by technical challenges such as estimating bite timing, the appropriate moment for the robot to transfer food to a user's mouth. In this work, we introduce WAFFLE: Wearable Approach For Feeding with LEarned bite timing, a system that accurately predicts bite timing by leveraging wearable sensor data to be highly reactive to natural user cues such as head movements, chewing, and talking. We train a supervised regression model on bite timing data from 14 participants and incorporate a user-adjustable assertiveness threshold to convert predictions into proceed or stop commands. In a study with 15 participants without motor impairments with the Obi feeding robot, WAFFLE performs statistically on par with or better than baseline methods across measures of feeling of control, robot understanding, and workload, and is preferred by the majority of participants for both individual and social dining. We further demonstrate WAFFLE's generalizability in a study with 2 participants with motor impairments in their home environments using a Kinova 7DOF robot. Our findings support WAFFLE's effectiveness in enabling natural, reactive bite timing that generalizes across users, robot hardware, robot positioning, feeding trajectories, foods, and both individual and social dining contexts.
Adaptive and Explainable AI Agents for Anomaly Detection in Critical IoT Infrastructure using LLM-Enhanced Contextual Reasoning
Ensuring that critical IoT systems function safely and smoothly depends heavily on detecting anomalies quickly. As more complex systems such as smart healthcare, energy grids, and industrial automation appear, the shortcomings of older detection methods become easier to see. Monitoring failures usually happen in dynamic, high-dimensional situations, especially when data is incomplete, messy, or constantly evolving. These limits point to the need for adaptive, intelligent systems that continually improve and reason. LLMs are now capable of significantly changing how context is understood and how semantic inference is done across all types of data. This proposal suggests using an LLM-supported contextual reasoning method along with XAI agents to improve how anomalies are found in critical IoT environments. To discover hidden patterns and notice inconsistencies in data streams, it uses attention methods, avoids processing details from every time step, and uses semantic memory buffers. Because no code AI stresses transparency and interpretability, people can check and accept the AI's decisions, helping ensure AI follows company policies. The two architectures are evaluated in a test that compares the results of the traditional model with those of the suggested LLM-enhanced model. The key measures are detection accuracy, how much inaccurate information is included in the results, how clearly the findings can be read, and how fast the system responds under different test situations. The metaheuristic is tested in simulations of real-world smart grid and healthcare contexts to check its adaptability and reliability. The study shows that the new approach performs much better than most existing models in both accuracy and interpretation, so it could be a good fit for future anomaly detection tasks in IoT.
Bridging the Gap Between Multimodal Foundation Models and World Models
Humans understand the world through the integration of multiple sensory modalities, enabling them to perceive, reason about, and imagine dynamic physical processes. Inspired by this capability, multimodal foundation models (MFMs) have emerged as powerful tools for multimodal understanding and generation. However, today's MFMs fall short of serving as effective world models. They lack essential abilities such as performing counterfactual reasoning, simulating dynamics, understanding spatiotemporal information, controlling generated visual outcomes, and performing multifaceted reasoning. We investigate what it takes to bridge the gap between multimodal foundation models and world models. We begin by improving the reasoning capabilities of MFMs through discriminative tasks and equipping MFMs with structured reasoning skills, such as causal inference, counterfactual thinking, and spatiotemporal reasoning, enabling them to go beyond surface correlations and understand deeper relationships within visual and textual data. Next, we explore the generative capabilities of multimodal foundation models across both image and video modalities, introducing new frameworks for structured and controllable generation. Our approaches incorporate scene graphs, multimodal conditioning, and multimodal alignment strategies to guide the generation process, ensuring consistency with high-level semantics and fine-grained user intent. We further extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.
GAS-MIL: Group-Aggregative Selection Multi-Instance Learning for Ensemble of Foundation Models in Digital Pathology Image Analysis
Quan, Peiran, Gu, Zifan, Zhao, Zhuo, Zhou, Qin, Yang, Donghan M., Rong, Ruichen, Xie, Yang, Xiao, Guanghua
Foundation models (FMs) have transformed computational pathology by providing powerful, general-purpose feature extractors. However, adapting and benchmarking individual FMs for specific diagnostic tasks is often time-consuming and resource-intensive, especially given their scale and diversity. To address this challenge, we introduce Group-Aggregative Selection Multi-Instance Learning (GAS-MIL), a flexible ensemble framework that seamlessly integrates features from multiple FMs, preserving their complementary strengths without requiring manual feature selection or extensive task-specific fine-tuning. Across classification tasks in three cancer datasets -- prostate (PANDA), ovarian (UBC-OCEAN), and breast (TCGA-BrCa) -- GAS-MIL consistently achieves superior or on-par performance relative to individual FMs and established MIL methods, demonstrating its robustness and generalizability. By enabling efficient integration of heterogeneous FMs, GAS-MIL streamlines model deployment for pathology and provides a scalable foundation for future multimodal and precision oncology applications.
Real-Time Threaded Houbara Detection and Segmentation for Wildlife Conservation using Mobile Platforms
Saoud, Lyes Saad, Lesobre, Loic, Sorato, Enrico, Hussain, Irfan
Real-time animal detection and segmentation in natural environments are vital for wildlife conservation, enabling non-invasive monitoring through remote camera streams. However, these tasks remain challenging due to limited computational resources and the cryptic appearance of many species. We propose a mobile-optimized two-stage deep learning framework that integrates a Threading Detection Model (TDM) to parallelize YOLOv10-based detection and MobileSAM-based segmentation. Unlike prior YOLO+SAM pipelines, our approach improves real-time performance by reducing latency through threading. YOLOv10 handles detection while MobileSAM performs lightweight segmentation, both executed concurrently for efficient resource use. On the cryptic Houbara Bustard, a conservation-priority species, our model achieves mAP50 of 0.9627, mAP75 of 0.7731, mAP95 of 0.7178, and a MobileSAM mIoU of 0.7421. YOLOv10 operates at 43.7 ms per frame, confirming real-time readiness. We introduce a curated Houbara dataset of 40,000 annotated images to support model training and evaluation across diverse conditions. The code and dataset used in this study are publicly available on GitHub at https://github.com/LyesSaadSaoud/mobile-houbara-detseg. For interactive demos and additional resources, visit https://lyessaadsaoud.github.io/LyesSaadSaoud-Threaded-YOLO-SAM-Houbara.
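The concurrent two-stage design can be sketched as a producer-consumer pipeline. This is a minimal illustration, not the authors' Threading Detection Model: `detect` and `segment` are hypothetical placeholders standing in for YOLOv10 and MobileSAM inference, and error handling in the segmentation thread is omitted.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List, Tuple

_STOP = object()  # sentinel telling the segmentation thread to finish


def run_pipeline(
    frames: Iterable[Any],
    detect: Callable[[Any], List[Any]],
    segment: Callable[[Any, List[Any]], Any],
) -> List[Tuple[int, List[Any], Any]]:
    """Run detection and segmentation in two threads joined by a bounded queue."""
    boxes_q: queue.Queue = queue.Queue(maxsize=4)  # bounded to cap memory use
    results: List[Tuple[int, List[Any], Any]] = []

    def detector() -> None:
        try:
            for idx, frame in enumerate(frames):
                boxes_q.put((idx, frame, detect(frame)))
        finally:
            boxes_q.put(_STOP)  # always release the consumer, even on error

    def segmenter() -> None:
        while True:
            item = boxes_q.get()
            if item is _STOP:
                break
            idx, frame, boxes = item
            results.append((idx, boxes, segment(frame, boxes)))

    threads = [threading.Thread(target=detector), threading.Thread(target=segmenter)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Placeholder stages: one dummy box per frame, and a "mask" that counts boxes.
out = run_pipeline(range(5), lambda f: [(0, 0, f, f)], lambda f, b: len(b))
print(out)
```

While the segmenter processes frame k, the detector can already work on frame k+1. In Python this overlap only helps when the inference calls release the global interpreter lock, as native inference runtimes typically do.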
Reasoning-based Anomaly Detection Framework: A Real-time, Scalable, and Automated Approach to Anomaly Detection Across Domains
Panwar, Anupam, Pal, Himadri, Chen, Jiali, Cho, Kyle, Jiang, Riddick, Zhao, Miao, Krishnamurthy, Rajiv
Detecting anomalies in large, distributed systems presents several challenges. The first challenge arises from the sheer volume of data that needs to be processed. Flagging anomalies in a high-throughput environment calls for careful consideration of both algorithm and system design. The second challenge comes from the heterogeneity of time-series datasets that leverage such a system in production. In practice, anomaly detection systems are rarely deployed for a single use case. Typically, there are several metrics to monitor, often across several domains (e.g., engineering, business, and operations). A one-size-fits-all approach rarely works, so these systems need to be fine-tuned for every application, which is often done manually. The third challenge comes from the fact that determining the root cause of anomalies in such settings is akin to finding a needle in a haystack. Identifying, in real time, a time-series dataset that is causally associated with the anomalous time-series data is a very difficult problem. In this paper, we describe a unified framework that addresses these challenges. The Reasoning-based Anomaly Detection Framework (RADF) is designed to perform real-time anomaly detection on very large datasets. This framework employs a novel technique (mSelect) that automates the process of algorithm selection and hyper-parameter tuning for each use case. Finally, it incorporates a post-detection capability that allows for faster triaging and root-cause determination. Our extensive experiments demonstrate that RADF, powered by mSelect, surpasses state-of-the-art anomaly detection models in AUC performance for 5 out of 9 public benchmarking datasets. RADF achieved an AUC of over 0.85 for 7 out of 9 datasets, a distinction unmatched by any other state-of-the-art model.
Pilot selection in the era of Virtual reality: algorithms for accurate and interpretable machine learning models
Ke, Luoma, Zhang, Guangpeng, He, Jibo, Li, Yajing, Li, Yan, Liu, Xufeng, Fang, Peng
With the rapid growth of the aviation industry, there is a need for a large number of flight crew. How to select the right pilots in a cost-efficient manner has become an important research question. In the current study, 23 pilots were recruited from China Eastern Airlines and 23 novices from the community of Tsinghua University. A novel approach incorporating machine learning and virtual reality technology was applied to distinguish features between these participants with different flight skills. Results indicate that SVM with the MIC feature selection method consistently achieved the highest prediction performance on all metrics, with an accuracy of 0.93, an AUC of 0.96, and an F1 of 0.93, outperforming four other classifier algorithms and two other feature selection methods. From the perspective of feature selection methods, the MIC method can select features with a nonlinear relationship to sampling labels, instead of a simple filter-out. Our new implementation of the SVM + MIC algorithm outperforms all existing pilot selection algorithms and perhaps provides the first implementation based on eye tracking and flight dynamics data. This study's VR simulation platforms and algorithms can be used for pilot selection and training.