Performance Analysis
SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
Jing, Hongyi, Chen, Jiafu, Rao, Chen, Dang, Ziqiang, Teng, Jiajie, Chu, Tianyi, Mo, Juncheng, Fang, Shuo, Lin, Huaizhong, Lv, Rui, Ma, Chenguang, Zhao, Lei
The existing Multimodal Large Language Models (MLLMs) for GUI perception have made great progress. However, the following challenges still exist in prior methods: 1) They model discrete coordinates based on text autoregressive mechanism, which results in lower grounding accuracy and slower inference speed. 2) They can only locate predefined sets of elements and are not capable of parsing the entire interface, which hampers the broad application and support for downstream tasks. To address the above issues, we propose SparkUI-Parser, a novel end-to-end framework where higher localization precision and fine-grained parsing capability of the entire interface are simultaneously achieved. Specifically, instead of using probability-based discrete modeling, we perform continuous modeling of coordinates based on a pre-trained Multimodal Large Language Model (MLLM) with an additional token router and coordinate decoder. This effectively mitigates the limitations inherent in the discrete output characteristics and the token-by-token generation process of MLLMs, consequently boosting both the accuracy and the inference speed. To further enhance robustness, a rejection mechanism based on a modified Hungarian matching algorithm is introduced, which empowers the model to identify and reject non-existent elements, thereby reducing false positives. Moreover, we present ScreenParse, a rigorously constructed benchmark to systematically assess structural perception capabilities of GUI models across diverse scenarios. Extensive experiments demonstrate that our approach consistently outperforms SOTA methods on ScreenSpot, ScreenSpot-v2, CAGUI-Grounding and ScreenParse benchmarks. The resources are available at https://github.com/antgroup/SparkUI-Parser.
Bootstrapping Reinforcement Learning with Sub-optimal Policies for Autonomous Driving
Zhang, Zhihao, Peng, Chengyang, Yurtsever, Ekim, Redmill, Keith A.
Automated vehicle control using reinforcement learning (RL) has attracted significant attention due to its potential to learn driving policies through environment interaction. However, RL agents often face training challenges in sample efficiency and effective exploration, making it difficult to discover an optimal driving strategy. To address these issues, we propose guiding the RL driving agent with a demonstration policy that need not be a highly optimized or expert-level controller. Specifically, we integrate a rule-based lane change controller with the Soft Actor Critic (SAC) algorithm to enhance exploration and learning efficiency. Our approach demonstrates improved driving performance and can be extended to other driving scenarios that can similarly benefit from demonstration-based guidance.
Why Language Models Hallucinate
Kalai, Adam Tauman, Nachum, Ofir, Vempala, Santosh S., Zhang, Edwin
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
i-Mask: An Intelligent Mask for Breath-Driven Activity Recognition
Sinha, Ashutosh Kumar, Patel, Ayush, Dudhat, Mitul, Anand, Pritam, Mishra, Rahul
Human activity recognition (HAR) has gained significant attention due to its applications in health monitoring, intelligent environments, and human-computer interaction Hussain, Khan, Khan, Bhatt, Farouk, Bhola, and Baik (2024); Mishra, Gupta, Gupta, and Dutta (2022). Traditional HAR approaches employed wearable inertial sensors, vision-based methods, and environmental sensors for HAR. However, each method has inherent limitations such as discomfort, privacy concerns, or complex deployment requirements Wang, Huang, Zhao, Zhu, Huang, and Wu (2024); Mishra and Gupta (2025). The human body engages with its environment in diverse ways, one of which is the interaction between the lungs and the external environment through the act of breathing via the nose. The breathing pattern encompasses plenty of useful information that can be processed to fetch different behaviours and health information Mongelli, Orani, Cambiaso, Vaccari, Paglialonga, Braido, and Catalano (2020); Zhang, Wang, and Li (2024). Moreover, the breathing patterns are influenced by metabolic and physiological factors, offering a non-invasive and unobtrusive means of HAR.
Q-SafeML: Safety Assessment of Quantum Machine Learning via Quantum Distance Metrics
Dunn, Oliver, Aslansefat, Koorosh, Papadopoulos, Yiannis
The rise of machine learning in safety-critical systems has paralleled advancements in quantum computing, leading to the emerging field of Quantum Machine Learning (QML). While safety monitoring has progressed in classical ML, existing methods are not directly applicable to QML due to fundamental differences in quantum computation. Given the novelty of QML, dedicated safety mechanisms remain underdeveloped. This paper introduces Q-SafeML, a safety monitoring approach for QML. The method builds on SafeML, a recent method that utilizes statistical distance measures to assess model accuracy and provide confidence in the reasoning of an algorithm. An adapted version of Q-SafeML incorporates quantum-centric distance measures, aligning with the probabilistic nature of QML outputs. This shift to a model-dependent, post-classification evaluation represents a key departure from classical SafeML, which is dataset-driven and classifier-agnostic. The distinction is motivated by the unique representational constraints of quantum systems, requiring distance metrics defined over quantum state spaces. Q-SafeML detects distances between operational and training data addressing the concept drifts in the context of QML. Experiments on QCNN and VQC Models show that this enables informed human oversight, enhancing system transparency and safety.
Hierarchical Section Matching Prediction (HSMP) BERT for Fine-Grained Extraction of Structured Data from Hebrew Free-Text Radiology Reports in Crohn's Disease
Badash, Zvi, Ben-Atya, Hadas, Gavrielov, Naama, Hazan, Liam, Focht, Gili, Cytter-Kuint, Ruth, Hagopian, Talar, Turner, Dan, Freiman, Moti
Extracting structured clinical information from radiology reports is challenging, especially in low-resource languages. This is pronounced in Crohn's disease, with sparsely represented multi-organ findings. We developed Hierarchical Structured Matching Prediction BERT (HSMP-BERT), a prompt-based model for extraction from Hebrew radiology text. In an administrative database study, we analyzed 9,683 reports from Crohn's patients imaged 2010-2023 across Israeli providers. A subset of 512 reports was radiologist-annotated for findings across six gastrointestinal organs and 15 pathologies, yielding 90 structured labels per subject. Multilabel-stratified split (66% train+validation; 33% test), preserving label prevalence. Performance was evaluated with accuracy, F1, Cohen's $ฮบ$, AUC, PPV, NPV, and recall. On 24 organ-finding combinations with $>$15 positives, HSMP-BERT achieved mean F1 0.83$\pm$0.08 and $ฮบ$ 0.65$\pm$0.17, outperforming the SMP zero-shot baseline (F1 0.49$\pm$0.07, $ฮบ$ 0.06$\pm$0.07) and standard fine-tuning (F1 0.30$\pm$0.27, $ฮบ$ 0.27$\pm$0.34; paired t-test $p < 10^{-7}$). Hierarchical inference cuts runtime 5.1$\times$ vs. traditional inference. Applied to all reports, it revealed associations among ileal wall thickening, stenosis, and pre-stenotic dilatation, plus age- and sex-specific trends in inflammatory findings. HSMP-BERT offers a scalable solution for structured extraction in radiology, enabling population-level analysis of Crohn's disease and demonstrating AI's potential in low-resource settings.
Combine Virtual Reality and Machine-Learning to Identify the Presence of Dyslexia: A Cross-Linguistic Approach
Materazzini, Michele, Morciano, Gianluca, Alcalde-Llergo, Jose Manuel, Yeguas-Bolivar, Enrique, Calabro, Giuseppe, Zingoni, Andrea, Taborri, Juri
This study explores the use of virtual reality (VR) and artificial intelligence (AI) to predict the presence of dyslexia in Italian and Spanish university students. In particular, the research investigates whether VR-derived data from Silent Reading (SR) tests and self-esteem assessments can differentiate between students that are affected by dyslexia and students that are not, employing machine learning (ML) algorithms. Participants completed VR-based tasks measuring reading performance and self-esteem. A preliminary statistical analysis (t tests and Mann Whitney tests) on these data was performed, to compare the obtained scores between individuals with and without dyslexia, revealing significant differences in completion time for the SR test, but not in accuracy, nor in self esteem. Then, supervised ML models were trained and tested, demonstrating an ability to classify the presence/absence of dyslexia with an accuracy of 87.5 per cent for Italian, 66.6 per cent for Spanish, and 75.0 per cent for the pooled group. These findings suggest that VR and ML can effectively be used as supporting tools for assessing dyslexia, particularly by capturing differences in task completion speed, but language-specific factors may influence classification accuracy.
Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?
Nan, Yang, He, Pengfei, Tandon, Ravi, Xu, Han
Large language models (LLMs) have delivered significant breakthroughs across diverse domains but can still produce unreliable or misleading outputs, posing critical challenges for real-world applications. While many recent studies focus on quantifying model uncertainty, relatively little work has been devoted to \textit{diagnosing the source of uncertainty}. In this study, we show that, when an LLM is uncertain, the patterns of disagreement among its multiple generated responses contain rich clues about the underlying cause of uncertainty. To illustrate this point, we collect multiple responses from a target LLM and employ an auxiliary LLM to analyze their patterns of disagreement. The auxiliary model is tasked to reason about the likely source of uncertainty, such as whether it stems from ambiguity in the input question, a lack of relevant knowledge, or both. In cases involving knowledge gaps, the auxiliary model also identifies the specific missing facts or concepts contributing to the uncertainty. In our experiment, we validate our framework on AmbigQA, OpenBookQA, and MMLU-Pro, confirming its generality in diagnosing distinct uncertainty sources. Such diagnosis shows the potential for relevant manual interventions that improve LLM performance and reliability.
CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection
Chen, Yihan, Chen, Jiawei, Mo, Guozhao, Chen, Xuanang, He, Ben, Han, Xianpei, Sun, Le
The growing integration of large language models (LLMs) into the peer review process presents potential risks to the fairness and reliability of scholarly evaluation. While LLMs offer valuable assistance for reviewers with language refinement, there is growing concern over their use to generate substantive review content. Existing general AI-generated text detectors are vulnerable to paraphrasing attacks and struggle to distinguish between surface language refinement and substantial content generation, suggesting that they primarily rely on stylistic cues. When applied to peer review, this limitation can result in unfairly suspecting reviews with permissible AI-assisted language enhancement, while failing to catch deceptively humanized AI-generated reviews. To address this, we propose a paradigm shift from style-based to content-based detection. Specifically, we introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews, covering six distinct modes of human-AI collaboration. Furthermore, we develop CoCoDet, an AI review detector via a multi-task learning framework, designed to achieve more accurate and robust detection of AI involvement in review content. Our work offers a practical foundation for evaluating the use of LLMs in peer review, and contributes to the development of more precise, equitable, and reliable detection methods for real-world scholarly applications.
MHSNet:An MoE-based Hierarchical Semantic Representation Network for Accurate Duplicate Resume Detection with Large Language Model
Li, Yu, Chen, Zulong, Xu, Wenjian, Wen, Hong, Yu, Yipeng, Yiu, Man Lung, Yin, Yuyu
To maintain the company's talent pool, recruiters need to continuously search for resumes from third-party websites (e.g., LinkedIn, Indeed). However, fetched resumes are often incomplete and inaccurate. To improve the quality of third-party resumes and enrich the company's talent pool, it is essential to conduct duplication detection between the fetched resumes and those already in the company's talent pool. Such duplication detection is challenging due to the semantic complexity, structural heterogeneity, and information incompleteness of resume texts. To this end, we propose MHSNet, an multi-level identity verification framework that fine-tunes BGE-M3 using contrastive learning. With the fine-tuned , Mixture-of-Experts (MoE) generates multi-level sparse and dense representations for resumes, enabling the computation of corresponding multi-level semantic similarities. Moreover, the state-aware Mixture-of-Experts (MoE) is employed in MHSNet to handle diverse incomplete resumes. Experimental results verify the effectiveness of MHSNet