RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies

Yakefu, Adina, Xie, Bin, Xu, Chongyang, Zhang, Enwen, Zhou, Erjin, Jia, Fan, Yang, Haitao, Fan, Haoqiang, Zhang, Haowei, Peng, Hongyang, Tan, Jing, Huang, Junwen, Liu, Kai, Liu, Kaixin, Gu, Kefan, Zhang, Qinglun, Zhang, Ruitao, Huang, Saike, Cheng, Shen, Liu, Shuaicheng, Wang, Tiancai, Wang, Tiezhen, Sun, Wei, Tang, Wenbin, Wei, Yajun, Chen, Yang, Gui, Youqiang, Zhao, Yucheng, Ma, Yunchao, Wei, Yunfei, Yang, Yunhuan, Guo, Yutong, Chen, Ze, Du, Zhengyuan, Zhang, Ziheng, Liu, Ziming, Yan, Ziwei

arXiv.org Artificial Intelligence

Testing on real machines is indispensable for robotic control algorithms. For learning-based algorithms, especially VLA models, the demand for large-scale evaluation, i.e., testing a large number of models on a large number of tasks, is becoming increasingly urgent. However, doing this well is highly non-trivial, especially when scalability and reproducibility are taken into account. In this report, we describe our methodology for constructing RoboChallenge, an online evaluation system for testing robotic control algorithms, and our survey of recent state-of-the-art VLA models using our initial benchmark, Table30.


Accessible, Realistic, and Fair Evaluation of Positive-Unlabeled Learning Algorithms

Wang, Wei, Wu, Dong-Dong, Li, Ming, Zhang, Jingxiong, Niu, Gang, Sugiyama, Masashi

arXiv.org Artificial Intelligence

Positive-unlabeled (PU) learning is a weakly supervised binary classification problem, in which the goal is to learn a binary classifier from only positive and unlabeled data, without access to negative data. In recent years, many PU learning algorithms have been developed to improve model performance. However, experimental settings are highly inconsistent, making it difficult to identify which algorithm performs better. In this paper, we propose the first PU learning benchmark to systematically compare PU learning algorithms. During our implementation, we identify subtle yet critical factors that affect the realistic and fair evaluation of PU learning algorithms. On the one hand, many PU learning algorithms rely on a validation set that includes negative data for model selection. This is unrealistic in traditional PU learning settings, where no negative data are available. To handle this problem, we systematically investigate model selection criteria for PU learning. On the other hand, the problem settings and solutions of PU learning fall into different families, i.e., the one-sample and two-sample settings. However, existing evaluation protocols are heavily biased towards the one-sample setting and neglect the significant difference between the two. We identify the internal label shift problem of unlabeled training data in the one-sample setting and propose a simple yet effective calibration approach to ensure fair comparisons within and across families. We hope our framework will provide an accessible, realistic, and fair environment for evaluating PU learning algorithms in the future.
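A widely used baseline in the one-sample setting is the non-negative PU (nnPU) risk estimator. The sketch below is a minimal NumPy illustration (not the benchmark's code), assuming the class prior `pi_p` is known and using the sigmoid loss:

```python
import numpy as np

def sigmoid_loss(scores, y):
    # l(z, y) = sigmoid(-y * z): small when sign(z) matches label y
    return 1.0 / (1.0 + np.exp(y * scores))

def nnpu_risk(scores_p, scores_u, pi_p):
    """Non-negative PU risk estimate.

    scores_p: classifier scores on labeled positive data
    scores_u: classifier scores on unlabeled data
    pi_p:     class prior P(y = +1), assumed known
    """
    r_p_pos = sigmoid_loss(scores_p, +1).mean()  # positives labeled +1
    r_p_neg = sigmoid_loss(scores_p, -1).mean()  # positives treated as -1
    r_u_neg = sigmoid_loss(scores_u, -1).mean()  # unlabeled treated as -1
    # The estimated negative risk is clipped at zero, which is what
    # distinguishes nnPU from the unbiased (uPU) estimator.
    return pi_p * r_p_pos + max(0.0, r_u_neg - pi_p * r_p_neg)
```

A classifier that scores positives high and unlabeled (mostly negative) data low yields a near-zero risk, while a reversed classifier yields a large one.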


Prediction of Hospital Associated Infections During Continuous Hospital Stays

Datta, Rituparna, Kamruzzaman, Methun, Klein, Eili Y., Madden, Gregory R, Deng, Xinwei, Vullikanti, Anil, Bhattacharya, Parantapa

arXiv.org Artificial Intelligence

The US Centers for Disease Control and Prevention (CDC), in 2019, designated Methicillin-resistant Staphylococcus aureus (MRSA) as a serious antimicrobial resistance threat. The risk of acquiring MRSA and suffering life-threatening consequences remains especially high for hospitalized patients due to a unique combination of factors, including co-morbid conditions, immunosuppression, antibiotic use, and the risk of contact with contaminated hospital workers and equipment. In this paper, we present a novel generative probabilistic model, GenHAI, for modeling sequences of MRSA test outcomes for patients during a single hospitalization. This model can be used to answer many questions that matter to hospital administrators seeking to mitigate the risk of MRSA infections. Our model is based on the probabilistic programming paradigm and can be used to approximately answer a variety of predictive, causal, and counterfactual questions. We demonstrate the efficacy of our model by comparing it against discriminative and generative machine learning models on two real-world datasets.
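As a toy illustration of the generative idea (this is not GenHAI, and the transition probabilities below are invented), even a two-state Markov chain over per-stay test outcomes supports simple predictive queries by Monte Carlo sampling:

```python
import random

# Toy two-state Markov model of a patient's MRSA test sequence during one
# hospitalization. Probabilities are made up for illustration only.
TRANS = {"neg": {"neg": 0.95, "pos": 0.05},
         "pos": {"neg": 0.30, "pos": 0.70}}

def sample_stay(n_tests, rng, start="neg"):
    """Sample one patient's sequence of test outcomes."""
    seq, state = [], start
    for _ in range(n_tests):
        state = "pos" if rng.random() < TRANS[state]["pos"] else "neg"
        seq.append(state)
    return seq

def prob_any_positive(n_tests, n_samples=2000, seed=1):
    """Monte Carlo estimate of P(at least one positive test in the stay)."""
    rng = random.Random(seed)
    hits = sum("pos" in sample_stay(n_tests, rng) for _ in range(n_samples))
    return hits / n_samples
```

A full probabilistic program would additionally condition on covariates and interventions, which is what enables the causal and counterfactual queries described above.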


Tree-of-Reasoning: Towards Complex Medical Diagnosis via Multi-Agent Reasoning with Evidence Tree

Peng, Qi, Cui, Jialin, Xie, Jiayuan, Cai, Yi, Li, Qing

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown great potential in the medical domain. However, existing models still fall short on complex medical diagnosis tasks in the real world, mainly because they lack sufficient reasoning depth: when processing large amounts of specialized medical data, they suffer information loss or logical jumps, which leads to diagnostic errors. To address these challenges, we propose Tree-of-Reasoning (ToR), a novel multi-agent framework designed to handle complex scenarios. Specifically, ToR introduces a tree structure that clearly records the reasoning path of LLMs and the corresponding clinical evidence. At the same time, we propose a cross-validation mechanism to ensure the consistency of multi-agent decision-making, thereby improving clinical reasoning ability in complex medical scenarios. Experimental results on real-world medical data show that our framework achieves better performance than existing baseline methods.
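A hypothetical sketch of such an evidence tree (names and fields are illustrative, not ToR's actual data structures) might record hypotheses with supporting evidence and enumerate root-to-leaf reasoning paths for cross-checking:

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningNode:
    hypothesis: str
    evidence: list = field(default_factory=list)   # supporting clinical findings
    children: list = field(default_factory=list)   # refined sub-hypotheses

    def add_child(self, hypothesis, evidence=()):
        child = ReasoningNode(hypothesis, list(evidence))
        self.children.append(child)
        return child

    def paths(self):
        """Enumerate root-to-leaf reasoning paths for review or validation."""
        if not self.children:
            return [[self.hypothesis]]
        return [[self.hypothesis] + p for c in self.children for p in c.paths()]
```

Recording each branch explicitly is what lets a second agent audit a diagnosis path step by step rather than re-deriving it from scratch.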


Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering

Byun, Grace, Lee, Shinsun, Choi, Nayoung, Choi, Jinho D.

arXiv.org Artificial Intelligence

Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG, achieving 79.3 to 91.9 percent win rates across correctness, richness, and helpfulness in LLM-based evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.
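The routing idea can be sketched as follows. The keyword filter here is only a placeholder (the paper's filtering mechanism is more sophisticated); it illustrates the principle that prompts flagged as sensitive stay with the local generator while safe ones may go to an external LLM:

```python
import re

# Illustrative-only pattern list; a real deployment would use a trained
# safety classifier rather than keywords.
CONFIDENTIAL = re.compile(r"\b(internal|proprietary|confidential|prototype)\b",
                          re.IGNORECASE)

def route(prompt: str) -> str:
    """Return which generator may see this prompt: 'local' or 'external'."""
    return "local" if CONFIDENTIAL.search(prompt) else "external"
```

The asymmetry is deliberate: a false positive merely costs some answer quality (local model only), whereas a false negative would leak proprietary text to an external provider.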


DiaLLMs: EHR Enhanced Clinical Conversational System for Clinical Test Recommendation and Diagnosis Prediction

Ren, Weijieying, Zhao, Tianxiang, Wang, Lei, Wang, Tianchun, Honavar, Vasant

arXiv.org Artificial Intelligence

Recent advances in Large Language Models (LLMs) have led to remarkable progress in medical consultation. However, existing medical LLMs overlook the essential role of Electronic Health Records (EHR) and focus primarily on diagnosis recommendation, limiting their clinical applicability. We propose DiaLLM, the first medical LLM that integrates heterogeneous EHR data into clinically grounded dialogues, enabling clinical test recommendation, result interpretation, and diagnosis prediction to better align with real-world medical practice. To construct clinically grounded dialogues from EHR, we design a Clinical Test Reference (CTR) strategy that maps each clinical code to its corresponding description and classifies test results as "normal" or "abnormal". Additionally, DiaLLM employs a reinforcement learning framework for evidence acquisition and automated diagnosis. To handle the large action space, we introduce a rejection sampling strategy to reduce redundancy and improve exploration efficiency. Furthermore, a confirmation reward and a class-sensitive diagnosis reward are designed to guide accurate diagnosis prediction. Extensive experimental results demonstrate that DiaLLM outperforms baselines in clinical test recommendation and diagnosis prediction.
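The redundancy-rejection idea can be illustrated with a generic rejection-style sampler (a sketch, not DiaLLM's implementation; the test names are hypothetical):

```python
import random

def sample_tests(candidates, already_ordered, k, seed=0):
    """Propose k clinical tests, rejecting redundant proposals.

    candidates:      full action space of orderable tests
    already_ordered: tests already in the patient's record
    k:               number of new tests to propose
    """
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < k:
        t = rng.choice(candidates)
        if t in already_ordered or t in chosen:
            continue  # reject: the test would add no new evidence
        chosen.append(t)
    return chosen
```

Rejecting duplicates at sampling time keeps the effective action space small without having to re-enumerate it after every dialogue turn.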


Converting Annotated Clinical Cases into Structured Case Report Forms

Ferrazzi, Pietro, Lavelli, Alberto, Magnini, Bernardo

arXiv.org Artificial Intelligence

Case Report Forms (CRFs) are widely used in medical research as they ensure accuracy, reliability, and validity of results in clinical studies. However, publicly available, well-annotated CRF datasets are scarce, limiting the development of CRF slot filling systems able to fill in a CRF from clinical notes. To mitigate this scarcity, we propose to take advantage of available datasets annotated for information extraction tasks and to convert them into structured CRFs. We present a semi-automatic conversion methodology, which has been applied to the E3C dataset in two languages (English and Italian), resulting in a new, high-quality dataset for CRF slot filling. Through several experiments on the created dataset, we report that slot filling achieves 59.7% for Italian and 67.3% for English with a closed-source Large Language Model (zero-shot), and worse performance on three families of open-source models, showing that filling CRFs is challenging even for recent state-of-the-art LLMs. We release the dataset at https://huggingface.co/collections/NLP-FBK/e3c-to-crf-67b9844065460cbe42f80166
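The core of the conversion, turning span-level annotations into a structured form, can be sketched as below. The slot labels and the keep-first-mention policy are illustrative assumptions, not the paper's full E3C-to-CRF mapping:

```python
def fill_crf(annotations, crf_slots):
    """Map IE-style annotations (text, label) into a CRF dictionary.

    annotations: list of (span_text, label) pairs from an annotated note
    crf_slots:   the slots the target CRF expects
    """
    crf = {slot: None for slot in crf_slots}
    for text, label in annotations:
        # Keep the first mention per slot; later mentions are ignored here,
        # whereas a real pipeline would need a disambiguation step.
        if label in crf and crf[label] is None:
            crf[label] = text
    return crf
```

Slots with no matching annotation stay empty, which is itself useful signal: it marks exactly where manual review of the converted form is needed.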


TestAgent: An Adaptive and Intelligent Expert for Human Assessment

Yu, Junhao, Zhuang, Yan, Sun, YuXuan, Gao, Weibo, Liu, Qi, Cheng, Mingyue, Huang, Zhenya, Chen, Enhong

arXiv.org Artificial Intelligence

Accurately assessing internal human states is key to understanding preferences, offering personalized services, and identifying challenges in real-world applications. Originating from psychometrics, adaptive testing has become the mainstream method for human measurement and has now been widely applied in education, healthcare, sports, and sociology. It customizes assessments by selecting the fewest test questions necessary. However, current adaptive testing methods face several challenges. The mechanized nature of most algorithms leads to guessing behavior and difficulties with open-ended questions. Additionally, subjective assessments suffer from noisy response data and coarse-grained test outputs, further limiting their effectiveness. To move closer to an ideal adaptive testing process, we propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement. This is the first application of LLMs in adaptive testing. TestAgent supports personalized question selection, captures test-takers' responses and anomalies, and provides precise outcomes through dynamic, conversational interactions. Experiments on psychological, educational, and lifestyle assessments show our approach achieves more accurate results with 20% fewer questions than state-of-the-art baselines, and testers preferred it for speed, smoothness, and other dimensions.
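Classic adaptive testing selects the next item by maximizing Fisher information under an item response theory model. The sketch below shows the standard two-parameter logistic (2PL) version from psychometrics, the mechanized baseline the paper contrasts with, not TestAgent's LLM-driven selection:

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta
    (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    """Fisher information of one item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta, items):
    """Pick the candidate item that is most informative at the
    current ability estimate."""
    return max(items, key=lambda it: fisher_info(theta, it["a"], it["b"]))
```

Information peaks where difficulty matches ability (p = 0.5), which is why such selectors keep asking questions near the test-taker's estimated level, and also why they struggle with open-ended or guessed responses.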


Did faulty drug tests taint parole hearings? California is reviewing hundreds of denials

Los Angeles Times

The California Department of Corrections and Rehabilitation is reviewing hundreds of state parole hearings to see if any inmates who were denied parole were rejected because of faulty drug tests. Nearly 6,000 drug tests in California prisons are believed to have yielded false positives between April and July last year, and attorneys for the Board of Parole are now conducting a review of inmate files to determine if any of them need to appear before the parole board again to be reconsidered, according to officials with CDCR. If any inmates were denied parole because of the faulty tests, they could be owed a new hearing before the parole board, said attorneys representing inmates affected by the defective drug tests. The review is already underway and will determine if "without the positive drug screening, there is sufficient evidence to support an incarcerated person's denial of parole," said CDCR spokesperson Emily Humpal in a statement. If there isn't enough evidence to support incarceration other than the drug test, a new hearing will be scheduled.


AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models

Perez, Miguel Angel Peñaloza, Orozco, Bruno Lopez, Soto, Jesus Tadeo Cruz, Hernandez, Michelle Bruno, Gonzalez, Miguel Angel Alvarado, Malagon, Sandra

arXiv.org Artificial Intelligence

Existing mathematical reasoning benchmarks are predominantly English-only or translation-based, which can introduce semantic drift and mask language-specific reasoning errors. To address this, we present AI4Math, a benchmark of 105 original university-level math problems natively authored in Spanish. The dataset spans seven advanced domains (Algebra, Calculus, Geometry, Probability, Number Theory, Combinatorics, and Logic), and each problem is accompanied by a step-by-step human solution. We evaluate six large language models (GPT-4o, GPT-4o mini, o3-mini, LLaMA 3.3 70B, DeepSeek R1 685B, and DeepSeek V3 685B) under four configurations: zero-shot and chain-of-thought, each in Spanish and English. The top models (o3-mini, DeepSeek R1 685B, DeepSeek V3 685B) achieve over 70% accuracy, whereas LLaMA 3.3 70B and GPT-4o mini remain below 40%. Most models show no significant performance drop between languages, with GPT-4o even performing better on Spanish problems in the zero-shot setting. Geometry, Combinatorics, and Probability questions remain persistently challenging for all models. These results highlight the need for native-language benchmarks and domain-specific evaluations to reveal reasoning failures not captured by standard metrics.