Guo, Zhicheng
Rashomon Sets for Prototypical-Part Networks: Editing Interpretable Models in Real-Time
Donnelly, Jon, Guo, Zhicheng, Barnett, Alina Jade, McTavish, Hayden, Chen, Chaofan, Rudin, Cynthia
Interpretability is critical for machine learning models in high-stakes settings because it allows users to verify the model's reasoning. In computer vision, prototypical part models (ProtoPNets) have become the dominant model type to meet this need. Users can easily identify flaws in ProtoPNets, but fixing problems in a ProtoPNet requires slow, difficult retraining that is not guaranteed to resolve the issue. This problem is called the "interaction bottleneck." We solve the interaction bottleneck for ProtoPNets by simultaneously finding many equally good ProtoPNets (i.e., a draw from a "Rashomon set"). We show that our framework - called Proto-RSet - quickly produces many accurate, diverse ProtoPNets, allowing users to correct problems in real time while maintaining performance guarantees with respect to the training set. We demonstrate the utility of this method in two settings: 1) removing synthetic bias introduced to a bird identification model and 2) debugging a skin cancer identification model. This tool empowers non-machine-learning experts, such as clinicians or domain experts, to quickly refine and correct machine learning models without repeated retraining by machine learning experts.
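Below is a minimal, hypothetical Python sketch of the user-facing idea described in this abstract: keep every candidate ProtoPNet whose training accuracy is within a tolerance of the best (a Rashomon set), then let a user discard models that rely on a flagged prototype. The candidate models, accuracies, and prototype names are made up for illustration; this is not the Proto-RSet implementation.

```python
# Illustrative sketch (not the authors' code): select models from a Rashomon
# set and filter them against user feedback in real time.
from dataclasses import dataclass

@dataclass
class CandidateModel:
    name: str
    accuracy: float      # training-set accuracy of this candidate
    prototype_ids: set   # prototypes (image parts) the model relies on

def rashomon_set(candidates, epsilon):
    """Keep every model within `epsilon` of the best training accuracy."""
    best = max(m.accuracy for m in candidates)
    return [m for m in candidates if m.accuracy >= best - epsilon]

def remove_prototype(models, bad_prototype):
    """User feedback: drop models that depend on a flagged prototype."""
    return [m for m in models if bad_prototype not in m.prototype_ids]

# Hypothetical candidates; "background_text" stands for a spurious prototype.
candidates = [
    CandidateModel("A", 0.91, {"beak", "wing", "background_text"}),
    CandidateModel("B", 0.90, {"beak", "tail"}),
    CandidateModel("C", 0.84, {"wing", "tail"}),
]
good_models = rashomon_set(candidates, epsilon=0.02)          # A and B survive
debugged = remove_prototype(good_models, "background_text")   # only B remains
print([m.name for m in debugged])
```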
StepTool: A Step-grained Reinforcement Learning Framework for Tool Learning in LLMs
Yu, Yuanqing, Wang, Zhefan, Ma, Weizhi, Guo, Zhicheng, Zhan, Jingtao, Wang, Shuai, Wu, Chuhan, Guo, Zhiqiang, Zhang, Min
Despite their powerful reasoning and inference capabilities, Large Language Models (LLMs) still need external tools for real-time information retrieval or domain-specific expertise when solving complex tasks, a paradigm referred to as tool learning. Existing tool learning methods primarily rely on tuning with expert trajectories, focusing on token-sequence learning from a linguistic perspective. This raises two challenges: 1) imitating static trajectories limits the model's ability to generalize to new tasks, and 2) even expert trajectories can be suboptimal, and better solution paths may exist. In this work, we introduce StepTool, a novel step-grained reinforcement learning framework for improving tool learning in LLMs. It consists of two components: Step-grained Reward Shaping, which assigns a reward at each tool interaction based on the success of the tool invocation and its contribution to the task, and Step-grained Optimization, which uses policy gradient methods to optimize the model in a multi-step manner. Experimental results demonstrate that StepTool significantly outperforms existing methods on multi-step, tool-based tasks, providing a robust solution for complex task environments. Code is available at https://github.com/yuyq18/StepTool.
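A hedged sketch of the step-grained reward idea described above: each tool interaction receives a reward combining invocation success and task contribution, and return-to-go values would then weight that step's log-probabilities in a policy-gradient update. The reward weights, discount factor, and example trajectory are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of step-grained reward shaping for a multi-step tool trajectory.
def step_reward(call_succeeded: bool, contribution: float,
                w_success: float = 0.5, w_contrib: float = 0.5) -> float:
    """Reward for one tool-interaction step (assumed weighting)."""
    return w_success * (1.0 if call_succeeded else 0.0) + w_contrib * contribution

def discounted_returns(rewards, gamma=0.95):
    """Return-to-go for each step, used to weight that step's log-probability."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Hypothetical 3-step trajectory: (tool call succeeded?, contribution to task)
trajectory = [(True, 0.2), (False, 0.0), (True, 0.9)]
rewards = [step_reward(ok, c) for ok, c in trajectory]
returns = discounted_returns(rewards)
# A policy-gradient loss would then be: -sum(returns[t] * log_prob[t]) over steps.
print(rewards, returns)
```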
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
Guo, Zhicheng, Cheng, Sijie, Wang, Hao, Liang, Shihao, Qin, Yujia, Li, Peng, Liu, Zhiyuan, Sun, Maosong, Liu, Yang
Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools requires large-scale and stable benchmarks. However, previous works relied either on hand-crafted online tools of limited scale or on large-scale real online APIs that suffer from unstable API status. To address this problem, we introduce StableToolBench, a benchmark evolved from ToolBench that provides a virtual API server and a stable evaluation system. The virtual API server combines a caching system with API simulators; the two are complementary in alleviating changes in API status. Meanwhile, the stable evaluation system defines solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluation system.
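The caching-plus-simulation design can be illustrated with a small sketch: serve a recorded response when one exists, otherwise fall back to a simulator and cache its output so the benchmark stays stable over time. The class and the lambda simulator below are hypothetical stand-ins; in StableToolBench the simulator is LLM-based.

```python
# Illustrative "cache first, simulate on miss" virtual API server (a sketch,
# not the StableToolBench code).
import json

class VirtualAPIServer:
    def __init__(self, simulator, cache=None):
        self.cache = cache or {}    # maps request keys to cached responses
        self.simulator = simulator  # fallback when no recorded response exists

    @staticmethod
    def _key(api_name, params):
        return api_name + "|" + json.dumps(params, sort_keys=True)

    def call(self, api_name, params):
        key = self._key(api_name, params)
        if key in self.cache:                        # 1) serve a recorded response
            return self.cache[key]
        response = self.simulator(api_name, params)  # 2) simulate the API
        self.cache[key] = response                   # keep future runs stable
        return response

# Hypothetical simulator: in practice this would be an LLM prompted with the
# API documentation; here it just returns a canned answer.
server = VirtualAPIServer(simulator=lambda name, p: {"api": name, "result": "simulated"})
print(server.call("get_weather", {"city": "Beijing"}))
print(server.call("get_weather", {"city": "Beijing"}))  # served from cache
```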
SiamQuality: A ConvNet-Based Foundation Model for Imperfect Physiological Signals
Ding, Cheng, Guo, Zhicheng, Chen, Zhaoliang, Lee, Randall J, Rudin, Cynthia, Hu, Xiao
Foundation models, especially those using transformers as backbones, have gained significant popularity, particularly in language and language-vision tasks. However, large foundation models are typically trained on high-quality data, which poses a significant challenge given the prevalence of poor-quality real-world data. This challenge is more pronounced when developing foundation models for physiological data, which are often noisy, incomplete, or inconsistent. The present work aims to provide a toolset for developing foundation models on physiological data. We leverage a large dataset of photoplethysmography (PPG) signals from hospitalized intensive care patients. For these data, we propose SiamQuality, a novel self-supervised learning task that uses convolutional neural networks (CNNs) as the backbone and enforces similar representations for good- and poor-quality signals drawn from similar physiological states. We pre-trained SiamQuality on over 36 million 30-second PPG pairs, then fine-tuned and tested it on six downstream tasks using external datasets. The results demonstrate the superiority of the proposed approach on all downstream tasks, which are extremely important for heart monitoring on wearable devices. Our method indicates that CNNs can be an effective backbone for foundation models that are robust to training data quality.
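A minimal PyTorch sketch of the pairing idea: a shared 1-D CNN encoder embeds a good-quality segment and a poor-quality segment from the same physiological state, and the loss pulls the pair together. The architecture sizes, the cosine-based loss, and the synthetic "poor-quality" segments are illustrative assumptions, not the SiamQuality design.

```python
# Sketch of quality-paired self-supervised pre-training with a 1-D CNN encoder.
import torch
import torch.nn as nn

class PPGEncoder(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, x):  # x: (batch, 1, samples)
        return self.net(x)

def pair_similarity_loss(z_good, z_poor):
    """Encourage embeddings of paired good/poor-quality segments to agree."""
    return 1.0 - nn.functional.cosine_similarity(z_good, z_poor).mean()

encoder = PPGEncoder()
good = torch.randn(8, 1, 1200)                 # hypothetical 30-s segments
poor = good + 0.5 * torch.randn_like(good)     # noisy counterpart of the same state
loss = pair_similarity_loss(encoder(good), encoder(poor))
loss.backward()
```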
What is different between these datasets?
Babbar, Varun, Guo, Zhicheng, Rudin, Cynthia
The performance of machine learning models heavily depends on the quality of the input data, yet real-world applications often encounter various data-related challenges. One such challenge arises when curating training data or deploying a model in the real world: two comparable datasets in the same domain may have different distributions. While numerous techniques exist for detecting distribution shifts, the literature lacks comprehensive approaches for explaining dataset differences in a human-understandable manner. To address this gap, we propose a toolbox of interpretable methods for comparing two datasets. We demonstrate the versatility of our approach across diverse data modalities, including tabular data, language, images, and signals, in both low- and high-dimensional settings. Our methods not only outperform comparable and related approaches in terms of explanation quality and correctness, but also provide actionable, complementary insights for understanding and mitigating dataset differences effectively.
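One simple, hedged instance of this idea (not necessarily the paper's method): fit a shallow, interpretable classifier to distinguish samples of dataset A from dataset B, and read its splits as human-readable statements about where the two distributions differ. The two synthetic tabular datasets below are hypothetical.

```python
# Sketch: explain a dataset difference via an interpretable A-vs-B classifier.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Two hypothetical tabular datasets that differ mainly in the "age" feature.
A = np.column_stack([rng.normal(40, 5, 500), rng.normal(0, 1, 500)])
B = np.column_stack([rng.normal(55, 5, 500), rng.normal(0, 1, 500)])

X = np.vstack([A, B])
y = np.array([0] * len(A) + [1] * len(B))   # 0 = dataset A, 1 = dataset B

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
# The printed splits ("age <= ...") describe, in human-readable terms,
# where the two distributions disagree.
print(export_text(tree, feature_names=["age", "other_feature"]))
```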
Towards Unified Alignment Between Agents, Humans, and Environment
Yang, Zonghan, Liu, An, Liu, Zijun, Liu, Kaiming, Xiong, Fangzhou, Wang, Yile, Yang, Zeyuan, Hu, Qingyuan, Chen, Xinrui, Zhang, Zhenhe, Luo, Fuwen, Guo, Zhicheng, Li, Peng, Liu, Yang
The rapid progress of foundation models has led to the prosperity of autonomous agents, which leverage the universal capabilities of foundation models to conduct reasoning, decision-making, and environmental interaction. However, the efficacy of agents remains limited when operating in intricate, realistic environments. In this work, we introduce the principles of $\mathbf{U}$nified $\mathbf{A}$lignment for $\mathbf{A}$gents ($\mathbf{UA}^2$), which advocate for the simultaneous alignment of agents with human intentions, environmental dynamics, and self-constraints such as the limitation of monetary budgets. From the perspective of $\mathbf{UA}^2$, we review the current agent research and highlight the neglected factors in existing agent benchmarks and method candidates. We also conduct proof-of-concept studies by introducing realistic features to WebShop, including user profiles to demonstrate intentions, personalized reranking for complex environmental dynamics, and runtime cost statistics to reflect self-constraints. We then follow the principles of $\mathbf{UA}^2$ to propose an initial design of our agent, and benchmark its performance with several candidate baselines in the retrofitted WebShop. The extensive experimental results further prove the importance of the principles of $\mathbf{UA}^2$. Our research sheds light on the next steps of autonomous agent research with improved general problem-solving abilities.
Learned Kernels for Interpretable and Efficient Medical Time Series Processing
Chen, Sully F., Guo, Zhicheng, Ding, Cheng, Hu, Xiao, Rudin, Cynthia
Background: Signal processing methods are the foundation for clinical interpretation across a wide variety of medical applications. The advent of deep learning allowed for an explosion of new models that offered unprecedented performance, but at a cost: deep learning models are often compute-intensive and lack interpretability. Methods: We propose a sparse, interpretable architecture for medical time series processing. The method learns a set of lightweight, flexible kernels to construct a single-layer neural network, providing a new efficient, robust, and interpretable approach. We introduce novel parameter reduction techniques to further reduce the size of our network. We demonstrate the power of our architecture on the important task of photoplethysmography artifact detection, where our approach matches the performance of state-of-the-art deep neural networks with several orders of magnitude fewer parameters, allowing deep-neural-network-level performance to be integrated into extremely low-power wearable devices. Results: Our interpretable method achieves greater than 99\% of the performance of the state-of-the-art methods on the artifact detection task, and even outperforms the state-of-the-art on a challenging out-of-distribution test set, while using dramatically fewer parameters (2\% of the parameters of Segade, and about half of the parameters of Tiny-PPG). Conclusions: Learned kernels are competitive with deep neural networks for medical time series processing with dramatically fewer parameters. Our method is particularly suited for real-time applications and low-power devices, and it maintains interpretability.
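A hedged PyTorch sketch of the single-layer learned-kernel idea: a small set of learnable 1-D kernels is convolved with the signal, their pooled responses feed a linear read-out, and the learned kernels themselves can be plotted and inspected. The kernel count, kernel length, and pooling choices are assumptions, not the paper's exact architecture.

```python
# Sketch of a single-layer learned-kernel classifier for PPG artifact detection.
import torch
import torch.nn as nn

class LearnedKernelClassifier(nn.Module):
    def __init__(self, n_kernels=16, kernel_len=25, n_classes=2):
        super().__init__()
        # Each convolutional filter is one interpretable, learnable kernel.
        self.kernels = nn.Conv1d(1, n_kernels, kernel_size=kernel_len,
                                 padding=kernel_len // 2, bias=False)
        self.head = nn.Linear(2 * n_kernels, n_classes)  # max + mean pooled responses

    def forward(self, x):  # x: (batch, 1, samples)
        response = torch.relu(self.kernels(x))
        features = torch.cat([response.amax(dim=-1), response.mean(dim=-1)], dim=1)
        return self.head(features)

model = LearnedKernelClassifier()
segment = torch.randn(4, 1, 1250)   # hypothetical 10-second PPG at 125 Hz
logits = model(segment)             # artifact vs. clean logits per segment
print(logits.shape, sum(p.numel() for p in model.parameters()))  # tiny parameter count
```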
Reconsideration on evaluation of machine learning models in continuous monitoring using wearables
Ding, Cheng, Guo, Zhicheng, Rudin, Cynthia, Xiao, Ran, Nahab, Fadi B, Hu, Xiao
Wearable devices, especially those utilizing the photoplethysmography (PPG) signal, have demonstrated significant potential for providing real-time insights into an individual's health status. PPG, due to its non-invasive nature and ease of integration into wearable technology, has become a cornerstone of modern health monitoring systems [5]. Analyzing wearable device signals often involves ML models of varying complexity [6, 7]. In the model development phase, continuous signals are typically cut into discrete segments, and the model's performance is evaluated at the segment level using conventional metrics such as accuracy, sensitivity, specificity, and F1 score [8]. However, relying solely on these conventional segment-level metrics does not provide a holistic assessment: it hurts consumers, who cannot select the optimal solution for their needs, and innovators, whose efforts are not guided toward true progress. The complex nature of continuous health monitoring using wearable devices introduces unique challenges beyond the capabilities of conventional evaluation approaches, as illustrated in Figure 1. Recognizing these challenges is imperative for equipping continuous health monitoring applications with accurate and reliable ML models, ensuring a successful translation of these models into everyday use by millions of people and fulfilling the potential of this technology at scale. In the subsequent sections, we outline the challenges in evaluating ML models for continuous health monitoring using wearables, thoroughly review existing evaluation methods and metrics, and propose a standardized evaluation guideline.
Can Vision-Language Models Think from a First-Person Perspective?
Cheng, Sijie, Guo, Zhicheng, Wu, Jingwen, Fang, Kechen, Li, Peng, Liu, Huaping, Liu, Yang
Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.
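A hedged sketch of single-answer grading with an LLM judge, as described above: assemble a prompt containing the question, a reference answer, and the model's answer, then ask the judge for a score. The prompt wording and the 1-10 scale are assumptions, not EgoThink's exact template.

```python
# Sketch of building a single-answer grading prompt for an LLM judge.
def build_grading_prompt(question, reference, candidate):
    return (
        "You are grading an answer to a first-person visual question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {candidate}\n"
        "Rate the model answer from 1 to 10 and briefly justify the score."
    )

prompt = build_grading_prompt(
    question="What am I holding in my right hand?",
    reference="A coffee mug.",
    candidate="A white mug of coffee.",
)
# `prompt` would then be sent to GPT-4 (or another judge model) via its API.
print(prompt)
```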
SiamAF: Learning Shared Information from ECG and PPG Signals for Robust Atrial Fibrillation Detection
Guo, Zhicheng, Ding, Cheng, Do, Duc H., Shah, Amit, Lee, Randall J., Hu, Xiao, Rudin, Cynthia
Atrial fibrillation (AF) is the most common type of cardiac arrhythmia. It is associated with an increased risk of stroke, heart failure, and other cardiovascular complications, but can be clinically silent. Passive AF monitoring with wearables may help reduce adverse clinical outcomes related to AF. Detecting AF in noisy wearable data poses a significant challenge, leading to the emergence of various deep learning techniques. Previous deep learning models learn from a single modality, either electrocardiogram (ECG) or photoplethysmography (PPG) signals. However, deep learning models often struggle to learn generalizable features and rely on features that are more susceptible to corruption from noise, leading to sub-optimal performance in certain scenarios, especially with low-quality signals. Given the increasing availability of ECG and PPG signal pairs from wearables and bedside monitors, we propose a new approach, SiamAF, which leverages a novel Siamese network architecture and a joint learning loss function to learn shared information from both ECG and PPG signals. At inference time, the proposed model is able to predict AF from either PPG or ECG and outperforms baseline methods on three external test sets. It learns medically relevant features as a result of our novel architecture design. The proposed model also achieves performance comparable to traditional learning regimes while requiring far fewer training labels, providing a potential approach to reduce future reliance on manual labeling.
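A hedged PyTorch sketch of the shared-information idea: one encoder per modality feeds a shared classifier, and an agreement term pulls the ECG and PPG embeddings of the same recording together, so either modality alone can be used at inference time. The encoder sizes and loss weights are illustrative, not the authors' implementation.

```python
# Sketch of a two-encoder Siamese setup with a shared AF classifier.
import torch
import torch.nn as nn

def make_encoder(emb_dim=64):
    return nn.Sequential(
        nn.Conv1d(1, 32, kernel_size=9, padding=4), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, emb_dim),
    )

class SiameseAF(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.ecg_encoder = make_encoder(emb_dim)
        self.ppg_encoder = make_encoder(emb_dim)
        self.classifier = nn.Linear(emb_dim, 2)   # AF vs. non-AF, shared head

    def forward(self, ecg, ppg):
        z_ecg, z_ppg = self.ecg_encoder(ecg), self.ppg_encoder(ppg)
        return z_ecg, z_ppg, self.classifier(z_ecg), self.classifier(z_ppg)

model = SiameseAF()
ecg, ppg = torch.randn(8, 1, 2500), torch.randn(8, 1, 2500)   # paired segments
labels = torch.randint(0, 2, (8,))
z_ecg, z_ppg, logit_ecg, logit_ppg = model(ecg, ppg)
agreement = (1 - nn.functional.cosine_similarity(z_ecg, z_ppg)).mean()
loss = nn.functional.cross_entropy(logit_ecg, labels) \
     + nn.functional.cross_entropy(logit_ppg, labels) + 0.1 * agreement
loss.backward()
# At inference time, either encoder alone can feed the shared classifier.
```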