Shi, Yuanchun
MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding
Ye, Zhoutong, Sun, Mingze, Gao, Huan-ang, Yu, Chun, Shi, Yuanchun
Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, there remains a significant gap between state-of-the-art LMMs and human performance on complex tasks that require a combination of fundamental VL capabilities, as well as on tasks involving the grounding of complex instructions. To thoroughly investigate the human-LMM gap and its underlying causes, we propose MOAT, a diverse benchmark of complex real-world VL tasks that are challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating fundamental VL capabilities such as reading text, counting, understanding spatial relations, and grounding textual and visual instructions. All of these abilities fit into a taxonomy we propose comprising 10 fundamental VL capabilities, enabling MOAT to provide a fine-grained view of LMMs' strengths and weaknesses. In addition, MOAT is the first benchmark to explicitly evaluate LMMs' ability to ground complex text and visual instructions, which is essential to many real-world applications. We evaluated over 20 proprietary and open source LMMs, as well as humans, on MOAT, and found that humans achieved 82.7% accuracy while the best performing LMM (OpenAI o1) achieved only 38.8%. To guide future model development, we analyze common trends in our results and discuss the underlying causes of the observed performance gap between LMMs and humans, focusing on which VL capability forms the bottleneck in complex tasks, whether test-time scaling improves performance on MOAT, and how tiling harms LMMs' ability to count. Code and data are available at https://cambrian-yzt.github.io/MOAT.
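A minimal illustrative sketch (not MOAT's actual code or data format) of the per-capability scoring idea: each task is tagged with the fundamental VL capabilities it requires, and a model's accuracy is aggregated per capability. The records and capability names below are hypothetical.

```python
# Hypothetical sketch of per-capability scoring for a capability-tagged benchmark.
from collections import defaultdict

# Illustrative records only; the real benchmark format and taxonomy labels may differ.
results = [
    {"capabilities": ["text reading", "counting"], "correct": True},
    {"capabilities": ["spatial relations"], "correct": False},
    {"capabilities": ["counting", "visual instruction grounding"], "correct": False},
]

def per_capability_accuracy(records):
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        for cap in r["capabilities"]:
            totals[cap] += 1
            hits[cap] += int(r["correct"])
    return {cap: hits[cap] / totals[cap] for cap in totals}

print(per_capability_accuracy(results))
```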
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Wei, Tong, Yang, Yijun, Xing, Junliang, Shi, Yuanchun, Lu, Zongqing, Ye, Deheng
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates the problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse, characterized by a rapid loss of diversity in the agent's thoughts, state-irrelevant and incomplete reasoning, and subsequently invalid actions, resulting in negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent's reasoning at each RL step. The resulting GTR (Guided Thought Reinforcement) framework is simple and scalable, training reasoning and action jointly without the need for dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7B model across various visual environments, achieving 3-5 times higher task success rates than SoTA models despite notably smaller model sizes.
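A toy sketch of the process-guidance idea (not GTR's implementation): at each step, an automated corrector evaluates the agent's thought, refines it if it is state-irrelevant, and contributes a thought-level reward alongside the sparse outcome reward. The agent, corrector, and reward weighting below are stand-ins.

```python
# Minimal sketch of a guided-thought training step with a stand-in corrector.

def agent_generate(observation):
    """Stand-in policy: returns (thought, action) for the current observation."""
    thought = f"I see {observation}; I should pick the largest card."
    action = "pick card 9"
    return thought, action

def corrector(observation, thought):
    """Stand-in corrector (e.g. a stronger model): checks whether the thought is
    grounded in the observation and refines it if not."""
    relevant = str(observation) in thought
    refined = thought if relevant else f"Re-ground on {observation}, then decide."
    return relevant, refined

def gtr_style_step(observation, env_reward):
    thought, action = agent_generate(observation)
    ok, refined_thought = corrector(observation, thought)
    # Process guidance: a thought-level reward supplements the sparse outcome
    # reward, discouraging collapsed, state-irrelevant reasoning.
    thought_reward = 1.0 if ok else -1.0
    total_reward = env_reward + 0.5 * thought_reward  # weighting is an assumption
    return refined_thought, action, total_reward

print(gtr_style_step(observation="cards [3, 9, 1]", env_reward=0.0))
```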
BodyGen: Advancing Towards Efficient Embodiment Co-Design
Lu, Haofei, Wu, Zhe, Xing, Junliang, Li, Jianshu, Li, Ruoyu, Li, Zhe, Shi, Yuanchun
Embodiment co-design aims to optimize a robot's morphology and control policy simultaneously. While prior work has demonstrated its potential for generating environment-adaptive robots, this field still faces persistent challenges in optimization efficiency due to (i) the combinatorial nature of morphological search spaces and (ii) intricate dependencies between morphology and control. We identify ineffective morphology representation and unbalanced reward signals between the design and control stages as key obstacles to efficiency. To advance towards efficient embodiment co-design, we propose BodyGen, which utilizes (1) topology-aware self-attention for both design and control, enabling efficient morphology representation with lightweight model sizes, and (2) a temporal credit assignment mechanism that ensures balanced reward signals for optimization. With these components, BodyGen achieves an average 60.03% performance improvement over state-of-the-art baselines. Code and additional results are available at https://genesisorigin.github.io.
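One way to picture topology-aware self-attention over a morphology is to bias attention logits by the graph distance between limbs, so nearby body parts attend to each other more strongly. The bias form, scale, and toy morphology below are illustrative assumptions, not BodyGen's exact formulation.

```python
# Sketch: self-attention over per-limb features with a graph-distance bias.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topology_aware_attention(node_feats, graph_dist, dim=16, decay=0.5, seed=0):
    """node_feats: (N, F) per-limb features; graph_dist: (N, N) hop distances."""
    rng = np.random.default_rng(seed)
    n, f = node_feats.shape
    wq, wk, wv = (rng.standard_normal((f, dim)) / np.sqrt(f) for _ in range(3))
    q, k, v = node_feats @ wq, node_feats @ wk, node_feats @ wv
    logits = q @ k.T / np.sqrt(dim) - decay * graph_dist  # penalize distant limbs
    return softmax(logits) @ v

# Toy 4-limb chain morphology: torso - upper leg - lower leg - foot.
feats = np.random.default_rng(1).standard_normal((4, 8))
dist = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :]).astype(float)
print(topology_aware_attention(feats, dist).shape)  # (4, 16)
```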
Unknown Word Detection for English as a Second Language (ESL) Learners Using Gaze and Pre-trained Language Models
Ding, Jiexin, Zhao, Bowen, Wang, Yuntao, Liu, Xinyun, Hao, Rui, Chatterjee, Ishan, Shi, Yuanchun
English as a Second Language (ESL) learners often encounter unknown words that hinder their text comprehension. Automatically detecting these words as users read can enable computing systems to provide just-in-time definitions, synonyms, or contextual explanations, helping users learn vocabulary in a natural and seamless manner. This paper presents EyeLingo, a transformer-based machine learning method that predicts the probability of unknown words in real time and with high accuracy from text content and eye gaze trajectory. A 20-participant user study showed that our method achieves an accuracy of 97.6% and an F1-score of 71.1%. We implemented a real-time reading assistance prototype to demonstrate the effectiveness of EyeLingo. The user study shows improvements in willingness to use and perceived usefulness compared to baseline methods.
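A minimal sketch of the fusion idea, assuming per-word contextual embeddings from a pre-trained language model and a small vector of per-word gaze features (e.g. fixation count, dwell time). Dimensions, features, and the classification head are assumptions; the paper's EyeLingo model may differ.

```python
# Sketch: concatenate text embeddings with aligned gaze features and predict
# the probability that each word is unknown to the reader.
import torch
import torch.nn as nn

class GazeWordClassifier(nn.Module):
    def __init__(self, text_dim=768, gaze_dim=4, hidden=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + gaze_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, word_embeddings, gaze_features):
        # word_embeddings: (batch, words, text_dim) from a pre-trained LM
        # gaze_features:   (batch, words, gaze_dim) aligned to the same words
        x = torch.cat([word_embeddings, gaze_features], dim=-1)
        return torch.sigmoid(self.fuse(x)).squeeze(-1)  # unknown-word probability

model = GazeWordClassifier()
probs = model(torch.randn(2, 12, 768), torch.randn(2, 12, 4))
print(probs.shape)  # torch.Size([2, 12])
```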
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Li, Qixiu, Liang, Yaobo, Wang, Zeyu, Luo, Lin, Chen, Xi, Liao, Mozheng, Wei, Fangyun, Deng, Yu, Xu, Sicheng, Zhang, Yizhong, Wang, Xiaofan, Liu, Bei, Fu, Jianlong, Bao, Jianmin, Chen, Dong, Shi, Yuanchun, Yang, Jiaolong, Guo, Baining
The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language Models (VLMs) have demonstrated promising generalizability, their task performance is still unsatisfactory, as indicated by low task success rates across different environments. In this paper, we present a new advanced VLA architecture derived from VLMs. Unlike previous works that directly repurpose VLMs for action prediction through simple action quantization, we propose a componentized VLA architecture with a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement brought by diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. Evaluation on 5 robot embodiments in simulation and the real world shows that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA, which has a similar model size (7B) to ours, by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% in absolute success rate in simulation. Code and models can be found on our project page (https://cogact.github.io/).
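A highly simplified sketch of the componentized idea: an action module (here a small noise-prediction network in the spirit of a diffusion action head) is conditioned on a "cognition" feature produced by the VLM backbone. Shapes, the backbone stand-in, the noise schedule, and the training objective below are assumptions, not CogACT's implementation.

```python
# Sketch: conditional noise-prediction head for an action chunk, conditioned on
# a VLM output feature and a diffusion timestep.
import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    def __init__(self, action_dim=7, horizon=16, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + cond_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim * horizon),
        )
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, noisy_actions, timestep, vlm_feature):
        # Predict the noise added to the flattened action sequence, conditioned
        # on the diffusion timestep and the VLM "cognition" feature.
        x = torch.cat([noisy_actions.flatten(1), timestep[:, None], vlm_feature], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

# Toy denoising-style training objective: predict the injected noise.
head = DiffusionActionHead()
actions = torch.randn(4, 16, 7)     # ground-truth action chunk
vlm_feature = torch.randn(4, 512)   # stand-in for the VLM output token
noise = torch.randn_like(actions)
t = torch.rand(4)                   # normalized diffusion timestep (assumed schedule)
noisy = (1 - t)[:, None, None] * actions + t[:, None, None] * noise
loss = nn.functional.mse_loss(head(noisy, t, vlm_feature), noise)
print(loss.item())
```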
G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios
Wang, Zeyu, Shi, Yuanchun, Wang, Yuntao, Yao, Yuchen, Yan, Kun, Wang, Yuhan, Ji, Lei, Xu, Xuhai, Yu, Chun
Modern information querying systems are progressively incorporating multimodal inputs such as vision and audio. However, the integration of gaze -- a modality deeply linked to user intent and increasingly accessible via gaze-tracking wearables -- remains underexplored. This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA, which synergizes users' gaze, visual field, and voice-based natural language queries to enable a more intuitive querying process. In a user-enactment study involving 21 participants across 3 daily scenarios, we revealed the ambiguity in users' query language and a gaze-voice coordination pattern in users' natural query behaviors with G-VOILA. Based on the quantitative and qualitative findings, we developed a design framework for the G-VOILA paradigm, which effectively integrates gaze data with the in-situ querying context. We then implemented a G-VOILA proof-of-concept using cutting-edge deep learning techniques. A follow-up user study (16 participants, 2 scenarios) demonstrates its effectiveness, achieving both higher objective and subjective scores than a baseline without gaze data. We further conducted interviews and provide insights for future gaze-facilitated information querying systems.
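As a rough illustration of one way gaze could be integrated with the in-situ querying context, the sketch below crops the visual field around the current fixation and packages it with the transcribed voice query. Field names, the crop size, and the downstream query answerer are assumptions, not G-VOILA's design.

```python
# Sketch: compose a gaze-facilitated query from a frame, a fixation point, and
# a voice query.
from dataclasses import dataclass

import numpy as np

@dataclass
class GazeQuery:
    voice_query: str
    gaze_crop: np.ndarray   # region of interest around the fixation
    full_frame: np.ndarray  # in-situ visual context

def build_query(frame, fixation_xy, voice_query, crop=128):
    h, w = frame.shape[:2]
    x, y = fixation_xy
    x0, y0 = max(0, x - crop // 2), max(0, y - crop // 2)
    roi = frame[y0:min(h, y0 + crop), x0:min(w, x0 + crop)]
    return GazeQuery(voice_query=voice_query, gaze_crop=roi, full_frame=frame)

frame = np.zeros((480, 640, 3), dtype=np.uint8)
q = build_query(frame, fixation_xy=(320, 240), voice_query="What plant is this?")
print(q.gaze_crop.shape)  # (128, 128, 3)
```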
Time2Stop: Adaptive and Explainable Human-AI Loop for Smartphone Overuse Intervention
Orzikulova, Adiba, Xiao, Han, Li, Zhipeng, Yan, Yukang, Wang, Yuntao, Shi, Yuanchun, Ghassemi, Marzyeh, Lee, Sung-Ju, Dey, Anind K, Xu, Xuhai "Orson"
Despite a rich history of investigating smartphone overuse intervention techniques, AI-based just-in-time adaptive intervention (JITAI) methods for overuse reduction are lacking. We develop Time2Stop, an intelligent, adaptive, and explainable JITAI system that leverages machine learning to identify optimal intervention timings, introduces interventions with transparent AI explanations, and collects user feedback to establish a human-AI loop and adapt the intervention model over time. We conducted an 8-week field experiment (N=71) to evaluate the effectiveness of both the adaptation and explanation aspects of Time2Stop. Our results indicate that our adaptive models significantly outperform the baseline methods on intervention accuracy (a relative improvement of more than 32.8%) and receptivity (more than 8.0%). Incorporating explanations further enhances effectiveness by 53.8% and 11.4% on accuracy and receptivity, respectively. Moreover, Time2Stop significantly reduces overuse, decreasing app visit frequency by 7.0-8.9%. Our subjective data echoed these quantitative measures: participants preferred the adaptive interventions and rated the system highly on intervention time accuracy, effectiveness, and level of trust. We envision this work inspiring future research on JITAI systems with a human-AI loop that evolve with users.
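A toy sketch of the human-AI loop idea, assuming an online classifier that predicts whether the current moment is a good time to intervene and that adapts on each piece of user feedback. The context features, labels, and model choice are synthetic stand-ins, not Time2Stop's actual pipeline.

```python
# Sketch: online adaptation of an intervention-timing model from user feedback.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # 0 = do not intervene, 1 = intervene

rng = np.random.default_rng(0)
for step in range(50):
    # Hypothetical context features: session length, hour of day, scroll rate.
    context = rng.standard_normal((1, 3))
    if step == 0:
        model.partial_fit(context, [1], classes=classes)  # bootstrap with one example
        continue
    if model.predict(context)[0] == 1:
        # An intervention was shown; the user's reaction (synthetic here) becomes
        # the label that adapts the model, closing the human-AI loop.
        user_accepted = int(rng.random() < 0.6)
        model.partial_fit(context, [user_accepted])

print(model.predict(rng.standard_normal((1, 3))))
```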
MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human Activity Recognition
Gao, Ziqi, Wang, Yuntao, Chen, Jianguo, Xing, Junliang, Patel, Shwetak, Liu, Xin, Shi, Yuanchun
Multimodal sensors provide complementary information for developing accurate machine-learning methods for human activity recognition (HAR), but they introduce significantly higher computational load, which reduces efficiency. This paper proposes an efficient multimodal neural architecture for HAR using an RGB camera and inertial measurement units (IMUs), called the Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first transforms IMU sensor data into a temporal and structure-preserving gray-scale image using the Gramian Angular Field (GAF), representing the inherent properties of human activities. MMTSA then applies a multimodal sparse sampling method to reduce data redundancy. Lastly, MMTSA adopts an inter-segment attention module for efficient multimodal fusion. Using three well-established public datasets, we evaluated MMTSA's effectiveness and efficiency in HAR. Results show that our method achieves superior performance, with an 11.13% improvement in cross-subject F1-score on the MMAct dataset over previous state-of-the-art (SOTA) methods. The ablation study and analysis demonstrate MMTSA's effectiveness in fusing multimodal data for accurate HAR. An efficiency evaluation on an edge device showed that MMTSA achieved significantly better accuracy, lower computational load, and lower inference latency than SOTA methods.
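For reference, the standard Gramian Angular Summation Field encoding mentioned above rescales a 1-D series to [-1, 1], maps each value to a polar angle, and forms an image of pairwise angle sums. The sketch below shows this textbook GASF; MMTSA's exact preprocessing (windowing, channels, normalization) may differ.

```python
# Sketch: encode a 1-D IMU channel as a Gramian Angular Summation Field image.
import numpy as np

def gramian_angular_field(series):
    x = np.asarray(series, dtype=float)
    # Rescale to [-1, 1] so the values are valid cosines.
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    phi = np.arccos(np.clip(x, -1, 1))  # polar angle for each time step
    # GASF[i, j] = cos(phi_i + phi_j) preserves temporal correlations.
    return np.cos(phi[:, None] + phi[None, :])

imu_channel = np.sin(np.linspace(0, 4 * np.pi, 64))  # toy accelerometer trace
gaf_image = gramian_angular_field(imu_channel)
print(gaf_image.shape, gaf_image.min(), gaf_image.max())  # (64, 64), values in [-1, 1]
```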
MindShift: Leveraging Large Language Models for Mental-States-Based Problematic Smartphone Use Intervention
Wu, Ruolan, Yu, Chun, Pan, Xiaole, Liu, Yujia, Zhang, Ningning, Fu, Yue, Wang, Yuhan, Zheng, Zhi, Chen, Li, Jiang, Qiaolei, Xu, Xuhai, Shi, Yuanchun
Problematic smartphone use negatively affects physical and mental health. Despite the wide range of prior research, existing persuasive techniques are not flexible enough to provide dynamic persuasion content based on users' physical contexts and mental states. We first conduct a Wizard-of-Oz study (N=12) and an interview study (N=10) to summarize the mental states behind problematic smartphone use: boredom, stress, and inertia. This informs our design of four persuasion strategies: understanding, comforting, evoking, and scaffolding habits. We leverage large language models (LLMs) to enable the automatic and dynamic generation of effective persuasion content. We develop MindShift, a novel LLM-powered problematic smartphone use intervention technique. MindShift takes users' in-the-moment physical contexts, mental states, app usage behaviors, users' goals & habits as input, and generates high-quality and flexible persuasive content with appropriate persuasion strategies. We conduct a 5-week field experiment (N=25) to compare MindShift with baseline techniques. The results show that MindShift significantly improves intervention acceptance rates by 17.8-22.5% and reduces smartphone use frequency by 12.1-14.4%. Moreover, users have a significant drop in smartphone addiction scale scores and a rise in self-efficacy. Our study sheds light on the potential of leveraging LLMs for context-aware persuasion in other behavior change domains.
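A small sketch of how a prompt might combine the in-the-moment context, an inferred mental state, and a matching persuasion strategy before being sent to an LLM. The state-to-strategy mapping and the prompt wording are illustrative assumptions, not MindShift's actual prompts.

```python
# Sketch: build a persuasion prompt from context, mental state, and goal.
STRATEGY_BY_STATE = {
    "boredom": "evoking",
    "stress": "comforting",
    "inertia": "scaffolding habits",
}

def build_persuasion_prompt(context, mental_state, app, goal):
    strategy = STRATEGY_BY_STATE.get(mental_state, "understanding")
    return (
        f"The user is {context} and feels {mental_state}, currently opening {app}. "
        f"Their goal is: {goal}. Using the '{strategy}' persuasion strategy, write a "
        "brief, empathetic message that encourages them to put the phone down."
    )

print(build_persuasion_prompt(
    context="in the library before an exam",
    mental_state="stress",
    app="a short-video app",
    goal="finish reviewing chapter 3 tonight",
))
```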
EarCough: Enabling Continuous Subject Cough Event Detection on Hearables
Zhang, Xiyuxing, Wang, Yuntao, Zhang, Jingru, Yang, Yaqing, Patel, Shwetak, Shi, Yuanchun
Cough monitoring can enable new individual pulmonary health applications. Subject cough event detection is the foundation of continuous cough monitoring. Recently, the rapid growth of smart hearables has opened new opportunities for meeting this need. This paper proposes EarCough, which enables continuous subject cough event detection on edge-computing hearables by leveraging the always-on active noise cancellation (ANC) microphones. Specifically, we propose a lightweight end-to-end neural network model -- EarCoughNet. To evaluate the effectiveness of our method, we constructed a synchronized motion and audio dataset through a user study. Results show that EarCough achieves an accuracy of 95.4% and an F1-score of 92.9% with a space requirement of only 385 kB. We envision EarCough as a low-cost add-on for future hearables to enable continuous subject cough event detection.
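A minimal sketch of a lightweight end-to-end classifier over short microphone frames, in the spirit of an on-device cough detector. The architecture, input length, sample rate, and size budget are assumptions, not EarCoughNet itself.

```python
# Sketch: tiny 1-D CNN mapping an audio frame to cough / non-cough logits.
import torch
import torch.nn as nn

class TinyCoughNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(16, 2)  # cough vs. non-cough

    def forward(self, waveform):          # waveform: (batch, 1, samples)
        return self.classifier(self.features(waveform).squeeze(-1))

model = TinyCoughNet()
frame = torch.randn(1, 1, 16000)          # ~1 s of audio at an assumed 16 kHz rate
print(model(frame).shape)                 # torch.Size([1, 2])
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} parameters")           # small enough for hearable-class hardware
```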