Hu, Jingtong
EmpathyAgent: Can Embodied Agents Conduct Empathetic Actions?
Chen, Xinyan, Ge, Jiaxin, Dai, Hongming, Zhou, Qiang, Feng, Qiuxuan, Hu, Jingtong, Wang, Yizhou, Liu, Jiaming, Zhang, Shanghang
Empathy is fundamental to human interactions, yet it remains unclear whether embodied agents can provide human-like empathetic support. Existing works have studied agents' task-solving and social-interaction abilities, but whether agents can understand empathetic needs and conduct empathetic behaviors remains overlooked. To address this, we introduce EmpathyAgent, the first benchmark to evaluate and enhance agents' empathetic actions across diverse scenarios. EmpathyAgent contains 10,000 multimodal samples with corresponding empathetic task plans and three distinct challenges. To systematically evaluate agents' empathetic actions, we propose an empathy-specific evaluation suite that assesses the agents' empathy process. We benchmark current models and find that exhibiting empathetic actions remains a significant challenge. Meanwhile, we train Llama3-8B using EmpathyAgent and find that it can potentially enhance empathetic behavior. By establishing a standard benchmark for evaluating empathetic actions, we hope to advance research in empathetic embodied agents. Our code and data are publicly available at https://github.com/xinyan-cxy/EmpathyAgent.
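As a rough illustration of what scoring a generated task plan against a reference plan might look like, here is a minimal, self-contained sketch; the function names and greedy matching rule are hypothetical and far simpler than EmpathyAgent's actual empathy-specific evaluation suite.

```python
# Hypothetical sketch of plan-level evaluation: the real EmpathyAgent suite
# is richer (multimodal inputs, empathy-process metrics); names here are
# illustrative, not the benchmark's API.
from difflib import SequenceMatcher

def step_similarity(pred: str, ref: str) -> float:
    """Character-level similarity between a predicted and a reference step."""
    return SequenceMatcher(None, pred.lower(), ref.lower()).ratio()

def plan_score(pred_plan: list[str], ref_plan: list[str]) -> float:
    """Greedily match predicted steps to reference steps and average."""
    remaining = list(ref_plan)
    total = 0.0
    for step in pred_plan:
        if not remaining:
            break
        best = max(remaining, key=lambda r: step_similarity(step, r))
        total += step_similarity(step, best)
        remaining.remove(best)
    return total / max(len(ref_plan), 1)

ref = ["notice the user looks sad", "offer a glass of water", "sit nearby and listen"]
pred = ["observe user's mood", "bring water to the user", "sit with the user"]
print(f"plan score: {plan_score(pred, ref):.2f}")
```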
DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis
Wang, Pan, Zhou, Qiang, Wu, Yawen, Chen, Tianlong, Hu, Jingtong
Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such as language, vision, and audio, to enhance the understanding of human sentiment. While existing models often focus on extracting shared information across modalities or directly fusing heterogeneous modalities, such approaches can introduce redundancy and conflicts due to equal treatment of all modalities and the mutual transfer of information between modality pairs. To address these issues, we propose a Disentangled-Language-Focused (DLF) multimodal representation learning framework, which incorporates a feature disentanglement module to separate modality-shared and modality-specific information. To further reduce redundancy and enhance language-targeted features, four geometric measures are introduced to refine the disentanglement process. A Language-Focused Attractor (LFA) is further developed to strengthen language representation by leveraging complementary modality-specific information through a language-guided cross-attention mechanism. The framework also employs hierarchical predictions to improve overall accuracy. Extensive experiments on two popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant performance gains achieved by the proposed DLF framework. Comprehensive ablation studies further validate the effectiveness of the feature disentanglement module, language-focused attractor, and hierarchical predictions. Our code is available at https://github.com/pwang322/DLF.
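To make the language-guided cross-attention idea concrete, the following PyTorch sketch shows how language tokens could query another modality's specific features; the module name, dimensions, and residual design are assumptions for illustration, not the DLF reference implementation.

```python
# Minimal PyTorch sketch of a language-guided cross-attention step in the
# spirit of the Language-Focused Attractor; all shapes are placeholders.
import torch
import torch.nn as nn

class LanguageFocusedAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lang, other):
        # Language tokens query the other modality's specific features,
        # pulling complementary information into the language stream.
        attended, _ = self.attn(query=lang, key=other, value=other)
        return self.norm(lang + attended)  # residual keeps language dominant

lfa = LanguageFocusedAttention(dim=64)
lang = torch.randn(2, 10, 64)    # (batch, language tokens, dim)
audio = torch.randn(2, 20, 64)   # (batch, audio tokens, dim)
print(lfa(lang, audio).shape)    # torch.Size([2, 10, 64])
```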
EdgeOL: Efficient in-situ Online Learning on Edge Devices
Li, Sheng, Yuan, Geng, Wu, Yawen, Dai, Yue, Wu, Chao, Jones, Alex K., Hu, Jingtong, Wang, Yanzhi, Tang, Xulong
Emerging applications, such as robot-assisted eldercare and object recognition, generally employ deep neural network (DNN) models and naturally require: i) handling streaming inference requests and ii) adapting to possible deployment scenario changes. Online model fine-tuning is widely adopted to satisfy these needs. However, fine-tuning involves significant energy consumption, making it challenging to deploy on edge devices. In this paper, we propose EdgeOL, an edge online learning framework that optimizes inference accuracy, fine-tuning execution time, and energy efficiency through both inter-tuning and intra-tuning optimizations. Experimental results show that, on average, EdgeOL reduces overall fine-tuning execution time by 82% and energy consumption by 74%, and improves average inference accuracy by 1.70% over the immediate online learning strategy.
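The two optimization levels can be illustrated with a small sketch: an inter-tuning rule that decides when to fine-tune and an intra-tuning rule that limits how much of the model is updated. The threshold and freezing policy below are illustrative assumptions, not EdgeOL's actual policies.

```python
# Hedged sketch of the two levels EdgeOL targets: deciding *when* to
# fine-tune (inter-tuning) and *how much* to update (intra-tuning).
import torch.nn as nn

def freeze_early_layers(model: nn.Sequential, n_trainable: int):
    """Intra-tuning: update only the last n_trainable layers to save energy."""
    layers = list(model.children())
    for layer in layers[: len(layers) - n_trainable]:
        for p in layer.parameters():
            p.requires_grad = False

def should_finetune(recent_acc: list[float], threshold: float = 0.05) -> bool:
    """Inter-tuning: trigger fine-tuning only when accuracy drifts down."""
    if len(recent_acc) < 2:
        return False
    return (recent_acc[0] - recent_acc[-1]) > threshold

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
freeze_early_layers(model, n_trainable=1)
print(should_finetune([0.91, 0.88, 0.84]))  # True: drift exceeds threshold
```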
Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis
Qin, Ruiyang, Xia, Jun, Jia, Zhenge, Jiang, Meng, Abbasi, Ahmed, Zhou, Peipei, Hu, Jingtong, Shi, Yiyu
After a large language model (LLM) is deployed on edge devices, it is desirable for these devices to learn from user-generated conversation data to generate user-specific and personalized responses in real time. However, user-generated data usually contains sensitive and private information, and uploading such data to the cloud for annotation is undesirable, if not prohibited. While it is possible to obtain annotations locally by directly asking users to provide preferred responses, such annotations must remain sparse so as not to degrade the user experience. In addition, the storage of edge devices is usually too limited to enable large-scale fine-tuning with the full set of user-generated data. How to enable on-device LLM personalization under sparse annotation and limited on-device storage remains an open question. In this paper, we propose a novel framework to select and store the most representative data online in a self-supervised way. Such data has a small memory footprint and allows infrequent requests for user annotations for further fine-tuning. To enhance fine-tuning quality, multiple semantically similar pairs of question texts and expected responses are generated using the LLM. Our experiments show that the proposed framework achieves the best user-specific content-generating capability (accuracy) and fine-tuning speed (performance) compared with vanilla baselines. To the best of our knowledge, this is the first on-device LLM personalization framework.
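One plausible way to realize self-supervised selection of representative data is a k-center-style rule over embeddings, sketched below; the selection criterion and the stand-in embeddings are assumptions for illustration rather than the paper's exact method.

```python
# Illustrative sketch: keep a small, representative buffer of user
# utterances via a greedy k-center rule on their embeddings.
import numpy as np

def kcenter_select(embeddings: np.ndarray, budget: int) -> list[int]:
    """Greedy k-center: keep points that maximize coverage of the stream."""
    chosen = [0]
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(budget - 1):
        idx = int(np.argmax(dists))          # farthest from current set
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[idx], axis=1))
    return chosen

rng = np.random.default_rng(0)
stream = rng.normal(size=(500, 16))          # stand-in sentence embeddings
buffer = kcenter_select(stream, budget=8)    # fits limited on-device storage
print(buffer)
```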
Development of A Real-time POCUS Image Quality Assessment and Acquisition Guidance System
Jia, Zhenge, Shi, Yiyu, Hu, Jingtong, Yang, Lei, Nti, Benjamin
Automated acquisition guidance can help novice learners learn how to manipulate probes to acquire high-quality POCUS images. However, the difference between human vision and machine vision may confuse novice learners, because the guidance from sonographers and from the DNN model may be inconsistent due to the different ways DNN-based models and humans interpret images [6]. For example, DNNs rely more on textures and high-frequency features than on shapes and low-frequency features. Our previous works show that machine-learning-based detection methods on medical images can achieve significantly high accuracy [7-15]. This means images considered high-quality by humans may not be high-quality for DNN-based machine vision. The discrepancy becomes more significant due to the large inter-observer variability. As a result, if the acquired images are of high quality for both humans and DNN-based segmentation models, the segmentation accuracy can be improved, and accordingly, the human effort needed to correct or identify bad segmentations can be reduced. As an example from a closely related area, previous works on medical image compression demonstrated that images compressed with machine vision taken into account achieved significantly higher segmentation accuracy than conventional methods that only consider human vision at the same compression rate [16]. Therefore, we will explore methods to optimize image quality assessment and acquisition guidance to be preferred by both human and machine models.
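As a concrete but hypothetical instance of optimizing quality for both human and machine vision, one could blend a human-oriented quality score with the segmentation model's confidence, as in the sketch below; the weighting and confidence proxy are assumptions, not the system's actual metric.

```python
# Minimal sketch of a joint quality score: blend a human-rated quality
# estimate with DNN segmentation confidence so that acquisition guidance
# favors images that are good for both. alpha and the proxy are assumed.
import numpy as np

def joint_quality(human_score: float, seg_probs: np.ndarray, alpha: float = 0.5) -> float:
    """Weighted blend of human quality and machine segmentation confidence."""
    # Mean max-class probability as a simple machine-confidence proxy.
    machine_conf = float(seg_probs.max(axis=-1).mean())
    return alpha * human_score + (1 - alpha) * machine_conf

probs = np.array([[0.9, 0.1], [0.7, 0.3], [0.6, 0.4]])  # per-pixel class probs
print(f"joint quality: {joint_quality(0.8, probs):.2f}")
```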
Synthetic Data Can Also Teach: Synthesizing Effective Data for Unsupervised Visual Representation Learning
Wu, Yawen, Wang, Zhepeng, Zeng, Dewen, Shi, Yiyu, Hu, Jingtong
Contrastive learning (CL), a self-supervised learning approach, can effectively learn visual representations from unlabeled data. Given the CL training data, generative models can be trained to generate synthetic data to supplement the real data. Using both synthetic and real data for CL training has the potential to improve the quality of learned representations. However, synthetic data usually has lower quality than real data, and using synthetic data may not improve CL compared with using real data alone. To tackle this problem, we propose a data generation framework with two methods to improve CL training by joint sample generation and contrastive learning. The first method generates hard samples for the main model. The generator is jointly learned with the main model to dynamically customize hard samples based on the training state of the main model. In addition, a pair of data generators is proposed to generate similar but distinct samples as positive pairs. In joint learning, the hardness of a positive pair is progressively increased by decreasing their similarity. Experimental results on multiple datasets show the superior accuracy and data efficiency of the proposed data generation methods applied to CL. For example, accuracy improvements of about 4.0%, 3.5%, and 2.6% for linear classification are observed on ImageNet-100, CIFAR-100, and CIFAR-10, respectively. Moreover, up to 2x data efficiency for linear classification and up to 5x data efficiency for transfer learning are achieved.
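The positive-pair idea can be sketched as two generators that share a latent code and feed a standard contrastive loss; the tiny architectures below are placeholders, and progressively annealing the generators apart over training would be what raises pair hardness.

```python
# Sketch of generated positive pairs: two generators share a latent code
# and produce similar-but-distinct views scored by a standard NT-Xent loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGenerator(nn.Module):
    def __init__(self, z_dim=16, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, z):
        return self.net(z)

g1, g2 = TinyGenerator(), TinyGenerator()
z = torch.randn(8, 16)                 # shared latent -> correlated views
view1, view2 = g1(z), g2(z)            # a synthetic positive pair

def nt_xent(a, b, tau=0.5):
    """Normalized-temperature cross-entropy over a batch of pairs."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / tau           # (i, j): similarity of a_i to b_j
    labels = torch.arange(a.size(0))   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

print(nt_xent(view1, view2).item())
```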
Sustainable AI Processing at the Edge
Ollivier, Sรฉbastien, Li, Sheng, Tang, Yue, Chaudhuri, Chayanika, Zhou, Peipei, Tang, Xulong, Hu, Jingtong, Jones, Alex K.
Deep neural networks have become a popular approach for a variety of applications on mobile devices, including smartphones, and more recently connected and autonomous vehicles (CAVs), robotics, unmanned aerial vehicles (UAVs), and other smart infrastructure. Convolutional Neural Networks (CNNs) have been demonstrated to provide solutions to these problems with relatively high accuracy. While there have been many proposals to improve the performance and energy efficiency of CNN inference, these algorithms are too compute- and data-intensive to execute directly on mobile nodes, which typically operate with limited computational and energy capabilities. Thus, edge servers, now often deployed in conjunction with advanced (e.g., 5G) wireless networks, have become a popular target for accelerating CNN inference. Moreover, due to their deployment in the field, edge servers must operate under size, weight, and power (SWaP) constraints while serving many concurrent requests from mobile clients. Thus, to accelerate CNNs, these edge servers often use energy-efficient accelerators, reduced precision, or both to achieve fast response times while balancing requests from multiple clients and maintaining a low operational energy cost. Recently, there has been a trend to push online training to edge server nodes to avoid communicating large datasets from edge to cloud servers [1]. However, online training typically requires much higher precision and floating-point computation compared to inference. Unfortunately, the proliferation of computing, in both the mobile devices and the edge servers themselves, can come with negative environmental impacts.
Federated Contrastive Learning for Dermatological Disease Diagnosis via On-device Learning
Wu, Yawen, Zeng, Dewen, Wang, Zhepeng, Sheng, Yi, Yang, Lei, James, Alaina J., Shi, Yiyu, Hu, Jingtong
Deep learning models have been deployed in an increasing number of edge and mobile devices to provide healthcare. These models rely on training with a tremendous amount of labeled data to achieve high accuracy. However, for medical applications such as dermatological disease diagnosis, the private data collected by mobile dermatology assistants exist on patients' distributed mobile devices, and each device only has a limited amount of data. Directly learning from such limited data greatly deteriorates the performance of learned models. Federated learning (FL) can train models by using data distributed on devices while keeping the data local for privacy. Existing works on FL assume all the data have ground-truth labels. However, medical data often comes without any accompanying labels, since labeling requires expertise and results in prohibitively high labor costs. The recently developed self-supervised learning approach, contrastive learning (CL), can leverage unlabeled data to pre-train a model, after which the model is fine-tuned on limited labeled data for dermatological disease diagnosis. However, simply combining CL with FL as federated contrastive learning (FCL) results in ineffective learning, since CL requires diverse data for learning but each device only has limited data. In this work, we propose an on-device FCL framework for dermatological disease diagnosis with limited labels. Features are shared in the FCL pre-training process to provide diverse and accurate contrastive information. After that, the pre-trained model is fine-tuned with local labeled data independently on each device or collaboratively with supervised federated learning on all devices. Experiments on dermatological disease datasets show that the proposed framework effectively improves the recall and precision of dermatological disease diagnosis compared with state-of-the-art methods.
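The feature-sharing idea can be sketched as an InfoNCE loss whose negative pool is enlarged with feature vectors contributed by peer devices; the sketch below omits the paper's aggregation and privacy mechanics and uses random tensors as stand-ins.

```python
# Hedged sketch of feature sharing in FCL pre-training: each device
# contrasts its local views against features shared by peers, enlarging
# the pool of negatives despite limited local data.
import torch
import torch.nn.functional as F

def fcl_loss(local_a, local_b, shared_feats, tau=0.5):
    """InfoNCE with remote shared features appended as extra negatives."""
    local_a = F.normalize(local_a, dim=1)
    keys = F.normalize(torch.cat([local_b, shared_feats]), dim=1)
    logits = local_a @ keys.t() / tau
    labels = torch.arange(local_a.size(0))   # matching local view is positive
    return F.cross_entropy(logits, labels)

a, b = torch.randn(4, 32), torch.randn(4, 32)    # two views on one device
peers = torch.randn(64, 32)                      # features from other devices
print(fcl_loss(a, b, peers).item())
```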
Personalized Deep Learning for Ventricular Arrhythmias Detection on Medical IoT Systems
Jia, Zhenge, Wang, Zhepeng, Hong, Feng, Ping, Lichuan, Shi, Yiyu, Hu, Jingtong
Life-threatening ventricular arrhythmias (VA) are the leading cause of sudden cardiac death (SCD), which is the most common cause of natural death in the US. The implantable cardioverter defibrillator (ICD) is a small device implanted in patients at high risk of SCD as a preventive treatment. The ICD continuously monitors the intracardiac rhythm and delivers a shock upon detecting life-threatening VA. Traditional methods detect VA by setting criteria on the detected rhythm. However, those methods suffer from a high inappropriate shock rate and require regular follow-ups to optimize criteria parameters for each ICD recipient. To address these challenges, we propose a personalized computing framework for deep-learning-based VA detection on medical IoT systems. The system consists of intracardiac and surface rhythm monitors and a cloud platform for data uploading, diagnosis, and CNN model personalization. We equip the system with real-time inference on both the intracardiac and surface rhythm monitors. To improve detection accuracy, we enable the monitors to detect VA collaboratively through a proposed cooperative inference scheme. We also introduce CNN personalization for each patient, built on the computing framework, to tackle the problem of unlabeled and limited rhythm data. Compared with the traditional detection algorithm, the proposed method achieves comparable accuracy on VA rhythm detection and a 6.6% reduction in the inappropriate shock rate, while keeping the average inference latency at 71 ms.
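A minimal, hypothetical form of cooperative inference is a weighted fusion of the two monitors' VA probabilities before the shock decision, as sketched below; the weights and threshold are illustrative and would need per-patient calibration in a real system.

```python
# Illustrative fusion rule for cooperative inference: combine the VA
# probabilities from the intracardiac and surface monitors before the
# shock decision. Weights and threshold are assumptions, not the
# deployed system's calibrated values.
def cooperative_va_decision(p_intracardiac: float, p_surface: float,
                            w: float = 0.7, threshold: float = 0.5) -> bool:
    """Weighted fusion of the two monitors' VA probabilities."""
    fused = w * p_intracardiac + (1 - w) * p_surface
    return fused > threshold

# Intracardiac signal is borderline; surface rhythm disagrees, so no shock.
print(cooperative_va_decision(0.55, 0.20))  # False
print(cooperative_va_decision(0.80, 0.70))  # True
```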
Standing on the Shoulders of Giants: Hardware and Neural Architecture Co-Search with Hot Start
Jiang, Weiwen, Yang, Lei, Dasgupta, Sakyasingha, Hu, Jingtong, Shi, Yiyu
Hardware and neural architecture co-search, which automatically generates Artificial Intelligence (AI) solutions from a given dataset, is promising for promoting AI democratization; however, the time required by current co-search frameworks is on the order of hundreds of GPU hours for one target hardware. This inhibits the use of such frameworks on commodity hardware. The root cause of the low efficiency in existing co-search frameworks is that they start from a "cold" state (i.e., search from scratch). In this paper, we propose a novel framework, namely HotNAS, that starts from a "hot" state based on a set of existing pre-trained models (a.k.a. a model zoo) to avoid lengthy training time. As such, the search time can be reduced from 200 GPU hours to less than 3 GPU hours. In HotNAS, in addition to the hardware design space and the neural architecture search space, we further integrate a compression space to perform model compression during the co-search, which creates new opportunities to reduce latency but also brings challenges. One of the key challenges is that all of the above search spaces are coupled with one another; e.g., compression may not work without hardware design support. To tackle this issue, HotNAS builds a chain of tools to design hardware that supports compression, on top of which a global optimizer is developed to automatically co-search all the involved search spaces. Experiments on the ImageNet dataset and a Xilinx FPGA show that, within a timing constraint of 5 ms, neural architectures generated by HotNAS can achieve up to 5.79% Top-1 and 3.97% Top-5 accuracy gains compared with existing ones.
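The "hot" start can be illustrated with a small sketch that seeds the search from the most accurate model-zoo entry meeting the latency budget; the zoo entries and selection rule below are mocked for illustration, not HotNAS's actual optimizer.

```python
# Minimal sketch of a "hot" start: rank model-zoo candidates under the
# target latency constraint and seed the co-search from the best feasible
# one. Entries and the latency figures are mocked.
zoo = [
    {"name": "net_a", "top1": 74.1, "latency_ms": 7.2},
    {"name": "net_b", "top1": 72.8, "latency_ms": 4.6},
    {"name": "net_c", "top1": 71.5, "latency_ms": 3.9},
]

def hot_start(zoo, latency_budget_ms: float):
    """Pick the most accurate pre-trained model that meets the budget."""
    feasible = [m for m in zoo if m["latency_ms"] <= latency_budget_ms]
    if not feasible:
        return None  # would fall back to compression / hardware co-search
    return max(feasible, key=lambda m: m["top1"])

seed = hot_start(zoo, latency_budget_ms=5.0)
print(seed["name"])  # net_b: best accuracy within the 5 ms constraint
```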