Yang, Zhihao
Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation
Ji, Haoyu, Chen, Bowen, Ren, Weihong, Huang, Wenze, Yang, Zhihao, Wang, Zhiyong, Liu, Honghai
Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLMs) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results. The code is available at https://github.com/HaoyuJi/TRG-Net.
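As a rough illustration of the supervision idea described in this abstract, the sketch below contrasts frame-level action features with per-class text embeddings (the absolute term) and regularizes pairwise class-feature similarities toward a text-derived action graph (the relative term). This is a minimal sketch under assumed tensor shapes, loss forms, and weighting, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def aris_loss(action_feats, text_embeds, labels, action_graph, tau=0.07, lam=0.5):
    """Sketch of absolute-relative inter-class supervision (assumed shapes).

    action_feats : (T, D) frame-level action features from the segmentation model
    text_embeds  : (C, D) LLM-generated text embeddings, one per action class
    labels       : (T,)   ground-truth class index per frame
    action_graph : (C, C) text-derived action graph (prior inter-class similarity)
    """
    f = F.normalize(action_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)

    # Absolute term: align each frame feature with its own class text embedding
    # via a contrastive objective over all classes.
    logits = f @ t.T / tau                                   # (T, C) cosine logits
    absolute = F.cross_entropy(logits, labels)

    # Relative term: pairwise similarities between per-class feature means
    # should follow the text-derived action graph.
    present = labels.unique()                                # classes appearing in this sequence
    class_feats = torch.stack([f[labels == c].mean(dim=0) for c in present])
    class_feats = F.normalize(class_feats, dim=-1)
    feat_sim = class_feats @ class_feats.T                   # (|present|, |present|)
    relative = F.mse_loss(feat_sim, action_graph[present][:, present])

    return absolute + lam * relative
```

Normalizing features and embeddings keeps the absolute term a cosine-similarity contrast, and restricting the relative term to classes present in the sequence avoids averaging over empty classes.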
Xpert: Empowering Incident Management with Query Recommendations via Large Language Models
Jiang, Yuxuan, Zhang, Chaoyun, He, Shilin, Yang, Zhihao, Ma, Minghua, Qin, Si, Kang, Yu, Dang, Yingnong, Rajmohan, Saravan, Lin, Qingwei, Zhang, Dongmei
Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents occurring within these systems can lead to service disruptions and adversely affect user experience. To swiftly resolve such incidents, on-call engineers depend on crafting domain-specific language (DSL) queries to analyze telemetry data, but writing these queries can be challenging and time-consuming. This paper presents a thorough empirical study on the utilization of queries in KQL, a DSL employed for incident management in a large-scale cloud management system at Microsoft. The findings underscore the importance and viability of KQL query recommendation for enhancing incident management. Building upon these insights, we introduce Xpert, an end-to-end machine learning framework that automates the KQL recommendation process. By leveraging historical incident data and large language models, Xpert generates customized KQL queries tailored to new incidents. Furthermore, Xpert incorporates a novel performance metric called Xcore, enabling a thorough evaluation of query quality from three comprehensive perspectives. We conduct extensive evaluations of Xpert, demonstrating its effectiveness in offline settings. Notably, we deploy Xpert in the real production environment of a large-scale incident management system at Microsoft, validating its efficiency in supporting incident management. To the best of our knowledge, this paper represents the first empirical study of its kind, and Xpert stands as a pioneering DSL query recommendation framework designed for incident management.
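The pipeline described here, generating a KQL query for a new incident from historical incident data with a large language model, could be approximated by a retrieval-augmented prompt like the hypothetical sketch below; the similarity measure, prompt wording, and the `llm` callable are assumptions, not Xpert's actual components.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List

@dataclass
class Incident:
    title: str
    summary: str
    kql_query: str  # query the on-call engineer eventually used

def recommend_kql(new_incident: Incident,
                  history: List[Incident],
                  llm: Callable[[str], str],
                  k: int = 3) -> str:
    """Retrieve similar past incidents and ask an LLM to draft a KQL query."""
    # Rank historical incidents by rough textual similarity to the new one.
    def sim(a: Incident, b: Incident) -> float:
        return SequenceMatcher(None, a.title + a.summary, b.title + b.summary).ratio()

    examples = sorted(history, key=lambda h: sim(new_incident, h), reverse=True)[:k]

    # Few-shot prompt: similar incidents paired with the queries that resolved them.
    shots = "\n\n".join(
        f"Incident: {h.title}\nDetails: {h.summary}\nKQL: {h.kql_query}" for h in examples
    )
    prompt = (
        "You recommend KQL queries for cloud incident diagnosis.\n\n"
        f"{shots}\n\n"
        f"Incident: {new_incident.title}\nDetails: {new_incident.summary}\nKQL:"
    )
    return llm(prompt)  # e.g. a call to any chat-completion API
```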
Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks
Luo, Ling, Ning, Jinzhong, Zhao, Yingwen, Wang, Zhijun, Ding, Zeyuan, Chen, Peng, Fu, Weiru, Han, Qinyu, Xu, Guangtao, Qiu, Yunzhi, Pan, Dinghao, Li, Jiru, Li, Hao, Feng, Wenduo, Tu, Senbo, Liu, Yuqi, Yang, Zhihao, Wang, Jian, Sun, Yuanyuan, Lin, Hongfei
Objective: Most existing fine-tuned biomedical large language models (LLMs) focus on enhancing performance in monolingual biomedical question answering and conversation tasks. To investigate the effectiveness of fine-tuned LLMs on diverse biomedical NLP tasks in different languages, we present Taiyi, a bilingual fine-tuned LLM for diverse biomedical tasks. Materials and Methods: We first curated a comprehensive collection of 140 existing biomedical text mining datasets (102 English and 38 Chinese) covering more than 10 task types. We then propose a two-stage supervised fine-tuning strategy to optimize model performance across these varied tasks. Results: Experimental results on 13 test sets covering named entity recognition, relation extraction, text classification, and question answering demonstrate that Taiyi achieves superior performance compared to general LLMs. A case study involving additional biomedical NLP tasks further shows Taiyi's considerable potential for bilingual biomedical multi-tasking. Conclusion: Leveraging rich, high-quality biomedical corpora and developing effective fine-tuning strategies can significantly improve the performance of LLMs within the biomedical domain. Taiyi demonstrates bilingual multi-tasking capability through supervised fine-tuning. However, tasks such as information extraction, which are not inherently generative, remain challenging for LLM-based generative approaches and still underperform conventional discriminative approaches built on smaller language models.
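Fine-tuning one model across 140 datasets and more than 10 task types implies casting heterogeneous records into a shared instruction format before supervised fine-tuning. The sketch below shows one hypothetical way to do that; the templates, task keys, and examples are illustrative and not the paper's actual prompts.

```python
from typing import Dict, List

# Hypothetical instruction templates per task type (illustrative only).
TEMPLATES: Dict[str, str] = {
    "ner": "Extract all {entity_type} entities from the following text:\n{text}",
    "text_classification": "Classify the following text into one of {labels}:\n{text}",
    "question_answering": "Answer the question based on the context.\nContext: {context}\nQuestion: {question}",
}

def to_instruction_sample(task: str, record: Dict[str, str], answer: str) -> Dict[str, str]:
    """Cast one task-specific record into a generic (instruction, output) pair."""
    return {"instruction": TEMPLATES[task].format(**record), "output": answer}

# English NER and Chinese classification records end up in one shared format.
samples: List[Dict[str, str]] = [
    to_instruction_sample("ner",
                          {"entity_type": "chemical", "text": "Aspirin reduces the risk of stroke."},
                          "Aspirin"),
    to_instruction_sample("text_classification",
                          {"labels": "['diagnosis', 'treatment', 'prevention']",
                           "text": "每日服用低剂量阿司匹林可预防心血管疾病。"},
                          "prevention"),
]
```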
BokehOrNot: Transforming Bokeh Effect with Image Transformer and Lens Metadata Embedding
Yang, Zhihao, Lian, Wenyi, Lai, Siyuan
Bokeh effect is an optical phenomenon that offers a pleasant visual experience, typically generated by high-end cameras with wide-aperture lenses. The task of bokeh effect transformation aims to render the bokeh produced by one lens and aperture combination from an image captured with another. Current models are limited to rendering a specific set of bokeh effects, primarily transformations from sharp to blur. In this paper, we propose a novel universal method for embedding lens metadata into the model and introduce a loss calculation method using alpha masks from the newly released Bokeh Effect Transformation Dataset (BETD) [3]. Based on these techniques, we propose the BokehOrNot model, which is capable of producing both blur-to-sharp and sharp-to-blur bokeh effects with various combinations of lenses and aperture sizes. Our proposed model outperforms current leading bokeh rendering and image restoration models and renders visually natural bokeh effects. Our code is available at: https://github.com/indicator0/bokehornot.
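A minimal sketch of the two ingredients named in this abstract: projecting lens metadata into a conditioning embedding, and weighting the reconstruction loss by the BETD alpha mask. The module layout, metadata fields, and loss form are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class MetadataEmbed(nn.Module):
    """Project lens metadata (e.g. source/target lens ids and f-numbers) into a conditioning vector."""
    def __init__(self, meta_dim: int = 4, embed_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(meta_dim, embed_dim), nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, meta: torch.Tensor) -> torch.Tensor:
        # meta: (B, meta_dim) -> (B, embed_dim), injected into the transformer as conditioning
        return self.mlp(meta)

def alpha_masked_l1(pred: torch.Tensor, target: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """L1 reconstruction loss restricted to valid regions given by an alpha mask in [0, 1]."""
    diff = (pred - target).abs() * alpha            # alpha: (B, 1, H, W)
    return diff.sum() / alpha.sum().clamp(min=1.0)
```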