Collaborating Authors

 Chen, Ming


ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis

arXiv.org Artificial Intelligence

Audio-driven human gesture synthesis is a crucial task with broad applications in virtual avatars, human-computer interaction, and creative content generation. Despite notable progress, existing methods often produce gestures that are coarse, lack expressiveness, and fail to fully align with audio semantics. To address these challenges, we propose ExGes, a novel retrieval-enhanced diffusion framework with three key designs: (1) Motion Base Construction, which builds a gesture library from the training dataset; (2) a Motion Retrieval Module, which employs contrastive learning and momentum distillation to retrieve fine-grained reference poses; and (3) a Precision Control Module, which integrates partial masking and stochastic masking to enable flexible and fine-grained control. Experimental evaluations on BEAT2 demonstrate that ExGes reduces Fréchet Gesture Distance by 6.2% and improves motion diversity by 5.3% over EMAGE, with user studies revealing a 71.3% preference for its naturalness and semantic relevance. Code will be released upon acceptance.


FishBargain: An LLM-Empowered Bargaining Agent for Online Fleamarket Platform Sellers

arXiv.org Artificial Intelligence

Unlike traditional Business-to-Consumer e-commerce platforms (e.g., Amazon), online fleamarket platforms (e.g., Craigslist) mainly serve individual sellers who lack time and business proficiency. Individual sellers often struggle with the bargaining process, and deals consequently fall through. Recent advancements in Large Language Models (LLMs) demonstrate huge potential in various dialogue tasks, but those tasks mainly take the form of passively following a user's instructions. Bargaining, as a proactive dialogue task, is a distinct art of dialogue that must account for a dynamic environment and uncertain adversary strategies. In this paper, we propose FishBargain, an LLM-empowered bargaining agent designed for online fleamarket platform sellers. Specifically, FishBargain understands the chat context and product information, chooses both an action and a language skill while considering possible adversary actions, and generates utterances. FishBargain has been tested by thousands of individual sellers on Xianyu, one of the largest online fleamarket platforms in China. Both qualitative and quantitative experiments demonstrate that FishBargain can effectively help sellers make more deals.


Revolutionizing Battery Disassembly: The Design and Implementation of a Battery Disassembly Autonomous Mobile Manipulator Robot (BEAM-1)

arXiv.org Artificial Intelligence

The efficient disassembly of end-of-life electric vehicle batteries (EOL-EVBs) is crucial for green manufacturing and sustainable development. The pre-programmed disassembly currently performed by Autonomous Mobile Manipulator Robots (AMMRs) struggles to meet disassembly requirements in dynamic environments, complex scenarios, and unstructured processes. In this paper, we propose a Battery Disassembly AMMR (BEAM-1) system based on NeuroSymbolic AI. It detects the environmental state by combining multiple sensors with neural predicates and then translates this information into a quasi-symbolic space. In real time, it identifies the optimal sequence of action primitives through LLM-heuristic tree search, ensuring high-precision execution of these primitives. Additionally, it employs positional speculative sampling using intuitive networks and disassembles various bolt types with a meticulously designed end-effector. Importantly, BEAM-1 is a continuously learning embodied intelligence system capable of human-like subjective reasoning and intuition. Extensive real-world experiments show that it can autonomously perceive, decide, and execute to complete the continuous disassembly of multiple bolts across multiple categories and in complex situations, with a success rate of 98.78%. This research attempts to use NeuroSymbolic AI to give robots real autonomous reasoning, planning, and learning capabilities. BEAM-1 realizes the revolution of battery disassembly. Its framework can be easily ported to any robotic system to realize different application scenarios, which provides a ground-breaking idea for the design and implementation of future embodied intelligent robotic systems.


TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models

arXiv.org Artificial Intelligence

Recent unprecedented advancements in Large Language Models (LLMs) have propelled the medical community by establishing advanced medical-domain models. However, due to the limited collection of medical datasets, there are only a few comprehensive benchmarks available to gauge progress in this area. In this paper, we introduce a new medical question-answering (QA) dataset, called TCMD, that contains extensive manual instruction for solving Traditional Chinese Medicine (TCM) examination tasks. Specifically, TCMD collects a large number of questions across diverse domains together with their annotated medical subjects, and thus supports a comprehensive assessment of the capability of LLMs in the TCM domain. We conduct an extensive evaluation of various general LLMs and medical-domain-specific LLMs. Moreover, we analyze the robustness of current LLMs in solving TCM QA tasks by introducing randomness. The inconsistency of the experimental results also reveals the shortcomings of current LLMs in solving QA tasks. We expect that our dataset can further facilitate the development of LLMs in the TCM area.


Towards Agile Robots: Intuitive Robot Position Speculation with Neural Networks

arXiv.org Artificial Intelligence

Robot position speculation, which determines where the chassis should move, is a key step in controlling mobile manipulators. The target position must ensure the feasibility of chassis movement and manipulability, which traditional methods guarantee through randomized sampling and kinematic checking. Addressing the demands of agile robotics, this paper proposes a robot position speculation network (RPSN), a learning-based approach to enhance the agility of mobile manipulators. The RPSN incorporates a differentiable inverse kinematic algorithm and a neural network. Through end-to-end training, the RPSN can speculate positions with a high success rate. We apply the RPSN to mobile manipulators disassembling end-of-life electric vehicle batteries (EOL-EVBs). Extensive experiments on various simulated environments and physical mobile manipulators demonstrate that the probability of the initial position provided by RPSN being the ideal position is 96.67%. From the kinematic constraint perspective, it achieves 100% generation of the ideal position within 1.28 attempts on average, far fewer than the 31.04 attempts required by random sampling. Moreover, the proposed method demonstrates superior data efficiency over pure neural network approaches. The proposed RPSN enables the robot to quickly infer feasible target positions by intuition. This work moves towards building agile robots that can act swiftly like humans.
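
To make the traditional baseline concrete, the following is a minimal sketch of randomized sampling with a kinematic check, the approach the abstract contrasts RPSN against. Everything in it is an illustrative assumption rather than the paper's setup: the planar two-link arm, the link lengths `l1` and `l2`, the square sampling region, and the function names `is_reachable` and `sample_until_feasible` are all hypothetical.

```python
import math
import random


def is_reachable(base_xy, target_xy, l1=0.4, l2=0.3):
    """Kinematic check for a hypothetical planar two-link arm.

    The target is reachable when its distance from the arm base lies in
    the annulus [|l1 - l2|, l1 + l2].
    """
    d = math.dist(base_xy, target_xy)
    return abs(l1 - l2) <= d <= l1 + l2


def sample_until_feasible(target_xy, half_width=2.0, rng=None, max_attempts=10_000):
    """Draw chassis positions uniformly from a square around the target
    until one passes the kinematic check. Returns (position, attempts)."""
    rng = rng or random.Random()
    for attempt in range(1, max_attempts + 1):
        base = (
            target_xy[0] + rng.uniform(-half_width, half_width),
            target_xy[1] + rng.uniform(-half_width, half_width),
        )
        if is_reachable(base, target_xy):
            return base, attempt
    raise RuntimeError("no feasible chassis position found")


if __name__ == "__main__":
    rng = random.Random(0)
    attempts = [sample_until_feasible((1.0, 1.0), rng=rng)[1] for _ in range(1000)]
    print("mean attempts per feasible position:", sum(attempts) / len(attempts))
```

In this toy setting the number of attempts follows a geometric distribution, so the mean attempt count is the reciprocal of the per-sample success probability. A learned proposal such as RPSN aims to make the first proposal feasible most of the time instead.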


Unbiasing Enhanced Sampling on a High-dimensional Free Energy Surface with Deep Generative Model

arXiv.org Artificial Intelligence

Biased enhanced sampling methods utilizing collective variables (CVs) are powerful tools for sampling conformational ensembles. Due to high intrinsic dimensions, efficiently generating conformational ensembles for complex systems requires enhanced sampling on high-dimensional free energy surfaces. While methods like temperature-accelerated molecular dynamics (TAMD) can adopt many CVs in a simulation, unbiasing the simulation requires accurate modeling of a high-dimensional CV probability distribution, which is challenging for traditional density estimation techniques. Here we propose an unbiasing method based on the score-based diffusion model, a deep generative learning method that excels in density estimation across complex data landscapes. We test the score-based diffusion unbiasing method on TAMD simulations. The results demonstrate that this unbiasing approach significantly outperforms traditional unbiasing methods and can generate accurate unbiased conformational ensembles for simulations that use more CVs than is usual.


Backdiff: a diffusion model for generalized transferable protein backmapping

arXiv.org Artificial Intelligence

Coarse-grained (CG) models play a crucial role in the study of protein structures, protein thermodynamic properties, and protein conformation dynamics. Due to the information loss in the coarse-graining process, backmapping from CG to all-atom configurations is essential in many protein design and drug discovery applications when detailed atomic representations are needed for in-depth studies. Despite recent progress in data-driven backmapping approaches, devising a backmapping method that can be universally applied across various CG models and proteins remains unresolved. In this work, we propose BackDiff, a new generative model designed to achieve generalization and reliability in the protein backmapping problem. BackDiff leverages the conditional score-based diffusion model with geometric representations. Since different CG models can contain different coarse-grained sites which include selected atoms (CG atoms) and simple CG auxiliary functions of atomistic coordinates (CG auxiliary variables), we design a self-supervised training framework to adapt to different CG atoms, and constrain the diffusion sampling paths with arbitrary CG auxiliary variables as conditions. Our method facilitates end-to-end training and allows efficient sampling across different proteins and diverse CG models without the need for retraining. Comprehensive experiments over multiple popular CG models demonstrate BackDiff's superior performance to existing state-of-the-art approaches, and generalization and flexibility that these approaches cannot achieve. A pretrained BackDiff model can offer a convenient yet reliable plug-and-play solution for protein researchers, enabling them to investigate further from their own CG models.


Learning Decoupled Retrieval Representation for Nearest Neighbour Neural Machine Translation

arXiv.org Artificial Intelligence

K-Nearest Neighbor Neural Machine Translation (kNN-MT) successfully incorporates an external corpus by retrieving word-level representations at test time. Generally, kNN-MT borrows an off-the-shelf context representation from the translation task, e.g., the output of the last decoder layer, as the query vector of the retrieval task. In this work, we highlight that coupling the representations of these two tasks is sub-optimal for fine-grained retrieval. To alleviate this, we leverage supervised contrastive learning to learn a distinctive retrieval representation derived from the original context representation. We also propose a fast and effective approach to constructing hard negative samples. Experimental results on five domains show that our approach improves retrieval accuracy and BLEU score compared to vanilla kNN-MT.
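
As a rough illustration of the retrieval step that vanilla kNN-MT performs, the sketch below looks up the nearest datastore keys for a query vector and turns them into a token distribution, which is then interpolated with the translation model's own distribution. It is a toy under stated assumptions: the names `knn_distribution` and `interpolate`, the squared-L2 distance, the `temperature`, and the mixing weight `lam` are illustrative choices, and the learned retrieval representation proposed in the abstract is not modeled here.

```python
import math
from collections import defaultdict


def knn_distribution(query, datastore, k=2, temperature=1.0):
    """Return a token distribution from the k nearest datastore entries.

    datastore is a list of (key_vector, target_token) pairs; closer keys
    receive exponentially larger weight (softmax over negative distance).
    """
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    )[:k]
    weights = [math.exp(-dist / temperature) for dist, _ in scored]
    total = sum(weights)
    probs = defaultdict(float)
    for weight, (_, token) in zip(weights, scored):
        probs[token] += weight / total
    return dict(probs)


def interpolate(p_mt, p_knn, lam=0.5):
    """Mix the model distribution with the retrieval distribution."""
    tokens = set(p_mt) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_mt.get(t, 0.0) for t in tokens}


if __name__ == "__main__":
    store = [((0.0, 0.0), "cat"), ((0.1, 0.0), "cat"), ((5.0, 5.0), "dog")]
    p_knn = knn_distribution((0.0, 0.0), store, k=2)
    print(interpolate({"dog": 0.6, "cat": 0.4}, p_knn, lam=0.5))
```

Because the query here is whatever vector the translation model happens to produce, retrieval quality depends entirely on how well that space separates target tokens, which is the coupling the abstract argues against.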


Advancing Ad Auction Realism: Practical Insights & Modeling Implications

arXiv.org Artificial Intelligence

This paper proposes a learning model of online ad auctions that allows for the following four key realistic characteristics of contemporary online auctions: (1) ad slots can have different values and click-through rates depending on users' search queries, (2) the number and identity of competing advertisers are unobserved and change with each auction, (3) advertisers only receive partial, aggregated feedback, and (4) payment rules are only partially specified. We model advertisers as agents governed by an adversarial bandit algorithm, independent of auction mechanism intricacies. Our objective is to simulate the behavior of advertisers for counterfactual analysis, prediction, and inference purposes. Our findings reveal that, in such richer environments, "soft floors" can enhance key performance metrics even when bidders are drawn from the same population. We further demonstrate how to infer advertiser value distributions from observed bids, thereby affirming the practical efficacy of our approach even in a more realistic auction setting.
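
The abstract does not name the bandit algorithm, so the following is only a sketch of what an advertiser driven by an adversarial bandit could look like, using Exp3 as an assumed stand-in. The `Exp3Bidder` class, the discrete bid grid, the first-price payment rule, and the single uniformly bidding competitor in `simulate` are hypothetical simplifications, not the paper's model.

```python
import math
import random


class Exp3Bidder:
    """Exp3 learner choosing among a fixed grid of bids (rewards in [0, 1])."""

    def __init__(self, bids, gamma=0.1, rng=None):
        self.bids = list(bids)
        self.gamma = gamma
        self.weights = [1.0] * len(self.bids)
        self.rng = rng or random.Random()

    def probabilities(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        k = len(self.bids)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def choose(self):
        probs = self.probabilities()
        arm = self.rng.choices(range(len(self.bids)), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # Importance-weighted reward estimate for the arm that was played.
        estimate = reward / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / len(self.bids))
        # Rescale so weights stay bounded; probabilities are unchanged.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


def simulate(rounds=5000, value=1.0, seed=0):
    """Toy first-price auction against one uniformly random competitor."""
    rng = random.Random(seed)
    bidder = Exp3Bidder([0.0, 0.25, 0.5, 0.75, 1.0], rng=rng)
    for _ in range(rounds):
        arm, prob = bidder.choose()
        bid = bidder.bids[arm]
        # The bidder only observes its own payoff, not the competing bid.
        reward = value - bid if bid > rng.random() else 0.0
        bidder.update(arm, prob, reward)
    return bidder


if __name__ == "__main__":
    learner = simulate()
    for bid, p in zip(learner.bids, learner.probabilities()):
        print(f"bid {bid:.2f}: probability {p:.3f}")
```

The learner needs only its own realized payoff in each round, which matches the abstract's point that the agents operate independently of auction mechanism details and under partial feedback.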


VideoXum: Cross-modal Visual and Textural Summarization of Videos

arXiv.org Artificial Intelligence

Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate both a shortened video clip and the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated video clip and text narrative should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in our reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model -- VTSUM-BILP to address the challenges of our proposed task. Moreover, we propose a new metric called VT-CLIPScore to help evaluate the semantic consistency of cross-modal summaries. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.