
Collaborating Authors

 He, Qian


Phantom: Subject-consistent video generation via cross-modal alignment

arXiv.org Artificial Intelligence

Foundation models for video generation are steadily maturing into a variety of applications, yet subject-consistent video generation remains at an exploratory stage. We refer to this task as Subject-to-Video: it extracts subject elements from reference images and generates subject-consistent video following textual instructions. We believe the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby aligning textual and visual content deeply and simultaneously. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment from text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages. The project homepage is at https://phantom-video.github.io/Phantom/.
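
As a rough illustration of the joint text-image injection idea described in the abstract, the sketch below concatenates projected text tokens and reference-image tokens into the context of a cross-attention layer inside a video backbone. The module, its dimensions, and all names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not Phantom's code): joint text-image injection via cross-attention
# over concatenated text and reference-image tokens. Dimensions are arbitrary examples.
import torch
import torch.nn as nn

class JointTextImageCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, text_dim: int = 768, img_dim: int = 1024, heads: int = 8):
        super().__init__()
        self.to_text = nn.Linear(text_dim, dim)  # project prompt tokens to the backbone width
        self.to_img = nn.Linear(img_dim, dim)    # project subject-reference tokens likewise
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, text_tokens, image_tokens):
        # video_tokens: (B, N_v, dim) latent video tokens from the generation backbone
        # text_tokens:  (B, N_t, text_dim) prompt embeddings
        # image_tokens: (B, N_i, img_dim) reference-image embeddings
        context = torch.cat([self.to_text(text_tokens), self.to_img(image_tokens)], dim=1)
        out, _ = self.attn(video_tokens, context, context)
        return video_tokens + out  # residual injection of the joint condition

# toy usage
block = JointTextImageCrossAttention()
v, t, i = torch.randn(2, 16, 512), torch.randn(2, 77, 768), torch.randn(2, 32, 1024)
print(block(v, t, i).shape)  # torch.Size([2, 16, 512])
```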


GLAM: Global-Local Variation Awareness in Mamba-based World Model

arXiv.org Artificial Intelligence

Mimicking the real interaction trajectory in the inference of the world model has been shown to improve the sample efficiency of model-based reinforcement learning (MBRL) algorithms. Many methods directly use known state sequences for reasoning; however, this approach fails to improve reasoning quality because it does not capture the subtle variation between states. Much like how humans infer trends in event development from such variation, in this work we introduce the Global-Local variation Awareness Mamba-based world model (GLAM), which improves reasoning quality by perceiving and predicting variation between states. GLAM comprises two Mamba-based parallel reasoning modules, GMamba and LMamba, which focus on perceiving variation from global and local perspectives, respectively, during the reasoning process. GMamba identifies patterns of variation between states in the input sequence and leverages these patterns to enhance the prediction of future state variation. LMamba emphasizes reasoning about unknown information, such as rewards, termination signals, and visual representations, by perceiving variation in adjacent states. By integrating the strengths of the two modules, GLAM accounts for higher-value variation in environmental changes, providing the agent with more efficient imagination-based training. We demonstrate that our method outperforms existing methods in normalized human scores on the Atari 100k benchmark.
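
A loose structural sketch of the two-branch, variation-aware idea follows. Note the stand-in: the paper's GMamba/LMamba blocks are Mamba state-space models, while here a plain GRU plays the role of a generic sequence module so the example runs without extra packages; the delta computation and the two parallel branches are the point being illustrated.

```python
# Illustrative sketch only: two parallel sequence branches, one over raw states ("global")
# and one over adjacent-state variation ("local"), predicting the next-state variation.
import torch
import torch.nn as nn

class VariationAwareWorldModel(nn.Module):
    def __init__(self, state_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.global_branch = nn.GRU(state_dim, hidden, batch_first=True)  # stands in for GMamba
        self.local_branch = nn.GRU(state_dim, hidden, batch_first=True)   # stands in for LMamba
        self.head = nn.Linear(2 * hidden, state_dim)                      # predicts state variation

    def forward(self, states):                              # states: (B, T, state_dim)
        deltas = states[:, 1:] - states[:, :-1]             # variation between consecutive states
        g, _ = self.global_branch(states[:, :-1])           # global view over the state sequence
        l, _ = self.local_branch(deltas)                    # local view over adjacent variation
        pred_delta = self.head(torch.cat([g, l], dim=-1))   # fused prediction of future variation
        return states[:, :-1] + pred_delta                  # predicted next states

model = VariationAwareWorldModel()
print(model(torch.randn(4, 10, 64)).shape)  # torch.Size([4, 9, 64])
```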


I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength

arXiv.org Artificial Intelligence

Video generation technologies are developing rapidly and have broad potential applications. Among these technologies, camera control is crucial for generating professional-quality videos that accurately meet user expectations. However, existing camera control methods still suffer from several limitations, including limited control precision and the neglect of subject motion dynamics. In this work, we propose I2VControl-Camera, a novel camera control method that significantly enhances controllability while providing adjustability over the strength of subject motion. To improve control precision, we employ point trajectories in the camera coordinate system, instead of only extrinsic matrix information, as our control signal. To accurately control and adjust the strength of subject motion, we explicitly model the higher-order components of the video trajectory expansion, not merely the linear terms, and design an operator that effectively represents the motion strength. We use an adapter architecture that is independent of the base model structure. Experiments on static and dynamic scenes show that our framework outperforms previous methods both quantitatively and qualitatively. The project page is: https://wanquanf.github.io/I2VControlCamera.
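
To make the "higher-order components versus linear terms" idea concrete, here is a hedged sketch of one possible motion-strength measure: fit a first-order (linear-in-time) term to each point trajectory and take the norm of the residual as the subject-motion signal. The exact operator, array shapes, and function name are illustrative assumptions, not the paper's definition.

```python
# Sketch: strength of subject motion as the residual of a per-point linear fit over time.
import numpy as np

def motion_strength(trajectories: np.ndarray) -> float:
    """trajectories: (N_points, T, 3) point tracks in the camera coordinate system."""
    n, t, _ = trajectories.shape
    A = np.stack([np.ones(t), np.arange(t)], axis=1)      # (T, 2) design matrix for x(t) ~ a + b*t
    Y = trajectories.transpose(1, 0, 2).reshape(t, -1)    # (T, N*3) stacked coordinates
    coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)        # per-point linear (first-order) terms
    linear_part = (A @ coeffs).reshape(t, n, 3).transpose(1, 0, 2)
    residual = trajectories - linear_part                 # higher-order components of the expansion
    return float(np.linalg.norm(residual, axis=-1).mean())

static_scene = np.tile(np.random.rand(50, 1, 3), (1, 16, 1))  # points frozen over 16 frames
print(motion_strength(static_scene))  # ~0: a static subject leaves no higher-order residual
```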


FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models

arXiv.org Artificial Intelligence

Modeling and producing lifelike clothed human images has attracted researchers' attention from different areas for decades, owing to the complexity of highly articulated and structured content. Rendering algorithms decompose and simulate the imaging process of a camera, but are limited by the accuracy of the modeled variables and the efficiency of computation. Generative models can produce impressively vivid human images, but still lack controllability and editability. This paper studies photorealism enhancement of rendered images, leveraging the generative power of diffusion models on the controlled basis of rendering. We introduce a novel framework that translates rendered images into their realistic counterparts and consists of two stages: Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG). In DKI, we adopt positive (real) domain finetuning and negative (rendered) domain embedding to inject knowledge into a pretrained text-to-image (T2I) diffusion model. In RIG, we generate the realistic image corresponding to the input rendered image, with Texture-preserving Attention Control (TAC) to preserve fine-grained clothing textures, exploiting the decoupled features encoded in the UNet structure. Additionally, we introduce the SynFashion dataset, featuring high-quality digital clothing images with diverse textures. Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation.
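
To illustrate how a learned negative (rendered-domain) embedding can steer sampling, here is a minimal sketch using the diffusers image-to-image pipeline. The checkpoint, file paths, and token name are placeholders, and this sketch does not reproduce the paper's Texture-preserving Attention Control; it only shows the negative-embedding usage pattern in a generic pipeline.

```python
# Hedged sketch: using a hypothetical rendered-domain textual-inversion embedding as a
# negative prompt while translating a rendered garment image toward the real domain.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # placeholder base model
).to("cuda")

# Placeholder file holding a learned negative (rendered) domain embedding.
pipe.load_textual_inversion("rendered_domain_embedding.safetensors", token="<rendered>")

rendered = Image.open("rendered_garment.png").convert("RGB")  # placeholder input render
result = pipe(
    prompt="a photorealistic photo of a person wearing the garment",
    negative_prompt="<rendered>",  # push the sample away from the rendered domain
    image=rendered,
    strength=0.4,                  # low strength keeps layout and texture close to the input
).images[0]
result.save("realistic_garment.png")
```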


Transforming Wearable Data into Health Insights using Large Language Model Agents

arXiv.org Artificial Intelligence

Personal health data, often derived from personal devices such as wearables, are distinguished by their multi-dimensional, continuous and longitudinal measurements that capture granular observations of physiology and behavior in situ rather than in a clinical setting. Research studies have highlighted the significant health impacts of physical activity and sleep patterns, emphasizing the potential for wearable-derived data to reveal personalized health insights and promote positive behavior changes [1, 4, 30, 46, 47]. For example, individuals with a device-measured Physical Activity Energy Expenditure (PAEE) that is 5 kJ/kg/day higher had a 37% lower premature mortality risk [47]. Frequent sleep disturbances have been associated with an increased risk of hypertension, diabetes and cardiovascular disease [9, 30]. A large meta-analysis suggests that activity trackers improve physical activity and promote weight loss, with users taking 1800 extra steps per day [16]. Despite these broad benefits, using wearable data to derive intelligent responses and insights to personal health queries is non-trivial. These data are usually collected without clinical supervision, and users often do not have access to the expertise that could aid in data interpretation. For example, a common question of wearable device users is "How can I get better sleep?". Though a seemingly straightforward question, arriving at an ideal response would involve performing a series of complex, independent analytical steps across multiple irregularly sampled time series such as: checking the availability of recent data, deciding on metrics to optimize (e.g.


Never-Ending Behavior-Cloning Agent for Robotic Manipulation

arXiv.org Artificial Intelligence

Relying on multi-modal observations, embodied robots can perform multiple robotic manipulation tasks in unstructured real-world environments. However, most language-conditioned behavior-cloning agents still face long-standing challenges, i.e., 3D scene representation and human-level task learning, when adapting to new sequential tasks in practical scenarios. We investigate these challenges with NBAgent, a pioneering language-conditioned Never-ending Behavior-cloning Agent for embodied robots. It continually learns observation knowledge of novel 3D scene semantics and robot manipulation skills from skill-shared and skill-specific attributes, respectively. Specifically, we propose a skill-shared semantic rendering module and a skill-shared representation distillation module to effectively learn 3D scene semantics from skill-shared attributes, addressing the commonly overlooked problem of 3D scene representation. Meanwhile, we establish a skill-specific evolving planner to perform manipulation knowledge decoupling, which continually embeds novel skill-specific knowledge, in a human-like manner, from a latent low-rank space. Finally, we design a never-ending embodied robot manipulation benchmark, and extensive experiments demonstrate the strong performance of our method. Visual results, code, and dataset are provided at: https://neragent.github.io.
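
As a rough illustration of embedding skill-specific knowledge in a low-rank space while keeping shared weights untouched, the sketch below attaches a small low-rank adapter per new skill to a shared linear layer. The class, its sizes, and the skill names are hypothetical; this is not the NBAgent planner, only the general low-rank, continually extensible pattern the abstract alludes to.

```python
# Sketch: per-skill low-rank adapters on a shared layer, so new skills add parameters
# without overwriting skill-shared weights.
import torch
import torch.nn as nn

class SkillLowRankLinear(nn.Module):
    def __init__(self, in_dim: int = 256, out_dim: int = 256, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(in_dim, out_dim)  # skill-shared weights (kept fixed after pretraining)
        self.skill_down = nn.ModuleDict()         # per-skill low-rank down-projections
        self.skill_up = nn.ModuleDict()           # per-skill low-rank up-projections
        self.rank = rank

    def add_skill(self, name: str):
        self.skill_down[name] = nn.Linear(self.shared.in_features, self.rank, bias=False)
        self.skill_up[name] = nn.Linear(self.rank, self.shared.out_features, bias=False)
        nn.init.zeros_(self.skill_up[name].weight)  # a new skill starts as a zero update

    def forward(self, x, skill: str):
        return self.shared(x) + self.skill_up[skill](self.skill_down[skill](x))

layer = SkillLowRankLinear()
layer.add_skill("open_drawer")   # hypothetical skill names
layer.add_skill("pour_water")
print(layer(torch.randn(4, 256), "pour_water").shape)  # torch.Size([4, 256])
```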


Design Booster: A Text-Guided Diffusion Model for Image Translation with Spatial Layout Preservation

arXiv.org Artificial Intelligence

Diffusion models are able to generate photorealistic images of arbitrary scenes. However, when applying diffusion models to image translation, there is a trade-off between maintaining spatial structure and producing high-quality content. Moreover, existing methods are mainly based on test-time optimization or fine-tuning the model for each input image, which is extremely time-consuming for practical applications. To address these issues, we propose a new approach to flexible image translation that learns a layout-aware image condition together with a text condition. Specifically, our method co-encodes images and text into a new domain during the training phase. In the inference stage, we can choose images, text, or both as the conditions at each time step, giving users more flexible control over layout and content. Experimental comparisons with state-of-the-art methods demonstrate that our model performs best in both style image translation and semantic image translation while requiring the shortest time.
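
The per-step choice of conditions can be pictured as a simple schedule over the denoising steps. The rule below (early, noisy steps use the image condition for layout; later steps rely on text for content) is an illustrative assumption, not the paper's exact policy, and the function name and split ratio are made up for the example.

```python
# Sketch: which conditions (image layout, text) to apply at each denoising step.
from typing import List, Tuple

def condition_schedule(num_steps: int, layout_fraction: float = 0.6) -> List[Tuple[bool, bool]]:
    """Return (use_image, use_text) flags per step; early steps favor the layout condition."""
    schedule = []
    for step in range(num_steps):
        early = step < int(layout_fraction * num_steps)        # early steps fix spatial layout
        schedule.append((True, True) if early else (False, True))  # later steps refine content
    return schedule

print(condition_schedule(10))
# [(True, True), ..., (True, True), (False, True), ..., (False, True)]
```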