AITopics | Xu, Dan

Xu, Dan

Multi-Attribute Multi-Grained Adaptation of Pre-Trained Language Models for Text Understanding from Bayesian Perspective

Zhang, You, Wang, Jin, Yu, Liang-Chih, Xu, Dan, Zhang, Xuejie

arXiv.org Artificial Intelligence, Mar-8-2025

Current neural networks often employ multi-domain-learning or attribute-injecting mechanisms to incorporate non-independent and identically distributed (non-IID) information for text understanding tasks by capturing individual characteristics and the relationships among samples. However, the extent of the impact of non-IID information and how these methods affect pre-trained language models (PLMs) remain unclear. This study revisits the assumption that non-IID information enhances PLMs to achieve performance improvements from a Bayesian perspective, which unearths and integrates non-IID and IID features. Furthermore, we propose a multi-attribute multi-grained framework for PLM adaptations (M2A), which combines multi-attribute and multi-grained views to mitigate uncertainty in a lightweight manner. We evaluate M2A on prevalent text-understanding datasets and demonstrate its superior performance, mainly when data are implicitly non-IID and PLMs scale larger.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.06085

Country: Asia (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.30)

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

Mi, Zhenxing, Wang, Kuan-Chieh, Qian, Guocheng, Ye, Hanrong, Liu, Runtao, Tulyakov, Sergey, Aberman, Kfir, Xu, Dan

arXiv.org Artificial Intelligence, Feb-12-2025

This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.

decoder, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2502.10458

Country: Asia (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Human-Centric Foundation Models: Perception, Generation and Agentic Modeling

Tang, Shixiang, Wang, Yizhou, Chen, Lu, Wang, Yuan, Peng, Sida, Xu, Dan, Ouyang, Wanli

arXiv.org Artificial Intelligence, Feb-12-2025

In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models that capture fine-grained features for multi-modal 2D and 3D understanding; (2) Human-centric AIGC Foundation Models that generate high-fidelity, diverse human-related content; (3) Unified Perception and Generation Models that integrate these capabilities to enhance both human understanding and synthesis; and (4) Human-centric Agentic Foundation Models that extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied …

… community appeals for a unified framework [Ci et al., 2023; Wang et al., 2023; Chen et al., 2024; Huang et al., 2024a] to unlock systematic understanding and a wide range of human-centric applications for everybody. Inspired by rapid advancements of general foundation models, e.g., large language models (LLMs), large vision models (LVMs) and text-to-image generative models, and their presentation of a paradigm shift from end-to-end learning of task-specific models to generalist models, a recent trend is to develop Human-centric Foundation Models (HcFM) that satisfy three criteria, namely generalization, broad applicability, and high fidelity. Generalization ensures robustness to unseen conditions, enabling the model to perform consistently across varied environments.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.08556

Country: Asia (0.28)

Genre:

Research Report (1.00)
Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

MM-Ego: Towards Building Egocentric Multimodal LLMs

Ye, Hanrong, Zhang, Haotian, Daxberger, Erik, Chen, Lin, Lin, Zongyu, Li, Yanghao, Zhang, Bowen, You, Haoxuan, Xu, Dan, Gan, Zhe, Lu, Jiasen, Yang, Yinfei

arXiv.org Artificial Intelligence, Oct-9-2024

This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. This is currently the largest egocentric QA dataset. Second, we contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths. We introduce a new de-biasing evaluation method to help mitigate the unavoidable language bias present in the models being evaluated. Third, we propose a specialized multimodal architecture featuring a novel "Memory Pointer Prompting" mechanism. This design includes a global glimpse step to gain an overarching understanding of the entire video and identify key visual information, followed by a fallback step that utilizes the key visual information to generate responses. This enables the model to more effectively comprehend extended video content. With the data, benchmark, and model, we successfully build MM-Ego, an egocentric multimodal LLM that shows powerful performance on egocentric video understanding.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2410.07177

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling

Kim, Jaehyeok, Wee, Dongyoon, Xu, Dan

arXiv.org Artificial Intelligence, Jul-18-2024

This paper introduces Motion-oriented Compositional Neural Radiance Fields (MoCo-NeRF), a framework designed to perform free-viewpoint rendering of monocular human videos via a novel non-rigid motion modeling approach. In the context of dynamic clothed humans, complex cloth dynamics generate non-rigid motions that are intrinsically distinct from skeletal articulations and critically important for the rendering quality. The conventional approach models non-rigid motions as spatial (3D) deviations in addition to skeletal transformations. However, it is either time-consuming or challenging to achieve optimal quality due to its high learning complexity without direct supervision. To target this problem, we propose a novel approach of modeling non-rigid motions as radiance residual fields to benefit from more direct color supervision in the rendering and utilize the rigid radiance fields as a prior to reduce the complexity of the learning process. Our approach utilizes a single multiresolution hash encoding (MHE) to concurrently learn the canonical T-pose representation from rigid skeletal motions and the radiance residual field for non-rigid motions. Additionally, to further improve both training efficiency and usability, we extend MoCo-NeRF to support simultaneous training of multiple subjects within a single framework, thanks to our effective design for modeling non-rigid motions. This scalability is achieved through the integration of a global MHE and learnable identity codes in addition to multiple local MHEs. We present extensive results on ZJU-MoCap and MonoCap, clearly demonstrating state-of-the-art performance in both single- and multi-subject settings. The code and model will be made publicly available at the project page: https://stevejaehyeok.github.io/publications/moco-nerf.

artificial intelligence, non-rigid motion, radiance field, (16 more...)

arXiv.org Artificial Intelligence

2407.11962

Country: Asia (0.47)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Learning Online Scale Transformation for Talking Head Video Generation

Hong, Fa-Ting, Xu, Dan

arXiv.org Artificial Intelligence, Jul-13-2024

One-shot talking head video generation uses a source image and driving video to create a synthetic video where the source person's facial movements imitate those of the driving video. However, differences in scale between the source and driving images remain a challenge for face reenactment. Existing methods attempt to locate a frame in the driving video that aligns best with the source image, but imprecise alignment can result in suboptimal outcomes. To this end, we introduce a scale transformation module that can automatically adjust the scale of the driving image to fit that of the source image, by using the information of scale difference maintained in the detected keypoints of the source image and the driving frame. Furthermore, to keep perceiving the scale information of faces during the generation process, we incorporate the scale information learned from the scale transformation module into each layer of the generation process to produce a final result with an accurate scale. Our method can perform accurate motion transfer between the two images without any anchor frame, achieved through the contributions of the proposed online scale transformation facial reenactment network. Extensive experiments have demonstrated that our proposed method adjusts the scale of the driving face automatically according to the source face, and generates high-quality faces with an accurate scale in the cross-identity facial reenactment.
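To make the idea of reading a scale difference from detected keypoints concrete, here is a minimal geometric sketch. It is not the learned scale transformation module from the paper: it assumes 2D keypoints given as `(x, y)` tuples and uses a simple hand-written heuristic (the ratio of keypoint spreads around their centroids). The function names `keypoint_spread` and `rescale_driving_keypoints` are hypothetical.

```python
import math


def keypoint_spread(keypoints):
    """Root-mean-square distance of 2D keypoints from their centroid."""
    n = len(keypoints)
    cx = sum(x for x, _ in keypoints) / n
    cy = sum(y for _, y in keypoints) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in keypoints) / n)


def rescale_driving_keypoints(source_kp, driving_kp):
    """Scale driving keypoints about their own centroid to match the source spread.

    Returns the estimated scale factor and the rescaled driving keypoints.
    This is an illustrative heuristic, not a learned module.
    """
    s = keypoint_spread(source_kp) / keypoint_spread(driving_kp)
    n = len(driving_kp)
    cx = sum(x for x, _ in driving_kp) / n
    cy = sum(y for _, y in driving_kp) / n
    rescaled = [(cx + s * (x - cx), cy + s * (y - cy)) for x, y in driving_kp]
    return s, rescaled


# Usage: a driving face whose keypoints span twice the extent of the source face.
source = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
driving = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
scale, adjusted = rescale_driving_keypoints(source, driving)
```

After rescaling, the adjusted driving keypoints have the same spread as the source keypoints, which is the property a scale-alignment step aims for before motion transfer.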

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2407.09965

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision > Face Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.72)

Sample-efficient Imitative Multi-token Decision Transformer for Generalizable Real World Driving

Zhou, Hang, Xu, Dan, Ji, Yiding

arXiv.org Artificial Intelligence, Jun-18-2024

The realm of autonomous driving research has witnessed remarkable progress, with simulation technologies [1][2][3][4] reaching unprecedented levels of realism and the burgeoning availability of real-world driving datasets [5][6][7][8]. Despite these advancements, data-driven planning continues to confront a formidable obstacle: the infinite state space and extensive data distribution characteristic of real-world driving. Imitation learning approaches encounter hurdles [9][10] when presented with scenarios that deviate from the training distribution, exemplified by rare events like emergency braking for unforeseen obstacles. Similarly, these methods grapple with long-tail distribution phenomena, such as navigating through unexpected weather conditions or handling the erratic movements of a jaywalking pedestrian. On the other hand, reinforcement learning (RL) strategies aim to cultivate policies through reward-based learning. However, RL struggles with the sim-to-real gap and with sampling efficiency [11].

artificial intelligence, machine learning, reinforcement learning, (11 more...)

arXiv.org Artificial Intelligence

2407.02508

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.82)

Industry: Transportation > Ground > Road (0.89)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

X-VILA: Cross-Modality Alignment for Large Language Model

Ye, Hanrong, Huang, De-An, Lu, Yao, Yu, Zhiding, Ping, Wei, Tao, Andrew, Kautz, Jan, Han, Song, Xu, Dan, Molchanov, Pavlo, Yin, Hongxu

arXiv.org Artificial Intelligence, May-29-2024

We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, that exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2405.19335

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Enhance Planning with Physics-informed Safety Controller for End-to-end Autonomous Driving

Zhou, Hang, Liu, Haichao, Lu, Hongliang, Xu, Dan, Ma, Jun, Ji, Yiding

arXiv.org Artificial Intelligence, May-5-2024

Recent years have seen growing research interest in applying Deep Neural Networks (DNNs) to autonomous vehicle technology. The trend started with perception and prediction a few years ago and is gradually extending to motion planning tasks. Although network performance improves over time, DNN planners inherit the natural drawbacks of deep learning. Learning-based planners have limitations in achieving perfect accuracy on the training dataset, and network performance can be affected by out-of-distribution problems. In this paper, we propose FusionAssurance, a novel trajectory-based end-to-end driving fusion framework that combines physics-informed control for safety assurance. By incorporating a Potential Field into Model Predictive Control, FusionAssurance is capable of navigating through scenarios that are not included in the training dataset and scenarios where neural networks fail to generalize. The effectiveness of the approach is demonstrated by extensive experiments under various scenarios on the CARLA benchmark.
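As a rough illustration of how a potential field can enter a planning cost, the sketch below uses the classic textbook artificial potential field (a quadratic attraction to the goal plus a repulsion that is active only within an influence radius of each obstacle). It is not FusionAssurance's formulation: the function names, gains, and the 2D point representation are assumptions made for this example, and no actual Model Predictive Control solver is included. A planner could, under these assumptions, add such a term to its per-step cost.

```python
import math


def repulsive_potential(pos, obstacle, influence_radius=5.0, gain=1.0):
    """Classic repulsive potential: zero outside the influence radius,
    growing rapidly as the position approaches the obstacle."""
    d = math.dist(pos, obstacle)
    if d >= influence_radius:
        return 0.0
    d = max(d, 1e-6)  # avoid division by zero when exactly on the obstacle
    return 0.5 * gain * (1.0 / d - 1.0 / influence_radius) ** 2


def trajectory_cost(trajectory, goal, obstacles, goal_gain=1.0):
    """Sum of attractive (goal) and repulsive (obstacle) potentials
    over every waypoint of a candidate trajectory."""
    cost = 0.0
    for p in trajectory:
        cost += 0.5 * goal_gain * math.dist(p, goal) ** 2
        cost += sum(repulsive_potential(p, o) for o in obstacles)
    return cost


# Usage: the same single-waypoint trajectory, scored with and without a nearby obstacle.
waypoints = [(1.0, 0.0)]
goal = (1.0, 0.0)
cost_clear = trajectory_cost(waypoints, goal, obstacles=[])
cost_blocked = trajectory_cost(waypoints, goal, obstacles=[(0.0, 0.0)])
```

Because the repulsive term is added on top of the goal term, a trajectory that passes close to an obstacle is penalized relative to an otherwise identical trajectory in free space.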

artificial intelligence, controller, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2405.00316

Country: Asia > China (0.30)

Genre: Research Report (0.40)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Personalized LoRA for Human-Centered Text Understanding

Zhang, You, Wang, Jin, Yu, Liang-Chih, Xu, Dan, Zhang, Xuejie

arXiv.org Artificial Intelligence, Mar-10-2024

Effectively and efficiently adapting a pre-trained language model (PLM) for human-centered text understanding (HCTU) is challenging since user tokens are million-level in most personalized applications and do not have concrete explicit semantics. A standard and parameter-efficient approach (e.g., LoRA) necessitates memorizing numerous suites of adapters for each user. In this work, we introduce a personalized LoRA (PLoRA) with a plug-and-play (PnP) framework for the HCTU task. PLoRA is effective, parameter-efficient, and can be dynamically deployed in PLMs. Moreover, personalized dropout and mutual information maximizing strategies are adopted, and hence the proposed PLoRA can be well adapted to few/zero-shot learning scenarios for the cold-start issue. Experiments conducted on four benchmark datasets show that the proposed method outperforms existing methods in full/few/zero-shot learning scenarios for the HCTU task, even though it has fewer trainable parameters. For reproducibility, the code for this paper is available at: https://github.com/yoyo-yun/PLoRA.
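For readers unfamiliar with the baseline the abstract refers to, the following is a minimal pure-Python sketch of a standard LoRA layer: a frozen weight matrix plus a trainable low-rank update scaled by `alpha / rank`. It does not implement PLoRA's personalization, dropout, or mutual information components; the class name `LoRALinear` and the helper `matvec` are hypothetical, and real implementations would use a tensor library rather than nested lists.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pre-trained weight, shape (d_out, d_in)
        # A starts small and random, B starts at zero,
        # so the low-rank update contributes nothing before training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # W x
        update = matvec(self.B, matvec(self.A, x))  # B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Only A and B are trained; W stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Usage: a 4x4 identity weight adapted with a rank-2 update.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(W, rank=2)
x = [1.0, 2.0, 3.0, 4.0]
y = layer.forward(x)
```

With a separate pair of `A` and `B` matrices stored per user, the number of adapters grows with the user count, which is the storage cost the abstract describes for the standard approach.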

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2403.06208

Country: Asia (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
