AITopics | Lin, Tianwei

HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

Lin, Tianwei, Zhang, Wenqiao, Li, Sijing, Yuan, Yuqian, Yu, Binhe, Li, Haoyuan, He, Wanggui, Jiang, Hao, Li, Mengze, Song, Xiaohui, Tang, Siliang, Xiao, Jun, Lin, Hui, Zhuang, Yueting, Ooi, Beng Chin

arXiv.org Artificial Intelligence, Feb-17-2025

Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To train HealthGPT effectively, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.
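
For readers unfamiliar with low-rank adaptation, the following is a minimal sketch of a plain LoRA forward pass, the general technique that H-LoRA builds on. The abstract does not describe the heterogeneous variant's internals, so nothing here should be read as the authors' method; the function names, shapes, and the `alpha / rank` scaling convention are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, a, b, alpha, rank):
    """Compute y = x W + (alpha / rank) * (x A) B.

    x: (n, d_in) inputs
    w: (d_in, d_out) frozen pre-trained weight
    a: (d_in, rank) and b: (rank, d_out) trainable low-rank factors
    """
    base = matmul(x, w)                # output of the frozen layer
    delta = matmul(matmul(x, a), b)    # low-rank correction
    scale = alpha / rank
    return [[p + scale * q for p, q in zip(base_row, delta_row)]
            for base_row, delta_row in zip(base, delta)]


x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]   # identity stands in for a pre-trained weight
a = [[1.0], [1.0]]             # rank-1 down-projection

# With B initialised to zero the adapter leaves the frozen layer unchanged.
y_init = lora_forward(x, w, a, [[0.0, 0.0]], alpha=2.0, rank=1)

# After B has been updated, the adapter shifts the output.
y_tuned = lora_forward(x, w, a, [[1.0, 0.0]], alpha=2.0, rank=1)
```

Only `a` and `b` would be trained in such a setup, which is what keeps the number of adapted parameters small relative to `w`.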

large language model, machine learning, natural language

2502.09838

Country: Asia (0.28)

Genre: Research Report > New Finding (0.88)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Nuclear Medicine (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Boosting Private Domain Understanding of Efficient MLLMs: A Tuning-free, Adaptive, Universal Prompt Optimization Framework

Liu, Jiang, Li, Bolin, Li, Haoyuan, Lin, Tianwei, Zhang, Wenqiao, Zhong, Tao, Yu, Zhelun, Wei, Jinghao, Cheng, Hao, Jiang, Hao, Lv, Zheqi, Li, Juncheng, Tang, Siliang, Zhuang, Yueting

arXiv.org Artificial Intelligence, Dec-27-2024

Efficient multimodal large language models (EMLLMs), in contrast to multimodal large language models (MLLMs), reduce model size and computational costs and are often deployed on resource-constrained devices. However, due to data privacy concerns, existing open-source EMLLMs rarely have access to private domain-specific data during the pre-training process, making them difficult to apply directly in device-specific domains, such as certain business scenarios. To address this weakness, this paper focuses on the efficient adaptation of EMLLMs to private domains, specifically in two areas: 1) how to reduce data requirements, and 2) how to avoid parameter fine-tuning. Specifically, we propose a tunIng-free, aDaptivE, universAL Prompt Optimization Framework, abbreviated as IDEALPrompt, which consists of two stages: 1) Predefined Prompt, which, based on a reinforcement searching strategy, generates a prompt optimization strategy tree to acquire optimization priors; 2) Prompt Reflection, which initializes the prompt based on the optimization priors, followed by self-reflection to further search and refine the prompt. By doing so, IDEALPrompt elegantly generates the "ideal prompts" for processing private domain-specific data. Note that our method requires no parameter fine-tuning and only a small amount of data to quickly adapt to the data distribution of private data. Extensive experiments across multiple tasks demonstrate that our proposed IDEALPrompt significantly improves both efficiency and performance compared to baselines.

large language model, machine learning, natural language

2412.19684

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Gaussian Object Carver: Object-Compositional Gaussian Splatting with surfaces completion

Liu, Liu, Wang, Xinjie, Qiu, Jiaxiong, Lin, Tianwei, Zhou, Xiaolin, Su, Zhizhong

arXiv.org Artificial Intelligence, Dec-2-2024

3D scene reconstruction is a foundational problem in computer vision. Despite recent advancements in Neural Implicit Representations (NIR), existing methods often lack editability and compositional flexibility, limiting their use in scenarios requiring high interactivity and object-level manipulation. In this paper, we introduce the Gaussian Object Carver (GOC), a novel, efficient, and scalable framework for object-compositional 3D scene reconstruction. GOC leverages 3D Gaussian Splatting (GS), enriched with monocular geometry priors and multi-view geometry regularization, to achieve high-quality and flexible reconstruction. Furthermore, we propose a zero-shot Object Surface Completion (OSC) model, which uses 3D priors from 3D object data to reconstruct unobserved surfaces, ensuring object completeness even in occluded areas. Experimental results demonstrate that GOC improves reconstruction efficiency and geometric fidelity. It holds promise for advancing the practical application of digital twins in embodied AI, AR/VR, and interactive simulation environments.

artificial intelligence, machine learning, reconstruction

2412.02075

Country: Asia (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

Lin, Xuewu, Lin, Tianwei, Huang, Lichao, Xie, Hongyu, Su, Zhizhong

arXiv.org Artificial Intelligence, Nov-27-2024

In embodied intelligence systems, a key component is the 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point clouds, which, despite offering precise geometric information, still constrain perception performance due to inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.

detection, machine learning, natural language

2411.14869

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Zhang, Wenqiao, Lin, Tianwei, Liu, Jiang, Shu, Fangxun, Li, Haoyuan, Zhang, Lei, He, Wanggui, Zhou, Hao, Lv, Zheqi, Jiang, Hao, Li, Juncheng, Tang, Siliang, Zhuang, Yueting

arXiv.org Artificial Intelligence, Mar-20-2024

Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (that is, a trained model with static parameters) that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmllm/HyperLLaVA.

artificial intelligence, large language model, natural language

2403.13447

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction

Lin, Longzhong, Lin, Xuewu, Lin, Tianwei, Huang, Lichao, Xiong, Rong, Wang, Yue

arXiv.org Artificial Intelligence, Dec-14-2023

Motion prediction is a crucial task in autonomous driving, and one of its major challenges lies in the multimodality of future behaviors. Many successful works have utilized mixture models, which require identification of positive mixture components and correspondingly fall into two main lines: prediction-based and anchor-based matching. The prediction clustering phenomenon in prediction-based matching makes it difficult to pick representative trajectories for downstream tasks, while anchor-based matching suffers from a limited regression capability. In this paper, we introduce a novel paradigm, named Evolving and Distinct Anchors (EDA), to define the positive and negative components for multimodal motion prediction based on mixture models. We enable anchors to evolve and redistribute themselves under specific scenes for an enlarged regression capacity. Furthermore, we select distinct anchors before matching them with the ground truth, which results in impressive scoring performance. Our approach enhances all metrics compared to the baseline MTR, particularly with a notable relative reduction of 13.5% in Miss Rate, resulting in state-of-the-art performance on the Waymo Open Motion Dataset. Code is available at https://github.com/Longzhong-Lin/EDA.

anchor, artificial intelligence, machine learning

2312.09501

Genre: Research Report (0.82)

Industry: Transportation (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Symphonize 3D Semantic Scene Completion with Contextual Instance Queries

Jiang, Haoyi, Cheng, Tianheng, Gao, Naiyu, Zhang, Haoyang, Lin, Tianwei, Liu, Wenyu, Wang, Xinggang

arXiv.org Artificial Intelligence, Nov-22-2023

3D Semantic Scene Completion (SSC) has emerged as a nascent and pivotal undertaking in autonomous driving, aiming to predict voxel occupancy within volumetric scenes. However, prevailing methodologies primarily focus on voxel-wise feature aggregation, while neglecting instance semantics and scene context. In this paper, we present a novel paradigm termed Symphonies (Scene-from-Insts), that delves into the integration of instance queries to orchestrate 2D-to-3D reconstruction and 3D scene modeling. Leveraging our proposed Serial Instance-Propagated Attentions, Symphonies dynamically encodes instance-centric semantics, facilitating intricate interactions between image-based and volumetric domains. Simultaneously, Symphonies enables holistic scene comprehension by capturing context through the efficient fusion of instance queries, alleviating geometric ambiguity such as occlusion and perspective errors through contextual scene reasoning. Experimental results demonstrate that Symphonies achieves state-of-the-art performance on challenging benchmarks SemanticKITTI and SSCBench-KITTI-360, yielding remarkable mIoU scores of 15.04 and 18.58, respectively. These results showcase the paradigm's promising advancements. The code is available at https://github.com/hustvl/Symphonies.

artificial intelligence, machine learning, query

2306.15670

Genre: Research Report > New Finding (0.34)

Industry:

Transportation > Ground > Road (0.35)
Information Technology (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.35)

Sparse4D v3: Advancing End-to-End 3D Detection and Tracking

Lin, Xuewu, Pei, Zixiang, Lin, Tianwei, Huang, Lichao, Su, Zhizhong

arXiv.org Artificial Intelligence, Nov-20-2023

In autonomous driving perception systems, 3D detection and tracking are the two fundamental tasks. This paper delves deeper into this field, building upon the Sparse4D framework. We introduce two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and propose decoupled attention to make structural improvements, leading to significant enhancements in detection performance. Additionally, we extend the detector into a tracker using a straightforward approach that assigns instance ID during inference, further highlighting the advantages of query-based algorithms. Extensive experiments conducted on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, we witnessed enhancements of 3.0%, 2.2%, and 7.6% in mAP, NDS, and AMOTA, achieving 46.9%, 56.1%, and 49.0%, respectively. Our best model achieved 71.9% NDS and 67.7% AMOTA on the nuScenes test set.
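
As a quick check on the ResNet50 figures quoted above, the baseline scores can be backed out from the reported gains. This sketch assumes the gains are absolute percentage points, which the abstract does not state explicitly; the dictionary and variable names are illustrative only.

```python
# Scores reported in the abstract for the ResNet50 backbone (in percent).
achieved = {"mAP": 46.9, "NDS": 56.1, "AMOTA": 49.0}

# Reported enhancements, ASSUMED here to be absolute percentage points.
gain = {"mAP": 3.0, "NDS": 2.2, "AMOTA": 7.6}

# Implied baseline = achieved score minus gain, rounded to one decimal
# place to avoid floating-point noise.
baseline = {name: round(achieved[name] - gain[name], 1) for name in achieved}

for name in achieved:
    print(f"{name}: {baseline[name]} -> {achieved[name]} (+{gain[name]})")
```

If the gains were instead relative improvements, the implied baselines would differ, so these values should be checked against the paper's tables before being quoted.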

artificial intelligence, arxiv preprint arxiv, machine learning

2311.11722

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Industry:

Information Technology (0.34)
Transportation > Ground > Road (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Vision (0.92)
Information Technology > Artificial Intelligence > Robots (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Planning-oriented Autonomous Driving

Hu, Yihan, Yang, Jiazhi, Chen, Li, Li, Keyu, Sima, Chonghao, Zhu, Xizhou, Chai, Siqi, Du, Senyao, Lin, Tianwei, Wang, Wenhai, Lu, Lewei, Jia, Xiaosong, Liu, Qiang, Dai, Jifeng, Qiao, Yu, Li, Hongyang

arXiv.org Artificial Intelligence, Mar-23-2023

Modern autonomous driving systems are characterized as modular tasks in sequential order, i.e., perception, prediction, and planning. In order to perform a wide diversity of tasks and achieve advanced-level intelligence, contemporary approaches either deploy standalone models for individual tasks, or design a multi-task paradigm with separate heads. However, they might suffer from accumulative errors or deficient task coordination. Instead, we argue that a favorable framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning of the self-driving car. Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all these tasks contribute to planning. We introduce Unified Autonomous Driving (UniAD), an up-to-date comprehensive framework that incorporates full-stack driving tasks in one network. It is exquisitely devised to leverage the advantages of each module and to provide complementary feature abstractions for agent interaction from a global perspective. Tasks communicate through unified query interfaces to facilitate each other toward planning. We instantiate UniAD on the challenging nuScenes benchmark. With extensive ablations, the effectiveness of this philosophy is shown by substantially outperforming previous state-of-the-art methods in all aspects. Code and models are public.

artificial intelligence, prediction, query

2212.10156

Country: Asia > China (0.28)

Genre: Research Report (0.50)

Industry:

Transportation > Ground > Road (1.00)
Information Technology > Robotics & Automation (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

MVFNet: Multi-View Fusion Network for Efficient Video Recognition

Wu, Wenhao, He, Dongliang, Lin, Tianwei, Li, Fu, Gan, Chuang, Ding, Errui

arXiv.org Artificial Intelligence, Dec-13-2020

Conventionally, spatiotemporal modeling networks and their complexity are the two most studied research topics in video action recognition. Existing state-of-the-art methods have achieved excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to acquire both efficiency and effectiveness simultaneously. First, besides traditionally treating H x W x T video frames as a space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture the dynamics of video thoroughly. Second, our model is designed on 2D CNN backbones, with model complexity kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework that can specialize to existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet can achieve state-of-the-art performance with the complexity of a 2D CNN.
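
The three viewing planes mentioned in the abstract can be illustrated by slicing a small video tensor along each axis. This is only a sketch of the three views under an assumed `video[t][h][w]` layout; it does not implement the MVF module or its separable convolutions, and the function names are hypothetical.

```python
def hw_plane(video, t):
    """Height-Width view: the ordinary frame at time t, shape H x W."""
    return [row[:] for row in video[t]]


def ht_plane(video, w):
    """Height-Time view at a fixed column w, shape H x T."""
    num_frames, height = len(video), len(video[0])
    return [[video[t][h][w] for t in range(num_frames)] for h in range(height)]


def wt_plane(video, h):
    """Width-Time view at a fixed row h, shape W x T."""
    num_frames, width = len(video), len(video[0][0])
    return [[video[t][h][x] for t in range(num_frames)] for x in range(width)]


# Toy clip: T=2 frames of H=3 by W=4 pixels; each value encodes its own
# (t, h, w) position as t*100 + h*10 + w so the slices are easy to inspect.
T, H, W = 2, 3, 4
video = [[[t * 100 + h * 10 + w for w in range(W)] for h in range(H)]
         for t in range(T)]

frame = hw_plane(video, t=1)       # 3 x 4 spatial slice
column_over_time = ht_plane(video, w=2)   # 3 x 2 slice
row_over_time = wt_plane(video, h=2)      # 4 x 2 slice
```

A 2D convolution applied to the second or third view mixes information across time, which is the intuition behind modeling video from all three planes.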

artificial intelligence, convolution, neural network

2012.06977

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (0.91)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
