AITopics | Liu, Shilong

Collaborating Authors

Liu, Shilong

A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models

Lin, Zihao, Basu, Samyadeep, Beigi, Mohammad, Manjunatha, Varun, Rossi, Ryan A., Wang, Zichao, Zhou, Yufan, Balasubramanian, Sriram, Zarei, Arman, Rezaei, Keivan, Shen, Ying, Yao, Barry Menglong, Xu, Zhiyang, Liu, Qin, Zhang, Yuxiang, Sun, Yan, Liu, Shilong, Shen, Li, Li, Hongxuan, Feizi, Soheil, Huang, Lifu

arXiv.org Artificial Intelligence, Feb-22-2025

The rise of foundation models has transformed machine learning research, prompting efforts to uncover their inner workings and develop more efficient and reliable applications for better control. While significant progress has been made in interpreting Large Language Models (LLMs), multimodal foundation models (MMFMs) - such as contrastive vision-language models, generative vision-language models, and text-to-image models - pose unique interpretability challenges beyond unimodal frameworks. Despite initial studies, a substantial gap remains between the interpretability of LLMs and MMFMs. This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and crossmodal systems. By systematically reviewing current MMFM analysis techniques, we propose a structured taxonomy of interpretability methods, compare insights across unimodal and multimodal architectures, and highlight critical research gaps.

large language model, machine learning, natural language, (20 more...)

arXiv:2502.17516

Country:

Europe (0.92)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre:

Research Report (1.00)
Overview (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

Li, Zhiqi, Chen, Guo, Liu, Shilong, Wang, Shihao, VS, Vibashan, Ji, Yishen, Lan, Shiyi, Zhang, Hao, Zhao, Yilin, Radhakrishnan, Subhashree, Chang, Nadine, Sapra, Karan, Deshmukh, Amala Sanjay, Rintamaki, Tuomas, Le, Matthieu, Karmanov, Ilia, Voegtle, Lukas, Fischer, Philipp, Huang, De-An, Roman, Timo, Lu, Tong, Alvarez, Jose M., Catanzaro, Bryan, Kautz, Jan, Tao, Andrew, Liu, Guilin, Yu, Zhiding

arXiv.org Artificial Intelligence, Jan-20-2025

Recently, promising progress has been made by open-source vision-language models (VLMs) in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models only publish their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we address VLM post-training from a data-centric perspective, showing the key role of data strategy in developing frontier VLMs. By studying and building our post-training data strategy from scratch, we share detailed insights into the development processes, aiming to benefit the development of competitive models for the open-source community. Our introduced data strategy, together with training recipes and model design, leads to a family of performant VLMs named Eagle2. Specifically, Eagle2-9B achieves state-of-the-art results across various multimodal benchmarks, matching certain competitive models with up to 70B parameters.

large language model, machine learning, natural language, (20 more...)

arXiv:2501.14818

Country: Asia (0.14)

Genre: Research Report > New Finding (0.45)

Industry: Education (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

Li, Binxu, Yan, Tiankai, Pan, Yuanting, Xu, Zhe, Luo, Jie, Ji, Ruiyang, Liu, Shilong, Dong, Haoyu, Lin, Zihao, Wang, Yixin

arXiv.org Artificial Intelligence, Jul-2-2024

Multi-Modal Large Language Models (MLLMs), despite being successful, exhibit limited generality and often fall short when compared to specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models as tools based on user inputs. However, such advancements have not been extensively explored within the medical domain. To bridge this gap, this paper introduces the first agent explicitly designed for the medical field, named Multi-modal Medical Agent (MMedAgent). We curate an instruction-tuning dataset comprising six medical tools solving seven tasks, enabling the agent to choose the most suitable tools for a given task. Comprehensive experiments demonstrate that MMedAgent achieves superior performance across a variety of medical tasks compared to state-of-the-art open-source methods and even the closed-source model, GPT-4o. Furthermore, MMedAgent exhibits efficiency in updating and integrating new medical tools.

large language model, machine learning, natural language, (18 more...)

arXiv:2407.02483

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.68)

Industry:

Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Health Care Equipment & Supplies (0.81)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

Xu, Tianqi, Chen, Linyao, Wu, Dai-Jie, Chen, Yanjun, Zhang, Zecheng, Yao, Xiang, Xie, Zhiqiang, Chen, Yongchao, Liu, Shilong, Qian, Bochen, Torr, Philip, Ghanem, Bernard, Li, Guohao

arXiv.org Artificial Intelligence, Jul-1-2024

The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 100 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 35.26%. All framework code, agent code, and task datasets are publicly available at https://github.com/camel-ai/crab.
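The graph-based, fine-grained evaluation mentioned in the Crab abstract can be illustrated with a small sketch. This is not the Crab framework's API: the task, the node names, the `state` dictionary, and the rule that a node counts only once all of its predecessors are complete are assumptions made for illustration, as is the reading of "completion ratio" as completed nodes divided by total nodes.

```python
from graphlib import TopologicalSorter


def completion_ratio(predecessors, checkers, state):
    """Score a task as the fraction of sub-goal nodes that are completed.

    predecessors: maps each node to the set of nodes that must finish first.
    checkers: maps each node to a function state -> bool (hypothetical evaluators).
    A node counts only if every predecessor is completed and its checker passes.
    """
    completed = set()
    for node in TopologicalSorter(predecessors).static_order():
        deps_done = all(dep in completed for dep in predecessors.get(node, ()))
        if deps_done and checkers[node](state):
            completed.add(node)
    return len(completed) / len(checkers), completed


# Hypothetical cross-environment task: take a photo on a phone, view it on a desktop.
predecessors = {
    "open_camera": set(),
    "take_photo": {"open_camera"},
    "open_file_manager": set(),
    "transfer_file": {"take_photo"},
    "open_on_desktop": {"transfer_file", "open_file_manager"},
}
checkers = {
    "open_camera": lambda s: s["camera_opened"],
    "take_photo": lambda s: s["photo_count"] > 0,
    "open_file_manager": lambda s: s["file_manager_open"],
    "transfer_file": lambda s: s["file_on_desktop"],
    "open_on_desktop": lambda s: s["viewer_open"],
}
# Observed environment state after the agent's run (made-up values).
state = {
    "camera_opened": True,
    "photo_count": 1,
    "file_manager_open": True,
    "file_on_desktop": False,
    "viewer_open": True,
}

ratio, done = completion_ratio(predecessors, checkers, state)
print(f"completion ratio: {ratio:.2f}, completed: {sorted(done)}")
```

In this sketch the agent gets partial credit for the three sub-goals it reached, while `open_on_desktop` is not counted because the file transfer it depends on never succeeded. A pass/fail check on the final goal alone would not show this distinction.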

large language model, machine learning, natural language, (19 more...)

arXiv:2407.01511

Country: North America > United States > California (0.14)

Genre:

Workflow (1.00)
Research Report > New Finding (1.00)

Industry:

Information Technology > Software (0.49)
Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

Bao, Fan, Xiang, Chendong, Yue, Gang, He, Guande, Zhu, Hongzhou, Zheng, Kaiwen, Zhao, Min, Liu, Shilong, Wang, Yaole, Zhu, Jun

arXiv.org Artificial Intelligence, May-7-2024

We introduce Vidu, a high-performance text-to-video generator that is capable of producing 1080p videos up to 16 seconds in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which unlocks the scalability and the capability for handling long videos. Vidu exhibits strong coherence and dynamism, and is capable of generating both realistic and imaginative videos, as well as understanding some professional photography techniques, on par with Sora, the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation, including canny-to-video generation, video prediction and subject-driven generation, which demonstrate promising results.

artificial intelligence, machine learning, vidu, (12 more...)

arXiv:2405.04233

Country: Europe > Germany (0.14)

Genre: Research Report (0.40)

Industry:

Media > Photography (0.70)
Media > Film (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

TAPTR: Tracking Any Point with Transformers as Detection

Li, Hongyang, Zhang, Hao, Liu, Shilong, Zeng, Zhaoyang, Ren, Tianhe, Li, Feng, Zhang, Lei

arXiv.org Artificial Intelligence, Mar-19-2024

In this paper, we propose a simple and strong framework for Tracking Any Point with TRansformers (TAPTR). Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP. In the proposed framework, in each video frame, each tracking point is represented as a point query, which consists of a positional part and a content part. As in DETR, each query (its position and content feature) is naturally updated layer by layer. Its visibility is predicted by its updated content feature. Queries belonging to the same tracking point can exchange information through self-attention along the temporal dimension. As all such operations are well-designed in DETR-like algorithms, the model is conceptually very simple. We also adopt some useful designs such as cost volume from optical flow models and develop simple designs to provide long temporal information while mitigating the feature drifting issue. Our framework achieves state-of-the-art performance on various TAP datasets with faster inference speed.

artificial intelligence, content feature, machine learning, (15 more...)

arXiv:2403.13042

Country: Asia > China (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Interfacing Foundation Models' Embeddings

Zou, Xueyan, Li, Linjie, Wang, Jianfeng, Yang, Jianwei, Ding, Mingyu, Yang, Zhengyuan, Li, Feng, Zhang, Hao, Liu, Shilong, Aravinthan, Arul, Lee, Yong Jae, Wang, Lijuan

arXiv.org Artificial Intelligence, Dec-12-2023

We present FIND, a generalized interface for aligning foundation models' embeddings. As shown in teaser figure, a lightweight transformer interface without tuning any foundation model weights is enough for a unified image (segmentation) and dataset-level (retrieval) understanding. The proposed interface has the following favorable attributes: (1) Generalizable. It applies to various tasks spanning retrieval, segmentation, etc., under the same architecture and weights. (2) Prototypable. Different tasks are able to be implemented through prototyping attention masks and embedding types. (3) Extendable. The proposed interface is adaptive to new tasks, and new models. (4) Interleavable. With the benefit of multi-task multi-modal training, the proposed interface creates an interleaved shared embedding space. In light of the interleaved embedding space, we introduce the FIND-Bench, which introduces new training and evaluation annotations to the COCO dataset for interleave segmentation and retrieval. Our approach achieves state-of-the-art performance on FIND-Bench and competitive performance on standard retrieval and segmentation settings. The training, evaluation, and demo code as well as the dataset have been released at https://github.com/UX-Decoder/FIND.

large language model, machine learning, segmentation, (19 more...)

arXiv:2312.07532

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
(2 more...)

Visual In-Context Prompting

Li, Feng, Jiang, Qing, Zhang, Hao, Ren, Tianhe, Liu, Shilong, Zou, Xueyan, Xu, Huaizhe, Li, Hongyang, Li, Chunyuan, Yang, Jianwei, Zhang, Lei, Gao, Jianfeng

arXiv.org Artificial Intelligence, Nov-22-2023

In-context prompting in large language models (LLMs) has become a prevalent approach to improve zero-shot capabilities, but this idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation to segment the most relevant object, falling short of addressing many generic vision tasks like open-set segmentation and detection. In this paper, we introduce a universal visual in-context prompting framework for both tasks. In particular, we build on top of an encoder-decoder architecture, and develop a versatile prompt encoder to support a variety of prompts like strokes, boxes, and points. We further enhance it to take an arbitrary number of reference image segments as the context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities to refer and detect, yielding competitive performance to close-set in-domain datasets and showing promising results on many open-set segmentation datasets. By joint training on COCO and SA-1B, our model achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv.

large language model, machine learning, segmentation, (21 more...)

arXiv:2311.13601

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Liu, Shilong, Cheng, Hao, Liu, Haotian, Zhang, Hao, Li, Feng, Ren, Tianhe, Zou, Xueyan, Yang, Jianwei, Su, Hang, Zhu, Jun, Zhang, Lei, Gao, Jianfeng, Li, Chunyuan

arXiv.org Artificial Intelligence, Nov-9-2023

LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction session, significantly improving tool use performance and enabling new scenarios.

artificial intelligence, creating multimodal agent, llava-plus, (2 more...)

arXiv:2311.05437

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence (0.53)

DN-DETR: Accelerate DETR Training by Introducing Query DeNoising

Li, Feng, Zhang, Hao, Liu, Shilong, Guo, Jian, Ni, Lionel M., Zhang, Lei

arXiv.org Artificial Intelligence, Dec-8-2022

We present in this paper a novel denoising training method to speed up DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching, which causes inconsistent optimization goals in early training stages. To address this issue, in addition to the Hungarian loss, our method feeds ground-truth bounding boxes with noise into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the bipartite graph matching difficulty and leads to faster convergence. Our method is universal and can be easily plugged into any DETR-like method by adding dozens of lines of code to achieve a remarkable improvement. As a result, our DN-DETR results in a remarkable improvement (+1.9 AP) under the same setting and achieves the best result (AP 43.4 and 48.6 with 12 and 50 epochs of training, respectively) among DETR-like methods with a ResNet-50 backbone. Compared with the baseline under the same setting, DN-DETR achieves comparable performance with 50% of the training epochs. Code is available at https://github.com/FengLi-ust/DN-DETR.
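The query-denoising idea in the DN-DETR abstract (feed noised ground-truth boxes to the decoder and train it to reconstruct the originals) can be sketched in a few lines. This is not the authors' implementation: the `(cx, cy, w, h)` box format, the uniform jitter, the `scale` parameter, and the plain L1 reconstruction loss are assumptions chosen to keep the example small and free of dependencies.

```python
import random


def add_box_noise(boxes, scale=0.4, rng=None):
    """Return jittered copies of ground-truth boxes given as (cx, cy, w, h) in [0, 1].

    The noise scheme here is an assumption for illustration, not the paper's exact recipe.
    """
    rng = rng or random.Random()
    noised = []
    for cx, cy, w, h in boxes:
        # Shift the centre by at most scale * half the box size.
        ncx = cx + rng.uniform(-1.0, 1.0) * scale * w / 2
        ncy = cy + rng.uniform(-1.0, 1.0) * scale * h / 2
        # Rescale width and height by a factor in [1 - scale, 1 + scale].
        nw = w * (1 + rng.uniform(-1.0, 1.0) * scale)
        nh = h * (1 + rng.uniform(-1.0, 1.0) * scale)
        # Clamp every coordinate back into the normalized image range.
        noised.append(tuple(min(max(v, 0.0), 1.0) for v in (ncx, ncy, nw, nh)))
    return noised


def l1_reconstruction_loss(predicted, targets):
    """Mean absolute error between predicted boxes and the clean ground-truth boxes."""
    total = sum(abs(p - t) for pb, tb in zip(predicted, targets) for p, t in zip(pb, tb))
    return total / (4 * len(targets))


gt_boxes = [(0.50, 0.50, 0.20, 0.30), (0.25, 0.70, 0.10, 0.10)]
noised_queries = add_box_noise(gt_boxes, scale=0.4, rng=random.Random(0))

# A trained decoder would map noised_queries back toward gt_boxes. Here an
# untrained "identity decoder" stands in, so the loss is just the injected noise.
loss = l1_reconstruction_loss(noised_queries, gt_boxes)
print(f"reconstruction loss before any training: {loss:.4f}")
```

Because each noised query is paired with the box it was derived from, this auxiliary loss needs no bipartite matching, which is the property the abstract credits for the more stable early training.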

artificial intelligence, dn-detr, machine learning, (18 more...)

arXiv:2203.01305

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
