AITopics | Song, Lin

Collaborating Authors

Song, Lin

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

Yang, Rui, Song, Lin, Xiao, Yicheng, Huang, Runhui, Ge, Yixiao, Shan, Ying, Zhao, Hengshuang

arXiv.org Artificial IntelligenceMar-12-2025

Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.14694

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Sports > Basketball (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)

Add feedback

GrootVL: Tree Topology is All You Need in State Space Model

Xiao, Yicheng, Song, Lin, Huang, Shaoli, Wang, Jiangshan, Song, Siyu, Ge, Yixiao, Li, Xiu, Shan, Ying

arXiv.org Artificial IntelligenceJun-4-2024

The state space models, employing recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models and superior efficiency. However, constrained by the inherent geometric constraints of sequences, it still falls short in modeling long-range dependencies. To address this issue, we propose the GrootVL network, which first dynamically generates a tree topology based on spatial relationships and input features. Then, feature propagation is performed based on this graph, thereby breaking the original sequence constraints to achieve stronger representation capabilities. Additionally, we introduce a linear complexity dynamic programming algorithm to enhance long-range interactions without increasing computational cost. GrootVL is a versatile multimodal framework that can be applied to both visual and textual tasks. Extensive experiments demonstrate that our method significantly outperforms existing structured state space models on image classification, object detection and segmentation. Besides, by fine-tuning large language models, our approach achieves consistent improvements in multiple textual tasks at minor training cost.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2406.02395

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)

Add feedback

Efficient and Interaction-Aware Trajectory Planning for Autonomous Vehicles with Particle Swarm Optimization

Song, Lin, Isele, David, Hovakimyan, Naira, Bae, Sangjae

arXiv.org Artificial IntelligenceFeb-2-2024

Abstract-- This paper introduces a novel numerical approach to achieving smooth lane-change trajectories in autonomous driving scenarios. The generation of smooth and dynamically feasible trajectories for the lane change maneuver is facilitated by combining polynomial curve fitting with particle propagation, which can account for vehicle dynamics. The proposed planning algorithm is capable of determining feasible trajectories with real-time computation capability. The simulation results validate the efficacy and effectiveness of our proposed approach. One example of this is Neural I. INTRODUCTION Network Model Predictive Control (NNMPC) [11,12], which We consider motion planning for autonomous vehicles in attempts to solve merging in dense traffic by combining highly dense traffic scenarios, as depicted in Figure 1.

artificial intelligence, evolutionary algorithm, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2402.01575

Country:

North America > United States > California (0.28)
North America > United States > Illinois > Champaign County > Urbana (0.14)

Genre: Research Report (0.64)

Industry:

Automobiles & Trucks (0.67)
Information Technology (0.66)
Energy > Oil & Gas > Upstream (0.50)
Transportation > Ground > Road (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

Add feedback

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

Ding, Xiaohan, Zhang, Yiyuan, Ge, Yixiao, Zhao, Sijie, Song, Lin, Yue, Xiangyu, Shan, Ying

arXiv.org Artificial IntelligenceNov-27-2023

Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but there are two unresolved and critical issues that demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As transformers have dominated multiple modalities, it remains to be investigated whether ConvNets also have a strong universal perception ability in domains beyond vision. In this paper, we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep. Following such guidelines, our proposed large-kernel ConvNet shows leading performance in image recognition. For example, our models achieve an ImageNet accuracy of 88.0%, ADE20K mIoU of 55.6%, and COCO box AP of 56.4%, demonstrating better performance and higher speed than a number of recently proposed powerful competitors. 2) We discover that large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. With certain modality-related preprocessing approaches, the proposed model achieves state-of-the-art performance on time-series forecasting and audio recognition tasks even without modality-specific customization to the architecture. Code and all the models at https://github.com/AILab-CVC/UniRepLKNet.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2311.15599

Country:

North America > United States (0.28)
Europe (0.28)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

InstructDET: Diversifying Referring Object Detection with Generalized Instructions

Dang, Ronghao, Feng, Jiangyan, Zhang, Haodong, Ge, Chongjian, Song, Lin, Gong, Lijun, Liu, Chengju, Chen, Qijun, Zhu, Feng, Zhao, Rui, Song, Yibing

arXiv.org Artificial IntelligenceOct-16-2023

We propose InstructDET, a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions. While deriving from referring expressions (REC), the instructions we leverage are greatly diversified to encompass common user intentions related to object detection. For one image, we produce tremendous instructions that refer to every single object and different combinations of multiple objects. Each instruction and its corresponding object bounding boxes (bbxs) constitute one training data pair. In order to encompass common detection expressions, we involve emerging vision-language model (VLM) and large language model (LLM) to generate instructions guided by text prompts and object bbxs, as the generalizations of foundation models are effective to produce human-like expressions (e.g., describing object property, category, and relationship). We name our constructed dataset as InDET. It contains images, bbxs and generalized instructions that are from foundation models. Our InDET is developed from existing REC datasets and object detection datasets, with the expanding potential that any image with object bbxs can be incorporated through using our InstructDET method. By using our InDET dataset, we show that a conventional ROD model surpasses existing methods on standard REC datasets and our InDET test set. Our data-centric method InstructDET, with automatic data expansion by leveraging foundation models, directs a promising field that ROD can be greatly diversified to execute common object detection instructions.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2310.05136

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry:

Law Enforcement & Public Safety > Fire & Emergency Services (0.47)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

DiffTune: Auto-Tuning through Auto-Differentiation

Cheng, Sheng, Kim, Minkyung, Song, Lin, Yang, Chengyu, Jin, Yiquan, Wang, Shenlong, Hovakimyan, Naira

arXiv.org Artificial IntelligenceAug-15-2023

The performance of robots in high-level tasks depends on the quality of their lower-level controller, which requires fine-tuning. However, the intrinsically nonlinear dynamics and controllers make tuning a challenging task when it is done by hand. In this paper, we present DiffTune, a novel, gradient-based automatic tuning framework. We formulate the controller tuning as a parameter optimization problem. Our method unrolls the dynamical system and controller as a computational graph and updates the controller parameters through gradient-based optimization. The gradient is obtained using sensitivity propagation, which is the only method for gradient computation when tuning for a physical system instead of its simulated counterpart. Furthermore, we use $\mathcal{L}_1$ adaptive control to compensate for the uncertainties (that unavoidably exist in a physical system) such that the gradient is not biased by the unmodelled uncertainties. We validate the DiffTune on a Dubin's car and a quadrotor in challenging simulation environments. In comparison with state-of-the-art auto-tuning methods, DiffTune achieves the best performance in a more efficient manner owing to its effective usage of the first-order information of the system. Experiments on tuning a nonlinear controller for quadrotor show promising results, where DiffTune achieves 3.5x tracking error reduction on an aggressive trajectory in only 10 trials over a 12-dimensional controller parameter space.

artificial intelligence, controller, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2209.10021

Country:

Europe (0.93)
Asia (0.67)
North America > United States > Illinois (0.14)

Genre:

Research Report (0.63)
Overview (0.46)

Industry:

Energy (0.46)
Leisure & Entertainment > Games > Computer Games (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)
(2 more...)

Add feedback

BoxSnake: Polygonal Instance Segmentation with Box Supervision

Yang, Rui, Song, Lin, Ge, Yixiao, Li, Xiu

arXiv.org Artificial IntelligenceJul-24-2023

Box-supervised instance segmentation has gained much attention as it requires only simple box annotations instead of costly mask or polygon annotations. However, existing box-supervised instance segmentation models mainly focus on mask-based frameworks. We propose a new end-to-end training technique, termed BoxSnake, to achieve effective polygonal instance segmentation using only box annotations for the first time. Our method consists of two loss functions: (1) a point-based unary loss that constrains the bounding box of predicted polygons to achieve coarse-grained segmentation; and (2) a distance-aware pairwise loss that encourages the predicted polygons to fit the object boundaries. Compared with the mask-based weakly-supervised methods, BoxSnake further reduces the performance gap between the predicted segmentation and the bounding box, and shows significant superiority on the Cityscapes dataset. The code has been available publicly.

machine learning, natural language, segmentation, (21 more...)

arXiv.org Artificial Intelligence

2303.1163

Country: Asia > China (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language (0.68)

Add feedback

GMM: Delving into Gradient Aware and Model Perceive Depth Mining for Monocular 3D Detection

Mao, Weixin, Yang, Jinrong, Ge, Zheng, Song, Lin, Zhou, Hongyu, Mao, Tiezheng, Li, Zeming, Yoshie, Osamu

arXiv.org Artificial IntelligenceJun-30-2023

Depth perception is a crucial component of monoc-ular 3D detection tasks that typically involve ill-posed problems. In light of the success of sample mining techniques in 2D object detection, we propose a simple yet effective mining strategy for improving depth perception in 3D object detection. Concretely, we introduce a plain metric to evaluate the quality of depth predictions, which chooses the mined sample for the model. Moreover, we propose a Gradient-aware and Model-perceive Mining strategy (GMM) for depth learning, which exploits the predicted depth quality for better depth learning through easy mining. GMM is a general strategy that can be readily applied to several state-of-the-art monocular 3D detectors, improving the accuracy of depth prediction. Extensive experiments on the nuScenes dataset demonstrate that the proposed methods significantly improve the performance of 3D object detection while outperforming other state-of-the-art sample mining techniques by a considerable margin. On the nuScenes benchmark, GMM achieved the state-of-the-art (42.1% mAP and 47.3% NDS) performance in monocular object detection.

artificial intelligence, detection, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2306.1745

Country: Asia > China (0.46)

Genre: Research Report > New Finding (0.46)

Industry: Materials > Metals & Mining (0.68)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Add feedback

Sticker820K: Empowering Interactive Retrieval with Stickers

Zhao, Sijie, Ge, Yixiao, Qi, Zhongang, Song, Lin, Ding, Xiaohan, Xie, Zehua, Shan, Ying

arXiv.org Artificial IntelligenceJun-12-2023

Stickers have become a ubiquitous part of modern-day communication, conveying complex emotions through visual imagery. To facilitate the development of more powerful algorithms for analyzing stickers, we propose a large-scale Chinese sticker dataset, namely Sticker820K, which consists of 820k image-text pairs. Each sticker has rich and high-quality textual annotations, including descriptions, optical characters, emotional labels, and style classifications. Although vision-language tasks in the domain of natural images have been well studied, directly applying the those models, such as CLIP, to sticker data is not an optimal solution due to the discrepant nature between natural and emotive image data. Therefore, we propose StickerCLIP as a benchmark model on the Sticker820K dataset. For the text-to-image retrieval task, our StickerCLIP demonstrates strong superiority over the CLIP, which achieves an absolute gain of 66.0\% in mean recall on the Sticker820K test set. Additionally, we endeavor to extend the recently popularized LLM by means of prompt tuning, integrating its ability for sticker retrieval and allowing users to retrieve stickers through instructions. We validate the feasibility of this method, demonstrating the immense potential of prompt tuning in expanding LLM abilities while not affecting the quality of upstream tasks.

machine learning, natural language, sticker, (18 more...)

arXiv.org Artificial Intelligence

2306.0687

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)

Add feedback

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

Yang, Rui, Song, Lin, Li, Yanwei, Zhao, Sijie, Ge, Yixiao, Li, Xiu, Shan, Ying

arXiv.org Artificial IntelligenceMay-30-2023

This paper aims to efficiently enable Large Language Models (LLMs) to use multimodal tools. Advanced proprietary LLMs, such as ChatGPT and GPT-4, have shown great potential for tool usage through sophisticated prompt engineering. Nevertheless, these models typically rely on prohibitive computational costs and publicly inaccessible data. To address these challenges, we propose the GPT4Tools based on self-instruct to enable open-source LLMs, such as LLaMA and OPT, to use tools. It generates an instruction-following dataset by prompting an advanced teacher with various multi-modal contexts. By using the Low-Rank Adaptation (LoRA) optimization, our approach facilitates the open-source LLMs to solve a range of visual problems, including visual comprehension and image generation. Moreover, we provide a benchmark to evaluate the ability of LLMs to use tools, which is performed in both zero-shot and fine-tuning ways. Extensive experiments demonstrate the effectiveness of our method on various language models, which not only significantly improves the accuracy of invoking seen tools, but also enables the zero-shot capacity for unseen tools.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2305.18752

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback