AITopics | Yue, Xiangyu

Plotting

Yue, Xiangyu

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

Zhang, Yiyuan, Ding, Xiaohan, Gong, Kaixiong, Ge, Yixiao, Shan, Ying, Yue, Xiangyu

arXiv.org Artificial IntelligenceJan-25-2024

We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway - given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models so that data of the target modality can be processed by both models. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and task-specific head as usual but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any inference costs. On the image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2401.14405

Country:

Asia > Middle East > Israel (0.14)
Asia > China (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Add feedback

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

Zhou, Zhanhui, Liu, Jie, Yang, Chao, Shao, Jing, Liu, Yu, Yue, Xiangyu, Ouyang, Wanli, Qiao, Yu

arXiv.org Artificial IntelligenceDec-15-2023

A single language model (LM), despite aligning well with an average labeler through reinforcement learning from human feedback (RLHF), may not universally suit diverse human preferences. Recent approaches therefore opt for customization by collecting multi-dimensional feedback and creating distinct reward models (RMs) for each dimension (e.g., helpfulness, harmlessness, or honesty). Different LMs can then be optimized for different preferences using multi-objective RLHF (MORLHF) with different reward weightings. Yet, RL fine-tuning is unstable and resource-heavy, especially for MORLHF with diverse and usually conflicting objectives. In this paper, we present Multi-Objective Direct Preference Optimization (MODPO), an RL-free algorithm that extends Direct Preference Optimization (DPO) for multiple alignment objectives with minimal overheads. Essentially, MODPO folds language modeling directly into reward modeling, training LMs as implicit collective reward models (cRMs) that combine all objectives with specific weightings. While theoretically guaranteed to produce the same optimal solutions as MORLHF, MODPO is practically more stable and computationally efficient. Empirical results from safety alignment and long-form question answering confirm that MODPO matches or outperforms existing methods, consistently producing a Pareto front of LMs that cater to diverse preferences with 3 times less computational resources compared to MORLHF.

large language model, machine learning, natural language, (23 more...)

arXiv.org Artificial Intelligence

2310.03708

Country:

Asia > India (1.00)
Europe (0.92)
North America > Canada (0.68)
North America > United States > California (0.27)

Genre: Personal (0.67)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Sports > Soccer (1.00)
Leisure & Entertainment > Sports > Football (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

OneLLM: One Framework to Align All Modalities with Language

Han, Jiaming, Gong, Kaixiong, Zhang, Yiyuan, Wang, Jiaqi, Zhang, Kaipeng, Lin, Dahua, Qiao, Yu, Gao, Peng, Yue, Xiangyu

arXiv.org Artificial IntelligenceDec-6-2023

Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2312.037

Country:

North America > United States (0.14)
Asia > China (0.14)

Genre:

Research Report (0.82)
Personal > Interview (0.46)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.48)
Health & Medicine > Health Care Technology (0.36)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Online Vectorized HD Map Construction using Geometry

Zhang, Zhixin, Zhang, Yiyuan, Ding, Xiaohan, Jin, Fusheng, Yue, Xiangyu

arXiv.org Artificial IntelligenceDec-6-2023

The construction of online vectorized High-Definition (HD) maps is critical for downstream prediction and planning. Recent efforts have built strong baselines for this task, however, shapes and relations of instances in urban road systems are still under-explored, such as parallelism, perpendicular, or rectangle-shape. In our work, we propose GeMap ($\textbf{Ge}$ometry $\textbf{Map}$), which end-to-end learns Euclidean shapes and relations of map instances beyond basic perception. Specifically, we design a geometric loss based on angle and distance clues, which is robust to rigid transformations. We also decouple self-attention to independently handle Euclidean shapes and relations. Our method achieves new state-of-the-art performance on the NuScenes and Argoverse 2 datasets. Remarkably, it reaches a 71.8% mAP on the large-scale Argoverse 2 dataset, outperforming MapTR V2 by +4.4% and surpassing the 70% mAP threshold for the first time. Code is available at https://github.com/cnzzx/GeMap

artificial intelligence, displacement vector, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2312.03341

Country: Asia > China (0.28)

Genre: Research Report (1.00)

Industry:

Transportation > Ground > Road (0.50)
Transportation > Infrastructure & Services (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

Add feedback

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

Ding, Xiaohan, Zhang, Yiyuan, Ge, Yixiao, Zhao, Sijie, Song, Lin, Yue, Xiangyu, Shan, Ying

arXiv.org Artificial IntelligenceNov-27-2023

Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but there are two unresolved and critical issues that demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As transformers have dominated multiple modalities, it remains to be investigated whether ConvNets also have a strong universal perception ability in domains beyond vision. In this paper, we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep. Following such guidelines, our proposed large-kernel ConvNet shows leading performance in image recognition. For example, our models achieve an ImageNet accuracy of 88.0%, ADE20K mIoU of 55.6%, and COCO box AP of 56.4%, demonstrating better performance and higher speed than a number of recently proposed powerful competitors. 2) We discover that large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. With certain modality-related preprocessing approaches, the proposed model achieves state-of-the-art performance on time-series forecasting and audio recognition tasks even without modality-specific customization to the architecture. Code and all the models at https://github.com/AILab-CVC/UniRepLKNet.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2311.15599

Country:

North America > United States (0.28)
Europe (0.28)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Towards Unified and Effective Domain Generalization

Zhang, Yiyuan, Gong, Kaixiong, Ding, Xiaohan, Zhang, Kaipeng, Lv, Fangrui, Keutzer, Kurt, Yue, Xiangyu

arXiv.org Artificial IntelligenceOct-15-2023

We propose $\textbf{UniDG}$, a novel and $\textbf{Uni}$fied framework for $\textbf{D}$omain $\textbf{G}$eneralization that is capable of significantly enhancing the out-of-distribution generalization performance of foundation models regardless of their architectures. The core idea of UniDG is to finetune models during the inference stage, which saves the cost of iterative training. Specifically, we encourage models to learn the distribution of test data in an unsupervised manner and impose a penalty regarding the updating step of model parameters. The penalty term can effectively reduce the catastrophic forgetting issue as we would like to maximally preserve the valuable knowledge in the original model. Empirically, across 12 visual backbones, including CNN-, MLP-, and Transformer-based models, ranging from 1.89M to 303M parameters, UniDG shows an average accuracy improvement of +5.4% on DomainBed. These performance results demonstrate the superiority and versatility of UniDG. The code is publicly available at https://github.com/invictus717/UniDG

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2310.10008

Country: Asia > Middle East > Israel (0.14)

Genre: Research Report > Promising Solution (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
(3 more...)

Add feedback

ImageBind-LLM: Multi-modality Instruction Tuning

Han, Jiaming, Zhang, Renrui, Shao, Wenqi, Gao, Peng, Xu, Peng, Xiao, Han, Zhang, Kaipeng, Liu, Chris, Wen, Song, Guo, Ziyu, Lu, Xudong, Ren, Shuai, Wen, Yafei, Chen, Xiaoxin, Yue, Xiangyu, Li, Hongsheng, Qiao, Yu

arXiv.org Artificial IntelligenceSep-11-2023

We present ImageBind-LLM, a multi-modality instruction tuning method of large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning, different from which, our ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic by only image-text alignment training. During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder. Then, the image features transformed by the bind network are added to word tokens of all layers in LLaMA, which progressively injects visual instructions via an attention-free and zero-initialized gating mechanism. Aided by the joint embedding of ImageBind, the simple image-text training enables our model to exhibit superior multi-modality instruction-following capabilities. During inference, the multi-modality inputs are fed into the corresponding ImageBind encoders, and processed by a proposed visual cache model for further cross-modal embedding enhancement. The training-free cache model retrieves from three million image features extracted by ImageBind, which effectively mitigates the training-inference modality discrepancy. Notably, with our approach, ImageBind-LLM can respond to instructions of diverse modalities and demonstrate significant language generation quality. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2309.03905

Country: Asia > China (0.46)

Genre: Research Report (0.40)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models

Zheng, Zangwei, Ma, Mingyuan, Wang, Kai, Qin, Ziheng, Yue, Xiangyu, You, Yang

arXiv.org Artificial IntelligenceAug-11-2023

Continual learning (CL) can help pre-trained vision-language models efficiently adapt to new or under-trained data distributions without re-training. Nevertheless, during the continual training of the Contrastive Language-Image Pre-training (CLIP) model, we observe that the model's zero-shot transfer ability significantly degrades due to catastrophic forgetting. Existing CL methods can mitigate forgetting by replaying previous data. However, since the CLIP dataset is private, replay methods cannot access the pre-training dataset. In addition, replaying data of previously learned downstream tasks can enhance their performance but comes at the cost of sacrificing zero-shot performance. To address this challenge, we propose a novel method ZSCL to prevent zero-shot transfer degradation in the continual learning of vision-language models in both feature and parameter space. In the feature space, a reference dataset is introduced for distillation between the current and initial models. The reference dataset should have semantic diversity but no need to be labeled, seen in pre-training, or matched image-text pairs. In parameter space, we prevent a large parameter shift by averaging weights during the training. We propose a more challenging Multi-domain Task Incremental Learning (MTIL) benchmark to evaluate different methods, where tasks are from various domains instead of class-separated in a single dataset. Our method outperforms other methods in the traditional class-incremental learning setting and the MTIL by 9.7% average score. Our code locates at https://github.com/Thunderbeee/ZSCL.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2303.06628

Country:

Asia (0.14)
Oceania > Australia (0.14)

Genre: Research Report (0.84)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Beating Backdoor Attack at Its Own Game

Liu, Min, Sangiovanni-Vincentelli, Alberto, Yue, Xiangyu

arXiv.org Artificial IntelligenceAug-4-2023

Deep neural networks (DNNs) are vulnerable to backdoor attack, which does not affect the network's performance on clean data but would manipulate the network behavior once a trigger pattern is added. Existing defense methods have greatly reduced attack success rate, but their prediction accuracy on clean data still lags behind a clean model by a large margin. Inspired by the stealthiness and effectiveness of backdoor attack, we propose a simple but highly effective defense framework which injects non-adversarial backdoors targeting poisoned samples. Following the general steps in backdoor attack, we detect a small set of suspected samples and then apply a poisoning strategy to them. The non-adversarial backdoor, once triggered, suppresses the attacker's backdoor on poisoned data, but has limited influence on clean data. The defense can be carried out during data preprocessing, without any modification to the standard end-to-end training pipeline. We conduct extensive experiments on multiple benchmarks with different architectures and representative attacks. Results demonstrate that our method achieves state-of-the-art defense effectiveness with by far the lowest performance drop on clean data. Considering the surprising defense ability displayed by our framework, we call for more attention to utilizing backdoor for backdoor defense. Code is available at https://github.com/damianliumin/non-adversarial_backdoor.

artificial intelligence, backdoor, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2307.15539

Country: Asia (0.14)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

Meta-Transformer: A Unified Framework for Multimodal Learning

Zhang, Yiyuan, Gong, Kaixiong, Zhang, Kaipeng, Li, Hongsheng, Qiao, Yu, Ouyang, Wanli, Yue, Xiangyu

arXiv.org Artificial IntelligenceJul-20-2023

Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities ($\textit{e.g.}$ natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a $\textbf{frozen}$ encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components: a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks, Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks including fundamental perception (text, image, point cloud, audio, video), practical application (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2307.10802

Country:

North America > United States (0.46)
Asia > Middle East > Israel (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(5 more...)

Add feedback