AITopics | Liu, Jihao

Liu, Jihao

MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

Liu, Jihao, Huang, Xin, Zheng, Jinliang, Liu, Boxiao, Wang, Jia, Yoshie, Osamu, Liu, Yu, Li, Hongsheng

arXiv.org Artificial Intelligence, Jun-28-2024

This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmentation and summarization. It then matches these instructions with images and uses an open-source large language model (LLM) to generate coherent answers to the instruction-image pairs. The LLM is grounded by detailed text descriptions of the images throughout the answer generation process to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models.

large language model, machine learning, natural language

2406.19736

Country: North America > United States (0.14)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)

Instruction-Guided Visual Masking

Zheng, Jinliang, Li, Jianxiong, Cheng, Sijie, Zheng, Yinan, Li, Jiaming, Liu, Jihao, Liu, Yu, Liu, Jingjing, Zhan, Xianyuan

arXiv.org Artificial Intelligence, May-30-2024

Instruction following is crucial in contemporary LLMs. However, when extended to a multimodal setting, it often suffers from misalignment between a specific textual instruction and the targeted local region of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMMs and robot models. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create an IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL), for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM, which, as a plug-and-play tool, significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks. Code is available at https://github.com/2toinf/IVM.
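The idea of masking instruction-irrelevant regions can be illustrated with a toy sketch. This is not the IVM model or its released code: in IVM the relevance map is predicted by a grounding model, whereas here it is simply passed in. The function name `apply_visual_mask` and the `threshold` and `fill` parameters are hypothetical, chosen only for illustration.

```python
def apply_visual_mask(image, relevance, threshold=0.5, fill=0):
    """Keep pixels whose relevance score meets the threshold; blank out the rest.

    image:     2D list of pixel values (assumed grayscale for simplicity).
    relevance: 2D list of scores in [0, 1], same shape as image. In a real
               system these would come from a grounding model (assumption).
    """
    return [
        [pixel if score >= threshold else fill
         for pixel, score in zip(image_row, relevance_row)]
        for image_row, relevance_row in zip(image, relevance)
    ]


# Toy 2x2 image: only the top-left and bottom-right pixels are "relevant".
toy_image = [[10, 20], [30, 40]]
toy_relevance = [[0.9, 0.1], [0.2, 0.8]]
masked_image = apply_visual_mask(toy_image, toy_relevance)
```

The masked image would then be passed to the downstream multimodal model in place of the original, which is what makes a component of this kind plug-and-play.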

large language model, machine learning, natural language

2405.19783

Country:

Europe > Netherlands (0.14)
Asia (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)

DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning

Li, Jianxiong, Zheng, Jinliang, Zheng, Yinan, Mao, Liyuan, Hu, Xiao, Cheng, Sijie, Niu, Haoyi, Liu, Jihao, Liu, Yu, Liu, Jingjing, Zhang, Ya-Qin, Zhan, Xianyuan

arXiv.org Artificial Intelligence, May-23-2024

Multimodal pretraining is an effective strategy for the trinity of goals of representation learning in autonomous robots: 1) extracting both local and global task progressions; 2) enforcing temporal consistency of visual representation; 3) capturing trajectory-level language grounding. Most existing methods approach these via separate objectives, which often reach sub-optimal solutions. In this paper, we propose a universal unified objective that can simultaneously extract meaningful task progression information from image sequences and seamlessly align them with language instructions. We discover that via implicit preferences, where a visual trajectory inherently aligns better with its corresponding language instruction than mismatched pairs, the popular Bradley-Terry model can be transformed into representation learning through proper reward reparameterizations. The resulting framework, DecisionNCE, mirrors an InfoNCE-style objective but is distinctively tailored for decision-making tasks, providing an embodied representation learning framework that elegantly extracts both local and global task progression features, with temporal consistency enforced through implicit time contrastive learning, while ensuring trajectory-level instruction grounding via multimodal joint encoding. Evaluation on both simulated and real robots demonstrates that DecisionNCE effectively facilitates diverse downstream policy learning tasks, offering a versatile solution for unified representation and reward learning. Project Page: https://2toinf.github.io/DecisionNCE/
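The link between a Bradley-Terry preference and an InfoNCE-style loss can be made concrete with a small sketch. It shows only the generic relationship, not the paper's reward reparameterization. The score matrix is an assumption: `scores[i][j]` stands for some similarity between trajectory `i` and instruction `j`, with matched pairs on the diagonal, and both function names are hypothetical.

```python
import math


def bradley_terry_prob(score_preferred, score_other):
    """Probability that the first item is preferred under the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(score_other - score_preferred))


def info_nce_loss(scores):
    """Mean InfoNCE loss over a square score matrix.

    scores[i][i] is the matched (trajectory, instruction) pair; every other
    entry in row i is treated as a mismatched pair.
    """
    total = 0.0
    for i, row in enumerate(scores):
        row_max = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_denominator - row[i]
    return total / len(scores)


# With a single mismatched pair per row, InfoNCE reduces to the negative
# log of the Bradley-Terry preference probability.
two_pair_scores = [[2.0, 0.0], [0.0, 2.0]]
loss = info_nce_loss(two_pair_scores)
bt_nll = -math.log(bradley_terry_prob(2.0, 0.0))
```

With more than one mismatched instruction per trajectory, the softmax denominator generalizes the pairwise comparison to a one-versus-many preference.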

artificial intelligence, decisionnce, machine learning

2402.18137

Country:

Asia (0.28)
Europe > Austria > Vienna (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding

Liu, Jihao, Wang, Tai, Liu, Boxiao, Zhang, Qihang, Liu, Yu, Li, Hongsheng

arXiv.org Artificial Intelligence, Aug-28-2023

Multi-view camera-based 3D detection is a challenging problem in computer vision. Recent works leverage a pretrained LiDAR detection model to transfer knowledge to a camera-based student network. However, we argue that there is a major domain gap between the LiDAR BEV features and the camera-based BEV features, as they have different characteristics and are derived from different sources. In this paper, we propose Geometry Enhanced Masked Image Modeling (GeoMIM) to transfer the knowledge of the LiDAR model in a pretrain-finetune paradigm for improving multi-view camera-based 3D detection. GeoMIM is a multi-camera vision transformer with Cross-View Attention (CVA) blocks that uses LiDAR BEV features encoded by the pretrained BEV model as learning targets. During pretraining, GeoMIM's decoder has a semantic branch that completes dense perspective-view features and a geometry branch that reconstructs dense perspective-view depth maps. The depth branch is designed to be camera-aware by taking the camera's parameters as input for better transfer capability. Extensive results demonstrate that GeoMIM outperforms existing methods on the nuScenes benchmark, achieving state-of-the-art performance for camera-based 3D object detection and 3D segmentation. Code and pretrained models are available at https://github.com/Sense-X/GeoMIM.

artificial intelligence, detection, machine learning

2303.11325

Country:

Asia > China (0.28)
Asia > Middle East > Israel (0.15)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.34)

INTERN: A New Learning Paradigm Towards General Vision

Shao, Jing, Chen, Siyu, Li, Yangguang, Wang, Kun, Yin, Zhenfei, He, Yinan, Teng, Jianing, Sun, Qinghong, Gao, Mengya, Liu, Jihao, Huang, Gengshi, Song, Guanglu, Wu, Yichao, Huang, Yuming, Liu, Fenggang, Peng, Huan, Qin, Shuo, Wang, Chengyu, Wang, Yujie, He, Conghui, Liang, Ding, Liu, Yu, Yu, Fengwei, Yan, Junjie, Lin, Dahua, Wang, Xiaogang, Qiao, Yu

arXiv.org Artificial Intelligence, Nov-16-2021

Enormous waves of technological innovation over the past several years, marked by advances in AI technologies, are profoundly reshaping industry and society. However, down the road, a key challenge awaits us: our capability of meeting rapidly growing scenario-specific demands is severely limited by the cost of acquiring a commensurate amount of training data. This difficult situation is in essence due to limitations of the mainstream learning paradigm: we need to train a new model for each new scenario, based on a large quantity of well-annotated data and commonly from scratch. In tackling this fundamental problem, we move beyond and develop a new learning paradigm named INTERN. By learning with supervisory signals from multiple sources in multiple stages, the model being trained will develop strong generalizability. We evaluate our model on 26 well-known datasets that cover four categories of tasks in computer vision. In most cases, our models, adapted with only 10% of the training data in the target domain, outperform the counterparts trained with the full set of data, often by a significant margin. This is an important step towards a promising prospect where such a model with general vision capability can dramatically reduce our reliance on data, thus expediting the adoption of AI technologies. Furthermore, revolving around our new paradigm, we also introduce a new data system, a new architecture, and a new benchmark, which, together, form a general vision ecosystem to support its future development in an open and inclusive manner.

artificial intelligence, arxiv preprint arxiv, machine learning

2111.08687

Country: Asia > China (0.14)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

FNAS: Uncertainty-Aware Fast Neural Architecture Search

Liu, Jihao, Zhang, Ming, Sun, Yangting, Liu, Boxiao, Song, Guanglu, Liu, Yu, Li, Hongsheng

arXiv.org Artificial Intelligence, May-27-2021

Reinforcement learning (RL)-based neural architecture search (NAS) generally guarantees better convergence than gradient-based approaches, yet it requires far more computational resources because of the rollout bottleneck: exhaustive training of each sampled architecture on the proxy tasks. In this paper, we propose a general pipeline to accelerate the convergence of the rollout process as well as the RL process in NAS. It is motivated by the interesting observation that both the architecture and the parameter knowledge can be transferred between different search processes and even different tasks. We first introduce an uncertainty-aware critic (value function) in Proximal Policy Optimization (PPO) [27] to take advantage of the architecture knowledge in previous search processes, which stabilizes the training process and reduces the search time by 4 times. In addition, an architecture knowledge pool together with a block similarity function is proposed to utilize parameter knowledge, reducing the search time by 2 times. To the best of our knowledge, this is the first method that introduces a block-level weight sharing scheme in RL-based NAS. The block similarity function guarantees a 100% hit ratio with strict fairness [5]. Besides, we show that an off-policy correction factor used in the "replay buffer" of RL optimization can further halve the search time. Experiments on the Mobile Neural Architecture Search (MNAS) [30] search space show that the proposed Fast Neural Architecture Search (FNAS) accelerates the standard RL-based NAS process by 10x (e.g., 20,000 GPU hours to 2,000 GPU hours for MNAS), and guarantees better performance on various vision tasks.

architecture, deep learning, neural network

2105.11694

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
