AITopics | Cui, Junbo

Plotting

Cui, Junbo

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Yu, Shi, Tang, Chaoyue, Xu, Bokai, Cui, Junbo, Ran, Junhao, Yan, Yukun, Liu, Zhenghao, Wang, Shuo, Han, Xu, Liu, Zhiyuan, Sun, Maosong

arXiv.org Artificial IntelligenceOct-14-2024

Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in realworld multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25-39% end-to-end performance gain over traditional textbased RAG pipeline. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag. Trained on massive data, large language models (LLMs) like GPT-4 (Achiam et al., 2023) have shown strong abilities in common NLP tasks using their parametric knowledge (Wei et al., 2022; Zhao et al., 2023). Retrieval-augmented generation (RAG) alleviates this problem by using a knowledge retriever, which has access to a custom outer knowledge base, to supply the LLM with the necessary information for generating outputs (Guu et al., 2020; Lewis et al., 2020; Yu et al., 2023). Opensource RAG frameworks like llamaindex (Liu, 2022) have been developed to facilitate the research and deployment of common RAG pipelines. Typical retrieval-augmented generation (RAG) pipelines are text-based, operating on segmented texts as retrieval units (Yu et al., 2023; Asai et al., 2024; Yan et al., 2024), which we refer to as TextRAG.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2410.10594

Country: North America > United States (0.67)

Genre: Research Report > Promising Solution (0.34)

Industry: Transportation > Air (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

GUICourse: From General Vision Language Models to Versatile GUI Agents

Chen, Wentong, Cui, Junbo, Hu, Jinyi, Qin, Yujia, Fang, Junjie, Zhao, Yue, Wang, Chongyi, Liu, Jun, Chen, Guirong, Huo, Yupeng, Yao, Yuan, Lin, Yankai, Liu, Zhiyuan, Sun, Maosong

arXiv.org Artificial IntelligenceJun-17-2024

Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2406.11317

Country:

Asia > China (0.46)
North America > United States (0.28)
North America > Canada > Ontario (0.14)

Genre:

Research Report (1.00)
Instructional Material (0.67)

Industry:

Education > Educational Setting > Online (0.68)
Information Technology > Services (0.46)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

Xu, Ruyi, Yao, Yuan, Guo, Zonghao, Cui, Junbo, Ni, Zanlin, Ge, Chunjiang, Chua, Tat-Seng, Liu, Zhiyuan, Sun, Maosong, Huang, Gao

arXiv.org Artificial IntelligenceMar-18-2024

Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6 times larger (i.e., 672x1088) resolution images using only 94% inference computation, and achieves 6.4 accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours of LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2403.11703

Genre: Research Report (1.00)

Industry: Consumer Products & Services > Restaurants (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback