AITopics | Li, Xiujun

Collaborating Authors

Li, Xiujun

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Multimodal Autoregressive Pre-training of Large Vision Encoders

Fini, Enrico, Shukor, Mustafa, Li, Xiujun, Dufter, Philipp, Klein, Michal, Haldimann, David, Aitharaju, Sai, da Costa, Victor Guilherme Turrisi, Béthune, Louis, Gan, Zhe, Toshev, Alexander T, Eichner, Marcin, Nabi, Moin, Yang, Yinfei, Susskind, Joshua M., El-Nouby, Alaaeldin

arXiv.org Artificial IntelligenceNov-21-2024

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.

benchmark, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2411.14402

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Li, Zhangheng, You, Keen, Zhang, Haotian, Feng, Di, Agrawal, Harsh, Li, Xiujun, Moorthy, Mohana Prasad Sathya, Nichols, Jeff, Yang, Yinfei, Gan, Zhe

arXiv.org Artificial IntelligenceOct-24-2024

Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.18967

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Human Computer Interaction (1.00)
Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

From Text to Pixel: Advancing Long-Context Understanding in MLLMs

Lu, Yujie, Li, Xiujun, Fu, Tsu-Jui, Eckstein, Miguel, Wang, William Yang

arXiv.org Artificial IntelligenceMay-23-2024

The rapid progress in Multimodal Large Language Models (MLLMs) has significantly advanced their ability to process and understand complex visual and textual information. However, the integration of multiple images and extensive textual contexts remains a challenge due to the inherent limitation of the models' capacity to handle long input sequences efficiently. In this paper, we introduce SEEKER, a multimodal large language model designed to tackle this issue. SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images, enabling the model to handle long text within a fixed token-length budget efficiently. Our empirical experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach, and is more efficient in understanding long-form multimodal input and generating long-form textual output, outperforming all existing proprietary and open-source MLLMs by large margins.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2405.14213

Country:

North America > United States > California (0.14)
Europe > Spain (0.14)
Europe > Italy (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

Add feedback

VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following

Lu, Yujie, Li, Xiujun, Wang, William Yang, Choi, Yejin

arXiv.org Artificial IntelligenceNov-29-2023

We introduce VISUAL EMBEDDED INSTRUCTION (VIM), a new framework designed to evaluate the visual instruction following capability of Multimodal Large Language Models (MLLMs). As illustrated in Figure 2, VIM challenges the MLLMs by embedding the instructions into the visual scenes, demanding strong visual interpretative skills for instruction following. We adapt VIM to various benchmarks, including VQAv2, MME, MM-Vet, and RefCOCO series, compose a VIM bench, and probe diverse MLLMs across three distinct in-context learning settings: Zero Shot, One Shot, and Pair Shot. We observe that there is a significant performance disparity between the open-source MLLMs and GPT-4V, implying that their proficiency in visual instruction comprehension is not up to par. Our results highlight a promising direction for the enhancement of MLLMs capabilities on instruction following. We aim VIM to serve as a useful norm for advancing the state of the art and driving further progress in the field.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2311.17647

Country: North America > United States > California (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation

Lu, Yujie, Yang, Xianjun, Li, Xiujun, Wang, Xin Eric, Wang, William Yang

arXiv.org Artificial IntelligenceMay-18-2023

Existing automatic evaluation on text-to-image synthesis can only provide an image-text matching score, without considering the object-level compositionality, which results in poor correlation with human judgments. In this work, we propose LLMScore, a new framework that offers evaluation scores with multi-granularity compositionality. LLMScore leverages the large language models (LLMs) to evaluate text-to-image models. Initially, it transforms the image into image-level and object-level visual descriptions. Then an evaluation instruction is fed into the LLMs to measure the alignment between the synthesized image and the text, ultimately generating a score accompanied by a rationale. Our substantial analysis reveals the highest correlation of LLMScore with human judgments on a wide range of datasets (Attribute Binding Contrast, Concept Conjunction, MSCOCO, DrawBench, PaintSkills). Notably, our LLMScore achieves Kendall's tau correlation with human evaluations that is 58.8% and 31.2% higher than the commonly-used text-image matching metrics CLIP and BLIP, respectively.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2305.11116

Country:

Europe (0.67)
North America > United States > California (0.28)

Genre: Research Report (1.00)

Industry: Information Technology (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

Self-supervised Pre-training with Hard Examples Improves Visual Representations

Li, Chunyuan, Li, Xiujun, Zhang, Lei, Peng, Baolin, Zhou, Mingyuan, Gao, Jianfeng

arXiv.org Artificial IntelligenceJan-3-2021

Self-supervised pre-training (SSP) employs random image transformations to generate training data for visual representation learning. In this paper, we first present a modeling framework that unifies existing SSP methods as learning to predict pseudo-labels. Then, we propose new data augmentation methods of generating training examples whose pseudo-labels are harder to predict than those generated via random image transformations. Specifically, we use adversarial training and CutMix to create hard examples (HEXA) to be used as augmented views for MoCo-v2 and DeepCluster-v2, leading to two variants HEXA_{MoCo} and HEXA_{DCluster}, respectively. In our experiments, we pre-train models on ImageNet and evaluate them on multiple public benchmarks. Our evaluation shows that the two new algorithm variants outperform their original counterparts, and achieve new state-of-the-art on a wide range of tasks where limited task supervision is available for fine-tuning. These results verify that hard examples are instrumental in improving the generalization of the pre-trained models.

inductive learning, neural network, representation, (18 more...)

arXiv.org Artificial Intelligence

2012.13493

Country: North America > United States > Texas (0.14)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)

Add feedback

VinVL: Making Visual Representations Matter in Vision-Language Models

Zhang, Pengchuan, Li, Xiujun, Hu, Xiaowei, Yang, Jianwei, Zhang, Lei, Wang, Lijuan, Choi, Yejin, Gao, Jianfeng

arXiv.org Artificial IntelligenceJan-2-2021

This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used \emph{bottom-up and top-down} model \cite{anderson2018bottom}, the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model \oscar \cite{li2020oscar}, and utilize an improved approach \short\ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to public.

dataset, image understanding, neural network, (19 more...)

arXiv.org Artificial Intelligence

2101.00529

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

Budgeted Policy Learning for Task-Oriented Dialogue Systems

Zhang, Zhirui, Li, Xiujun, Gao, Jianfeng, Chen, Enhong

arXiv.org Artificial IntelligenceJun-2-2019

This paper presents a new approach that extends Deep Dyna-Q (DDQ) by incorporating a Budget-Conscious Scheduling (BCS) to best utilize a fixed, small amount of user interactions (budget) for learning task-oriented dialogue agents. BCS consists of (1) a Poisson-based global scheduler to allocate budget over different stages of training; (2) a controller to decide at each training step whether the agent is trained using real or simulated experiences; (3) a user goal sampling module to generate the experiences that are most effective for policy learning. Experiments on a movie-ticket booking task with simulated and real users show that our approach leads to significant improvements in success rate over the state-of-the-art baselines given the fixed budget.

agent, deep learning, neural network, (22 more...)

arXiv.org Artificial Intelligence

1906.00499

Genre: Research Report (1.00)

Industry: Media > Film (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.69)

Add feedback

ConvLab: Multi-Domain End-to-End Dialog System Platform

Lee, Sungjin, Zhu, Qi, Takanobu, Ryuichi, Li, Xiang, Zhang, Yaoqin, Zhang, Zheng, Li, Jinchao, Peng, Baolin, Li, Xiujun, Huang, Minlie, Gao, Jianfeng

arXiv.org Artificial IntelligenceApr-18-2019

We present ConvLab, an open-source multi-domain end-to-end dialog system platform, that enables researchers to quickly set up experiments with reusable components and compare a large set of different approaches, ranging from conventional pipeline systems to end-to-end neural models, in common environments. ConvLab offers a set of fully annotated datasets and associated pre-trained reference models. As a showcase, we extend the MultiWOZ dataset with user dialog act annotations to train all component models and demonstrate how ConvLab makes it easy and effortless to conduct complicated experiments in multi-domain end-to-end dialog settings.

artificial intelligence, convlab, natural language, (18 more...)

arXiv.org Artificial Intelligence

1904.08637

Country:

Europe (0.14)
Asia (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization

Zhang, Yizhe, Galley, Michel, Gao, Jianfeng, Gan, Zhe, Li, Xiujun, Brockett, Chris, Dolan, Bill

Neural Information Processing SystemsDec-31-2018

Responses generated by neural conversational models tend to lack informativeness and diversity. We present a novel adversarial learning method, called Adversarial Information Maximization (AIM) model, to address these two related but distinct problems. To foster response diversity, we leverage adversarial training that allows distributional matching of synthetic and real responses. To improve informativeness, we explicitly optimize a variational lower bound on pairwise mutual information between query and response. Empirical results from automatic and human evaluations demonstrate that our methods significantly boost informativeness and diversity.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
North America > Canada (0.14)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Add feedback