AITopics | sharegpt4v

Collaborating Authors

sharegpt4v

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Unicorn: Text-Only Data Synthesis for Vision Language Model Training

Yu, Xiaomin, Ding, Pengxiang, Zhang, Wenjie, Huang, Siteng, Gao, Songyang, Qin, Chengwei, Wu, Kejian, Fan, Zhaoxin, Qiao, Ziyue, Wang, Donglin

arXiv.org Artificial IntelligenceMar-28-2025

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual captions representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLMs training. Code is available at https://github.com/Yu-xm/Unicorn.git.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.22655

Country: Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

LongProLIP: A Probabilistic Vision-Language Model with Long Context Text

Chun, Sanghyuk, Yun, Sangdoo

arXiv.org Artificial IntelligenceMar-13-2025

Recently, Probabilistic Language-Image Pre-Training (ProLIP) has been proposed to tackle the multiplicity issue of vision-language (VL) tasks. Despite their success in probabilistic representation learning at a scale, the ProLIP models cannot handle long context texts longer than 64 context length, which limits their ability to capture rich contextual information from longer text sequences. To address this issue, this paper proposes a fine-tuning strategy for ProLIP to accept longer texts, e.g., 256 text tokens. Experimental results on Urban-1k and the DataComp evaluation suite show that the proposed LongProLIP recipe can improve understanding of long contexts while minimizing the negative effect of fine-tuning.We also observe a trade-off between the long context understanding (measured by Urban-1k) and general zero-shot capability (measured by evaluation datasets by DataComp). Code is available at https://github.com/naver-ai/prolip

computer vision, dataset, longprolip, (11 more...)

arXiv.org Artificial Intelligence

2503.08048

Country: Europe > Spain > Andalusia > Granada Province > Granada (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Add feedback

Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts

Wu, Xuyang, Wang, Yuan, Wu, Hsin-Tai, Tao, Zhiqiang, Fang, Yi

arXiv.org Artificial IntelligenceJun-25-2024

Large vision-language models (LVLMs) have recently achieved significant progress, demonstrating strong capabilities in open-world visual understanding. However, it is not yet clear how LVLMs address demographic biases in real life, especially the disparities across attributes such as gender, skin tone, and age. In this paper, we empirically investigate \emph{visual fairness} in several mainstream LVLMs and audit their performance disparities across sensitive demographic attributes, based on public fairness benchmark datasets (e.g., FACET). To disclose the visual bias in LVLMs, we design a fairness evaluation framework with direct questions and single-choice question-instructed prompts on visual question-answering/classification tasks. The zero-shot prompting results indicate that, despite enhancements in visual understanding, both open-source and closed-source LVLMs exhibit prevalent fairness issues across different instruct prompts and demographic attributes.

llava-1, lvlm, sharegpt4v, (12 more...)

arXiv.org Artificial Intelligence

2406.17974

Country:

North America > United States > California > Santa Clara County > Santa Clara (0.05)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)

Add feedback