AITopics | image-text pair

e10a6a906ef323efaf708f76cf3c1d1e-Paper-Conference.pdf

Neural Information Processing Systems, Apr-30-2026, 01:34:53 GMT

detection, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country: Europe (0.28)

Genre: Research Report (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)

Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation

Neural Information Processing Systems, Apr-29-2026, 23:19:24 GMT

Vision-Language Pre-training has demonstrated remarkable zero-shot recognition ability and the potential to learn generalizable visual representations from language supervision. Taking a step further, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, state-of-the-art methods suffer from clear semantic gaps between the visual and textual modalities: many visual concepts that appear in images are missing from their paired captions. Such semantic misalignment propagates through pre-training, leading to inferior zero-shot performance in dense prediction because the textual representations capture too few visual concepts. To close this semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually matched concepts using our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can thus be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves strong zero-shot transfer performance and improves the language-supervised segmentation baseline by a large margin, suggesting the value of bridging the semantic gap in pre-training data.

large language model, machine learning, segmentation, (17 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Israel (0.15)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.57)

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Neural Information Processing Systems, Apr-25-2026, 21:26:27 GMT

Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method requires neither bounding box annotations nor high-resolution images.
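
The idea of aligning image and text representations with a contrastive loss can be illustrated with a minimal sketch. This is a generic symmetric image-text contrastive (InfoNCE-style) loss written in plain Python, not the authors' exact objective, which the abstract does not spell out. The function names, the toy embeddings, and the temperature value of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[k] and text_embs[k] are assumed to describe the same pair;
    every other combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity of every image with every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text: each row should peak at its own caption.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image: each column should peak at its own image.
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: matched pairs point the same way, mismatched pairs do not.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each image embedding matches its own caption and large when the captions are swapped, which is the signal that pulls paired representations together before any cross-modal fusion.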

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Overview

Neural Information Processing Systems, Apr-25-2026, 08:28:40 GMT

In this section, we mainly introduce the axiomatic properties of the Shapley value. Weber et al. [17] have proved that the Shapley value is the unique metric that satisfies the following axioms: Linearity, Symmetry, Dummy, and Efficiency. Linearity Axiom. If two independent games $u$ and $v$ can be linearly merged into one game $w(S) = u(S) + v(S)$, then the Shapley value of each player $i \in N$ in the new game $w$ is the sum of the Shapley values of player $i$ in the games $u$ and $v$, which can be formulated as: $\phi_w(i|N) = \phi_u(i|N) + \phi_v(i|N)$ (1). Symmetry Axiom. Consider two players $i$ and $j$ in a game $v$. If they satisfy $\forall S \subseteq N \setminus \{i,j\},\ v(S \cup \{i\}) = v(S \cup \{j\})$ (2), then $\phi_v(i|N) = \phi_v(j|N)$. Dummy Axiom. The dummy player is defined as the player that has no interaction with other players. Formally, if a player $i$ in a game $v$ satisfies $\forall S \subseteq N \setminus \{i\},\ v(S \cup \{i\}) = v(S) + v(\{i\})$ (3), then this player is defined as the dummy player.
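
The axioms above can be checked numerically on a small example. The sketch below computes exact Shapley values by enumerating all coalitions, using the standard closed-form weighting; the three-player games `game_u`, `game_v`, and `game_w` are hypothetical toy games chosen for illustration and do not come from the paper. The excerpt ends before stating the consequence of the Dummy axiom; the check below uses the standard statement that a dummy player's Shapley value equals its standalone value $v(\{i\})$.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def shapley_values(players, v):
    """Exact Shapley values; v maps a frozenset of players to a number."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = Fraction(0)
        for k in range(n):
            # Weight of a coalition S of size k that does not contain i.
            weight = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (v(s | {i}) - v(s))
        phi[i] = total
    return phi


PLAYERS = ("a", "b", "c")


def game_u(s):
    """Hypothetical symmetric game: the reward depends only on coalition size."""
    return len(s) ** 2


def game_v(s):
    """Hypothetical game: 'a' adds 3 on its own; 'b' and 'c' add 1 only together."""
    return (3 if "a" in s else 0) + (1 if {"b", "c"} <= s else 0)


def game_w(s):
    """Merged game w(S) = u(S) + v(S), as in the Linearity axiom."""
    return game_u(s) + game_v(s)


phi_u = shapley_values(PLAYERS, game_u)
phi_v = shapley_values(PLAYERS, game_v)
phi_w = shapley_values(PLAYERS, game_w)

# Linearity: values in the merged game are the sums of the per-game values.
linearity_holds = all(phi_w[p] == phi_u[p] + phi_v[p] for p in PLAYERS)
# Symmetry: 'b' and 'c' are interchangeable in game_v.
symmetry_holds = phi_v["b"] == phi_v["c"]
# Dummy: 'a' never interacts with the others in game_v.
dummy_holds = phi_v["a"] == game_v(frozenset({"a"}))
# Efficiency: the values sum to the reward of the grand coalition.
efficiency_holds = sum(phi_v.values()) == game_v(frozenset(PLAYERS))
```

Exact rational arithmetic via `Fraction` is used so the equalities can be tested without floating-point tolerance. Full enumeration costs on the order of $2^{|N|}$ evaluations of the game, so it is only practical for a handful of players.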

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Game Theory (0.96)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.47)

23fa71cc32babb7b91130824466d25a5-Supplemental.pdf

Neural Information Processing Systems, Apr-25-2026, 03:28:13 GMT

artificial intelligence, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.75)

072fd0525592b43da661e254bbaadc27-Supplemental-Conference.pdf

Neural Information Processing Systems, Apr-24-2026, 09:49:23 GMT

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

072fd0525592b43da661e254bbaadc27-Paper-Conference.pdf

Neural Information Processing Systems, Apr-24-2026, 09:49:19 GMT

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Real Text Synthetic Text The dog is resting

Neural Information Processing Systems, Apr-24-2026, 07:54:53 GMT

We propose two methods for text-image alignment evaluation: VQ2 and VNLI, demonstrated with example pairs.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

Europe (0.93)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.93)

Industry: Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)

Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

Neural Information Processing Systems, Apr-24-2026, 04:17:19 GMT

We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, with a focus on determining the most relevant visual features for corresponding text. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts, which is trained exclusively on linguistic data, bypassing the need for image-text pairings.

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Industry: Education > Curriculum > Subject-Specific Education (0.40)

Technology: