Visual Concept
Visual Concepts Tokenization
Endowing machines with the human-like ability to abstract visual concepts from concrete pixels has long been a fundamental goal in machine learning research fields such as disentangled representation learning and scene decomposition. Towards this goal, we propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, that parses an image into a set of disentangled visual concept tokens, with each concept token corresponding to one type of independent visual concept. In particular, to obtain these concept tokens we use only cross-attention to extract visual information from the image tokens layer by layer, with no self-attention between concept tokens, preventing information leakage across concept tokens. We further propose a Concept Disentangling Loss that encourages different concept tokens to represent independent visual concepts. Cross-attention and the disentangling loss play the roles of induction and mutual exclusion for the concept tokens, respectively. Extensive experiments on several popular datasets verify the effectiveness of VCT on the tasks of disentangled representation learning and scene decomposition, where it achieves state-of-the-art results by a large margin.
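As a concrete illustration of the mechanism this abstract describes, below is a minimal PyTorch sketch of cross-attention-only concept tokenization: learnable concept tokens query the image tokens layer by layer, with no self-attention among the concept tokens, plus a stand-in disentangling penalty. All names, sizes, and the exact form of the loss are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptTokenizer(nn.Module):
    """Hypothetical sketch: K learnable concept tokens read an image
    via cross-attention only, so no information flows between concepts."""
    def __init__(self, num_concepts=10, dim=256, num_layers=4, num_heads=4):
        super().__init__()
        self.concept_tokens = nn.Parameter(torch.randn(num_concepts, dim))
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, image_tokens):  # image_tokens: (B, N, dim) from any image encoder
        b = image_tokens.size(0)
        concepts = self.concept_tokens.unsqueeze(0).expand(b, -1, -1)
        for attn in self.layers:
            # Concept tokens are queries; image tokens are keys/values.
            out, _ = attn(concepts, image_tokens, image_tokens)
            concepts = concepts + out  # residual update, still no mixing across concepts
        return concepts  # (B, K, dim): one token per visual concept

def disentangling_loss(concepts):
    """Assumed stand-in for the Concept Disentangling Loss: penalize
    pairwise cosine similarity between different concept tokens."""
    c = F.normalize(concepts, dim=-1)
    sim = torch.matmul(c, c.transpose(1, 2))  # (B, K, K) similarities
    diag = torch.diag_embed(torch.diagonal(sim, dim1=1, dim2=2))
    return (sim - diag).abs().mean()
```

Because the only mixing operation is cross-attention from concept tokens to image tokens, each concept token can accumulate evidence only from the image, never from its siblings, which is the induction-without-leakage property the abstract emphasizes.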
Partially-Supervised Image Captioning
Image captioning models are becoming increasingly successful at describing the content of images in restricted domains. However, if these models are to function in the wild (for example, as assistants for people with impaired vision), a much larger number and variety of visual concepts must be understood. To address this problem, we teach image captioning models new visual concepts from labeled images and object detection datasets. Since image labels and object classes can be interpreted as partial captions, we formulate this problem as learning from partially-specified sequence data. We then propose a novel algorithm for training sequence models, such as recurrent neural networks, on partially-specified sequences, which we represent using finite state automata. In the context of image captioning, our method lifts the restriction that previously required image captioning models to be trained on paired image-sentence corpora only, or otherwise required specialized model architectures to take advantage of alternative data modalities. Applying our approach to an existing neural captioning model, we achieve state-of-the-art results on the novel object captioning task using the COCO dataset. We further show that we can train a captioning model to describe new visual concepts from the Open Images dataset while maintaining competitive COCO evaluation scores.
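To make the "partial captions as finite state automata" idea concrete, here is a tiny Python sketch in which an image label is compiled into a two-state automaton that accepts exactly those captions mentioning the labeled object. The construction is an illustrative assumption; the paper's automata can encode richer partial specifications.

```python
def label_fsa(required_word):
    """Hypothetical two-state FSA for an image label:
    state 0 = label not yet mentioned, state 1 = mentioned (accepting)."""
    def step(state, token):
        return 1 if state == 1 or token == required_word else 0
    return step, 0, {1}  # transition function, start state, accepting states

def accepts(fsa, caption_tokens):
    step, state, accepting = fsa
    for token in caption_tokens:
        state = step(state, token)
    return state in accepting

fsa = label_fsa("zebra")
print(accepts(fsa, "a zebra grazing in a field".split()))  # True
print(accepts(fsa, "a horse grazing in a field".split()))  # False
```

Training can then target complete captions that such an automaton accepts (e.g., found via constrained decoding), rather than requiring fully specified ground-truth sentences.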
Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering
Medhini Narasimhan, Svetlana Lazebnik, Alexander Schwing
Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction, a novel 'fact-based' visual question answering (FVQA) task has been introduced recently, along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation.
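The abstract above describes facts that connect two candidate answers via a relation, over which the paper reasons with graph convolution nets. Below is a minimal NumPy sketch, using assumed toy facts and random features, of turning a fact list into an entity graph and applying one Kipf-and-Welling-style graph-convolution step; it illustrates the general technique, not the paper's model.

```python
import numpy as np

# Toy facts: (entity, relation, entity). Purely illustrative.
facts = [("cat", "is_a", "animal"),
         ("dog", "is_a", "animal"),
         ("dog", "capable_of", "bark")]

entities = sorted({e for h, _, t in facts for e in (h, t)})
idx = {e: i for i, e in enumerate(entities)}
n = len(entities)

# Undirected entity adjacency with self-loops.
A = np.eye(n)
for h, _, t in facts:
    A[idx[h], idx[t]] = A[idx[t], idx[h]] = 1.0

# Symmetrically normalized propagation: H = ReLU(D^-1/2 A D^-1/2 X W).
d = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(d, d))
X = np.random.randn(n, 8)           # stand-in entity features
W = np.random.randn(8, 8)           # learnable weights in a real model
H = np.maximum(A_hat @ X @ W, 0.0)  # one round of message passing
print(H.shape)                       # (4, 8): updated entity representations
```

Stacking a few such layers lets evidence flow between connected entities, so a candidate answer can be scored using both the image observation and its neighbors in the fact graph.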