AITopics | cross-modal alignment

Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation

Neural Information Processing SystemsJun-21-2026, 08:47:10 GMT

Multimodal Dataset Distillation (MDD) seeks to condense large-scale image-text datasets into compact surrogates while retaining their effectiveness for cross-modal learning. Despite recent progress, existing MDD approaches often suffer from Modality Collapse, characterized by over-concentrated intra-modal representations and enlarged distributional gap across modalities. In this paper, for the first time, we identify this issue as stemming from a fundamental conflict between the over-compression behavior inherent in dataset distillation and the cross-modal supervision imposed by contrastive objectives. To alleviate modality collapse, we introduce RepBlend, a novel MDD framework that weakens overdominant cross-modal supervision via representation blending, thereby significantly enhancing intra-modal diversity. Additionally, we observe that current MDD methods impose asymmetric supervision across modalities, resulting in biased optimization. To address this, we propose symmetric projection trajectory matching, which synchronizes the optimization dynamics using modality-specific projection heads, thereby promoting balanced supervision and enhancing cross-modal alignment. Experiments on Flickr-30K and MS-COCO show that RepBlend consistently outperforms prior state-of-the-art MDD methods, achieving significant gains in retrieval performance (e.g., +9.4 IR@10, +6.3 TR@10 under the 100-pair setting) and offering up to 6.7 distillation speedup.

distillation, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: Asia (0.68)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Information Technology > Services (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Communications (0.67)

Add feedback

Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process

Neural Information Processing SystemsJun-14-2026, 07:41:52 GMT

Developing effective multimodal fusion approaches has become increasingly essential in many real-world scenarios, such as health care and finance. The key challenge is how to preserve the feature expressiveness in each modality while learning cross-modal interactions. Previous approaches primarily focus on the cross-modal alignment, while over-emphasis on the alignment of marginal distributions of modalities may impose excess regularization and obstruct meaningful representations within each modality. The Dirichlet process (DP) mixture model is a powerful Bayesian non-parametric method that can amplify the most prominent features by its richer-gets-richer property, which allocates increasing weights to them. Inspired by this unique characteristic of DP, we propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment. Specifically, we assume that each modality follows a mixture of multivariate Gaussian distributions and further adopt DP to calculate the mixture weights for all the components. This paradigm allows DP to dynamically allocate the contributions of features and select the most prominent ones, leveraging its richer-gets-richer property, thus facilitating multimodal feature fusion. Extensive experiments on several multimodal datasets demonstrate the superior performance of our model over other competitors.

artificial intelligence, machine learning, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision

Neural Information Processing SystemsApr-24-2026, 11:51:36 GMT

In Vision-and-Language Navigation (VLN) task, an agent is asked to navigate inside 3D indoor environments following given instructions. Cross-modal alignment is one of the most critical challenges in VLN because the predicted trajectory needs to match the given instruction accurately. In this paper, we address the cross-modal alignment challenge from the perspective of fine-grain. Firstly, to alleviate weak cross-modal alignment supervision from coarse-grained data, we introduce a human-annotated fine-grained VLN dataset, namely Landmark-RxR. Secondly, to further enhance local cross-modal alignment under fine-grained supervision, we investigate the focal-oriented rewards with soft and hard forms, by focusing on the critical points sampled from fine-grained Landmark-RxR. Moreover, to fully evaluate the navigation process, we also propose a re-initialization mechanism that makes metrics insensitive to difficult points, which can cause the agent to deviate from the correct trajectories. Experimental results show that our agent has superior navigation performance on Landmark-RxR, en-RxR and R2R.

artificial intelligence, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country: Asia > China (0.14)

Genre: Research Report (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)

Add feedback

018b59ce1fd616d874afad0f44ba338d-Paper.pdf

Neural Information Processing SystemsApr-24-2026, 10:09:18 GMT

artificial intelligence, machine learning, relation, (16 more...)

Neural Information Processing Systems

Country: Asia > India (0.28)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection - Supplemental Material - 1 More Ablation Studies 1.1 Number of test categories in evaluation

Neural Information Processing SystemsFeb-17-2026, 15:20:08 GMT

The distillation can also cover 3D object boxes on the background.

artificial intelligence, category, machine learning, (16 more...)

Neural Information Processing Systems

Industry: Information Technology > Security & Privacy (0.94)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Security & Privacy (0.94)
Information Technology > Artificial Intelligence > Vision (0.67)

Add feedback

e352b765e625934ce86919995e2371aa-Paper-Conference.pdf

Neural Information Processing SystemsFeb-17-2026, 15:20:05 GMT

category, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: Asia > China > Hong Kong (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.73)
Information Technology > Artificial Intelligence > Natural Language (0.68)

Add feedback

Landmark-RxR: SolvingVision-and-Language NavigationwithFine-GrainedAlignmentSupervision

Neural Information Processing SystemsFeb-7-2026, 08:05:27 GMT

In Vision-and-Language Navigation (VLN) task, an agent is asked to navigate inside 3D indoor environments following given instructions. Cross-modal alignment is one of the most critical challenges in VLN because the predicted trajectory needs to match the given instruction accurately.

machine learning, natural language, trajectory, (18 more...)

Neural Information Processing Systems

Country: Asia > China > Beijing > Beijing (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.68)

Add feedback

Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

Akbarian, Fatemeh, Baninajjar, Anahita, Zhang, Yingyi, Balashankar, Ananth, Aminifar, Amir

arXiv.org Artificial IntelligenceDec-1-2025

Abstract--Multi-modal foundation models align images, text, and other modalities in a shared embedding space but remain vulnerable to adversarial illusions [35], where imperceptible perturbations disrupt cross-modal alignment and mislead downstream tasks. T o counteract the effects of adversarial illusions, we propose a task-agnostic mitigation mechanism that reconstructs the input from the attacker's perturbed input through generative models, e.g., V ariational Autoencoders (V AEs), to maintain natural alignment. T o further enhance our proposed defense mechanism, we adopt a generative sampling strategy combined with a consensus-based aggregation scheme over the outcomes of the generated samples. Our experiments on the state-of-the-art multi-modal encoders show that our approach substantially reduces the illusion attack success rates to near-zero and improves cross-modal alignment by 4% (42 46) and 11% (32 43) in unperturbed and perturbed input settings respectively, providing an effective and model-agnostic defense against adversarial illusions. Multi-modal foundation models have rapidly advanced the frontier of visual and linguistic understanding. Foundation models such as CLIP [19], ALIGN [11], and ImageBind [8] align a variety of heterogeneous modalities including images, text, and other modalities within a shared embedding space, thereby enabling zero-shot classification, cross-modal retrieval, and generative conditioning. The shared embedding space that underpins cross-modal flexibility simultaneously introduces a new attack surface, giving rise to adversarial illusions [35]. As downstream tasks directly rely on the integrity of this shared representation, even small perturbations in one modality can induce semantic misalignment across others, misleading models that depend on the embedding for retrieval, captioning, or generative conditioning. Defending against such cross-modal attacks presents unique challenges.

accuracy, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.21893

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

You, Xiaoxing, Huang, Qiang, Li, Lingyu, Zhang, Chi, Liu, Xiaopeng, Zhang, Min, Yu, Jun

arXiv.org Artificial IntelligenceNov-27-2025

News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.

large language model, machine learning, merge, (22 more...)

arXiv.org Artificial Intelligence

2511.21002

Country: Asia > China (0.68)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Sports (0.93)
Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

CASPER: Cross-modal Alignment of Spatial and single-cell Profiles for Expression Recovery

Kumar, Amit, Kaur, Maninder, Mall, Raghvendra, Gupta, Sukrit

arXiv.org Artificial IntelligenceNov-20-2025

Spatial Transcriptomics enables mapping of gene expression within its native tissue context, but current platforms measure only a limited set of genes due to experimental constraints and excessive costs. To overcome this, computational models integrate Single-Cell RNA Sequencing data with Spatial Transcriptomics to predict unmeasured genes. We propose CASPER, a cross-attention based framework that predicts unmeasured gene expression in Spatial Transcriptomics by leveraging centroid-level representations from Single-Cell RNA Sequencing. We performed rigorous testing over four state-of-the-art Spatial Transcriptomics/Single-Cell RNA Sequencing dataset pairs across four existing baseline models. CASPER shows significant improvement in nine out of the twelve metrics for our experiments. This work paves the way for further work in Spatial Transcriptomics to Single-Cell RNA Sequencing modality translation. The code for CASPER is available at https://github.com/AI4Med-Lab/CASPER.

artificial intelligence, gene expression, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.15139

Country: Asia > Middle East > Qatar (0.29)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Add feedback

Filters

Collaborating Authors

cross-modal alignment

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation

Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process

Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision

018b59ce1fd616d874afad0f44ba338d-Paper.pdf

CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection - Supplemental Material - 1 More Ablation Studies 1.1 Number of test categories in evaluation

e352b765e625934ce86919995e2371aa-Paper-Conference.pdf

Landmark-RxR: SolvingVision-and-Language NavigationwithFine-GrainedAlignmentSupervision

Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

CASPER: Cross-modal Alignment of Spatial and single-cell Profiles for Expression Recovery