AITopics | image-caption pair

ALimitations and Societal

Neural Information Processing SystemsApr-30-2026, 02:28:50 GMT

Limitations One limitation of our model is its potential for data bias. KOSMOS-1 is trained on a2 web-scale multimodal corpus, which means that it is likely to be biased towards the data that it was3 trained on. This could lead to the model generating text that is biased towards certain demographics4 or viewpoints.5 Another limitation of KOSMOS-1 is its relatively small size compared to other large language models.6 This means that the model may not be able to learn as complex relationships between different7 modalities. This could lead to the model making mistakes when it is asked to perform tasks that8 require a deep understanding of multiple modalities.9 Finally, KOSMOS-1 only supports vision modality.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country: North America > United States > Minnesota (0.28)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

Add feedback

2232e8fee69b150005ac420bfa83d705-Paper-Conference.pdf

Neural Information Processing SystemsApr-25-2026, 21:10:05 GMT

caption, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report (0.68)

Industry: Information Technology (0.33)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Results

Neural Information Processing SystemsApr-25-2026, 06:33:38 GMT

In addition to CYCLIP described in 2, we train two more instantiations of it by keeping either of the two consistency regularizers active in the loss objective (Eq. The instantiation trained by setting λ1 = 0and λ2 = 0.5is termed as C-CYCLIP as only cross-modal consistency regularizer term is added to the loss objective. Similarly, we get I-CYCLIP where only in-modal consistency regularizer is added to the loss by setting λ1 = 0.5 and λ2 = 0. We evaluate C-CYCLIP and I-CYCLIP on most of the experiments discussed in the main text to understand their zero-shot transfer ability on standard datasets and robustness to natural distribution shifts. A.1 Zero-shot Transfer Table 7 presents our results of the zero-shot transfer experiment described in 3.1. We find that CYCLIP outperforms its sub-variants and the CLIP model on the ImageNet1K dataset.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.78)

Add feedback

Déjà Vu Memorization in Vision–Language Models

Neural Information Processing SystemsMar-20-2026, 19:11:34 GMT

Vision-Language Models (VLMs) have emerged as the state-of-the-art representation learning solution, with myriads of downstream applications such as image classification, retrieval and generation. A natural question is whether these models memorize their training data, which also has implications for generalization. We propose a new method for measuring memorization in VLMs, which we call dèjá vu memorization. For VLMs trained on image-caption pairs, we show that the model indeed retains information about individual objects in the training images beyond what can be inferred from correlations or the image caption. We evaluate dèjá vu memorization at both sample and population level, and show that it is significant for OpenCLIP trained on as many as 50M image-caption pairs. Finally, we show that text randomization considerably mitigates memorization risk while only moderately impacting the model's downstream task performance.

artificial intelligence, machine learning, proceedings, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.61)

Add feedback

A Limitations and Societal Impacts

Neural Information Processing SystemsFeb-17-2026, 15:44:29 GMT

Limitations One limitation of our model is its potential for data bias. This could limit the applications of the model. MLLMs could be used to create fake news articles or social media posts. Hyperparameters Number of layers 24 Hidden size 2,048 FFN inner hidden size 8,192 Attention heads 32 Dropout 0.1 Attention dropout 0.1 Activation function GeLU [1] V ocabulary size 64,007 Soft tokens V size 64 Max length 2,048 Relative position embedding xPos [2] Initialization Magneto [3] Table 1: Hyperparameters of causal language model of K The detailed instruction tuning hyperparameters are listed in Table 3. The models are trained on web-scale multimodal corpora.

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(2 more...)

Industry:

Social Sector (0.40)
Media (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)

Add feedback

c0b91f9a3587bf35287f41dba5d20233-Paper-Datasets_and_Benchmarks.pdf

Neural Information Processing SystemsFeb-16-2026, 22:09:09 GMT

artificial intelligence, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Washington (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
North America > Dominican Republic (0.04)

Genre: Research Report (0.68)

Industry:

Health & Medicine (1.00)
Education > Educational Setting (0.46)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.69)

Add feedback

Exploring Diverse In-Context Configurations for Image Captioning Xu Y ang

Neural Information Processing SystemsFeb-15-2026, 13:21:28 GMT

Recently, researchers in Vision-Language (VL) domains also develop their few-shot learners, while they only use the simplest way, i .

caption, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > China > Jiangsu Province (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks

Neural Information Processing SystemsFeb-8-2026, 22:15:07 GMT

Large-scale vision-language models such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) are trained using a multimodal contrastive

caption, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report (0.68)

Industry: Information Technology > Security & Privacy (0.55)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

2cd36d327f33d47b372d4711edd08de0-Supplemental-Conference.pdf

Neural Information Processing SystemsFeb-8-2026, 02:25:26 GMT

clip model, dataset, experiment, (15 more...)

Neural Information Processing Systems

Country: North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.58)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.48)

Add feedback

Robust Contrastive Language-Image Pretraining against Data Poisoning and Backdoor Attacks

Neural Information Processing SystemsDec-24-2025, 05:16:34 GMT

Contrastive vision-language representation learning has achieved state-of-the-art performance for zero-shot classification, by learning from millions of image-caption pairs crawled from the internet. However, the massive data that powers large multimodal models such as CLIP, makes them extremely vulnerable to various types of targeted data poisoning and backdoor attacks. Despite this vulnerability, robust contrastive vision-language pre-training against such attacks has remained unaddressed. In this work, we propose RoCLIP, the first effective method for robust pre-training multimodal vision-language models against targeted data poisoning and backdoor attacks. RoCLIP effectively breaks the association between poisoned image-caption pairs by considering a relatively large and varying pool of random captions, and matching every image with the text that is most similar to it in the pool instead of its own caption, every few epochs.It also leverages image and text augmentations to further strengthen the defense and improve the performance of the model. Our extensive experiments show that RoCLIP renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training CLIP models. In particular, RoCLIP decreases the success rate for targeted data poisoning attacks from 93.75% to 12.5% and that of backdoor attacks down to 0%, while improving the model's linear probe performance by 10% and maintains a similar zero shot performance compared to CLIP. By increasing the frequency of matching, RoCLIP is able to defend strong attacks, which add up to 1% poisoned examples to the data, and successfully maintain a low attack success rate of 12.5%, while trading off the performance on some tasks.

data poisoning and backdoor attack, name change, robust contrastive language-image pretraining, (4 more...)

Neural Information Processing Systems

Industry: Information Technology > Security & Privacy (1.00)

Technology: