

11fc8c98b46d4cbdfe8157267228f7d7-Supplemental-Conference.pdf

Neural Information Processing Systems

We follow most of the settings in Uni-Perceiver [93]: cross-entropy loss with label smoothing of 0.1 is adopted for all tasks, and the negative samples for retrieval tasks come only from the local batch on the current GPU. We also apply the same data augmentation techniques as Uni-Perceiver [93] to the image and video modalities to avoid overfitting. Some settings are changed to improve the training stability of the original Uni-Perceiver. Following [102], a uniform drop rate for stochastic depth is used across all encoder layers and is adapted according to the model size. Additionally, LayerScale [101] is used to facilitate the convergence of Transformer training, and the same initialization value of 10^-3 is used for all models for simplicity.
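The LayerScale mechanism mentioned above can be sketched in a few lines: each residual-branch output is multiplied element-wise by a learnable per-channel vector initialized to a small constant (10^-3 for all model sizes here). This is a minimal illustrative sketch; the helper names and the use of plain Python lists in place of learnable tensors are assumptions, not the paper's code.

```python
# Minimal sketch of LayerScale: a per-channel scale vector, initialized to a
# small constant, multiplies the residual-branch output before the skip add.
def layer_scale_init(dim, init_value=1e-3):
    """Initial per-channel scale vector (a learnable parameter in practice)."""
    return [init_value] * dim

def apply_layer_scale(residual_out, scale):
    """Element-wise scaling of a residual-branch output."""
    return [x * s for x, s in zip(residual_out, scale)]

scale = layer_scale_init(4)          # [0.001, 0.001, 0.001, 0.001]
out = apply_layer_scale([1.0, -2.0, 0.5, 3.0], scale)
```

Because the scale starts near zero, each block initially contributes almost nothing to the residual stream, which is what stabilizes early training.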



Table 6: Uni-Perceiver model variants used in this paper. Uni-Perceiver-B and Uni-Perceiver-L have the same architectures as their corresponding ViT variants. Some settings are changed to improve the training stability of the original Uni-Perceiver. The loss weights are adjusted to ensure reasonable optimization of all tasks, based on the early training losses observed in short-epoch experiments. With the above settings, Uni-Perceiver can be trained more efficiently.


Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data

Wang, Haonan, Huang, Minbin, Huang, Runhui, Hong, Lanqing, Xu, Hang, Hu, Tianyang, Liang, Xiaodan, Li, Zhenguo, Cheng, Hong, Kawaguchi, Kenji

arXiv.org Artificial Intelligence

Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning. Improving CLIP typically requires additional data and retraining with new loss functions, but these demands raise resource and time costs, limiting practical use. In this work, we introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs already present in existing datasets during continued training. This eliminates the need for additional data or extensive retraining. Moreover, HELIP integrates effortlessly into current training pipelines with minimal code modifications, allowing for quick and seamless implementation. On comprehensive benchmarks, HELIP consistently boosts existing models. In particular, within just two epochs of training, it improves zero-shot classification accuracy on ImageNet for SLIP models pre-trained on the CC3M, CC12M, and YFCC15M datasets by 3.05%, 4.47%, and 10.1%, respectively. In addition, on fine-grained classification datasets, HELIP improves the zero-shot performance of CLIP and SLIP by an average of 8.4% and 18.6%, respectively, and their linear-probe performance by an average of 9.5% and 3.0%. The code is publicly available at: https://github.com/haonan3/HELIP-NACCL-2025.git.
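The "challenging pair" idea in the abstract can be illustrated with a toy hard-pair selector: rank the other pairs in a pool by embedding similarity to a target pair and keep the closest ones for continued training. This is a hedged sketch of the general hard-example-mining pattern, not HELIP's actual algorithm; the function names and the use of plain lists as embeddings are assumptions.

```python
# Toy hard-pair selection: pairs most similar to a target pair in embedding
# space are the hardest to distinguish, so they are kept for extra training.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def hardest_pairs(target_emb, pool_embs, k=2):
    """Indices of the k pool pairs most similar to the target pair."""
    ranked = sorted(range(len(pool_embs)),
                    key=lambda i: cosine(target_emb, pool_embs[i]),
                    reverse=True)
    return ranked[:k]
```

For example, `hardest_pairs([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]], k=2)` returns `[1, 2]`, the two pool entries nearest the target in cosine similarity.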


SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?

Hammoud, Hasan Abed Al Kader, Itani, Hani, Pizzati, Fabio, Torr, Philip, Bibi, Adel, Ghanem, Bernard

arXiv.org Artificial Intelligence

We present SynthCLIP, a novel framework for training CLIP models on entirely synthetic text-image pairs, departing significantly from previous methods that rely on real data. Leveraging recent text-to-image (TTI) generative networks and large language models (LLMs), we are able to generate synthetic datasets of images and corresponding captions at any scale, with no human intervention. With training at scale, SynthCLIP achieves performance comparable to CLIP models trained on real datasets. We also introduce SynthCI-30M, a purely synthetic dataset comprising 30 million captioned images. Our code, trained models, and generated data are released at https://github.com/hammoudhasan/SynthCLIP.


MLLMs-Augmented Visual-Language Representation Learning

Liu, Yanqing, Wang, Kai, Shao, Wenqi, Luo, Ping, Qiao, Yu, Shou, Mike Zheng, Zhang, Kaipeng, You, Yang

arXiv.org Artificial Intelligence

Visual-language pre-training (VLP) has achieved remarkable success in multi-modal tasks, largely attributed to the availability of large-scale image-text datasets. In this work, we demonstrate that multi-modal large language models (MLLMs) can enhance visual-language representation learning by improving data quality. Our approach is simple: we use MLLMs to generate multiple extended captions for each image. To prevent the bias introduced by MLLMs' hallucinations and intrinsic caption styles, we propose "text shearing", which keeps the extended captions at the same length as the original captions. In image-text retrieval, our method consistently obtains 5.6~35.0% and 16.8~46.1% improvements in R@1 under the fine-tuning and zero-shot settings, respectively. Notably, we obtain zero-shot results comparable to fine-tuning on the target datasets, which encourages further exploration of the versatile uses of MLLMs.
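The "text shearing" idea described above reduces to truncating an MLLM-extended caption back to the length of the original caption. A minimal sketch under the assumption of whitespace tokenization (the paper may shear at a different token granularity):

```python
# Text shearing sketch: cut an extended caption back to the original caption's
# token count, so longer, hallucination-prone continuations are discarded.
def text_shear(original_caption, extended_caption):
    n = len(original_caption.split())
    return " ".join(extended_caption.split()[:n])
```

For instance, with the original "a dog on grass" (4 tokens), the extended caption "a small brown dog playing on green grass" is sheared to "a small brown dog".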


CgT-GAN: CLIP-guided Text GAN for Image Captioning

Yu, Jiarui, Li, Haoran, Hao, Yanbin, Zhu, Bin, Xu, Tong, He, Xiangnan

arXiv.org Artificial Intelligence

The large-scale visual-language pre-trained model Contrastive Language-Image Pre-training (CLIP) has significantly improved image captioning in scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based image captioning methods without human annotations follow a text-only training paradigm, i.e., reconstructing text from a shared embedding space. Nevertheless, these approaches are limited by a training/inference gap or by huge storage requirements for text embeddings. Given that it is trivial to obtain images in the real world, we propose the CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process to enable the model to "see" real visual modality. In particular, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus, and a CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded by the caption's naturalness with respect to human language, computed by the GAN's discriminator, and by the semantic guidance reward computed by the CLIP-based reward module. In addition to using cosine similarity as the semantic guidance reward (i.e., CLIP-cos), we introduce a novel semantic guidance reward called CLIP-agg, which aligns the generated caption with a weighted text embedding obtained by attentively aggregating the entire corpus. Experimental results on three subtasks (ZS-IC, In-UIC, and Cross-UIC) show that CgT-GAN significantly outperforms state-of-the-art methods across all metrics. Code is available at https://github.com/Lihr747/CgtGAN.
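The CLIP-agg reward described in the abstract can be illustrated as attention over a corpus of text embeddings: the generated caption's embedding produces softmax weights over the corpus, and the weighted sum is the aggregate it is aligned with. This is a hedged sketch of the aggregation step only; the function names, the use of raw dot products as attention logits, and plain-list embeddings are assumptions for illustration.

```python
import math

# CLIP-agg-style aggregation sketch: softmax over caption/corpus similarities
# yields attention weights; the aggregate is the weighted sum of corpus
# embeddings, which the generated caption is then aligned with.
def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def clip_agg_embedding(caption_emb, corpus_embs):
    logits = [sum(a * b for a, b in zip(caption_emb, t)) for t in corpus_embs]
    weights = softmax(logits)
    dim = len(caption_emb)
    return [sum(w * t[d] for w, t in zip(weights, corpus_embs))
            for d in range(dim)]
```

With caption embedding `[1.0, 0.0]` and a corpus `[[1.0, 0.0], [0.0, 1.0]]`, the aggregate leans toward the more similar corpus entry, since that entry receives the larger attention weight.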