CLIPScore


CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

Neural Information Processing Systems

Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. There are three main data selection approaches: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. First, instead of the classical CLIPScore, which only considers the alignment between the two modalities of a single sample, we introduce $\textbf{negCLIPLoss}$, a method inspired by the CLIP training loss that adds the alignment between a sample and its contrastive pairs as an extra normalization term to CLIPScore for better quality measurement.
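
To make the distinction concrete, here is a minimal sketch, assuming precomputed CLIP image and text embeddings, of a plain CLIPScore next to a negCLIPLoss-style score that subtracts a CLIP-loss-style normalization term computed from a sample's contrastive pairs within a single random batch. The function names, the single-batch simplification, and the temperature value are illustrative assumptions; the paper's exact definition may differ.

```python
# Sketch only: plain CLIPScore vs. a negCLIPLoss-style score that normalizes
# each matched pair's similarity by its CLIP-loss denominator terms.
# The single-batch simplification and temperature are assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F

def clip_score(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between matched image/text embeddings; shape (N,)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return (img_emb * txt_emb).sum(dim=-1)

def neg_clip_loss_score(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                        tau: float = 0.01) -> torch.Tensor:
    """CLIPScore minus a normalization term built from each sample's
    similarities to the other (contrastive) pairs in the batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sims = img_emb @ txt_emb.t()                      # (N, N) pairwise similarities
    pos = sims.diag()                                 # matched-pair CLIPScore
    # log-sum-exp denominators of the image->text and text->image CLIP losses
    norm_i2t = tau * torch.logsumexp(sims / tau, dim=1)
    norm_t2i = tau * torch.logsumexp(sims / tau, dim=0)
    return pos - 0.5 * (norm_i2t + norm_t2i)
```

Samples whose caption happens to match many other images (or vice versa) receive a larger normalization term, so their quality estimate is discounted relative to raw CLIPScore.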


VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment

Rahman, Md. Mahfuzur, Gupta, Kishor Datta, Kamal, Marufa, Rahman, Fahad, Siddique, Sunzida, Hasan, Ahmed Rafi, Haque, Mohd Ariful, George, Roy

arXiv.org Artificial Intelligence

AI-based classification and segmentation play a vital role in automating disaster assessment. However, contemporary vision-language models (VLMs) produce descriptions that are poorly aligned with the objectives of disaster assessment, primarily because they lack domain knowledge and a sufficiently refined descriptive process. This work presents the Vision Language Caption Enhancer (VLCE), a multimodal framework that integrates external semantic knowledge from ConceptNet and WordNet to improve captioning, with the goal of turning raw visual data into actionable, disaster-specific descriptions. VLCE uses two separate architectures: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSAT for satellite imagery (the xBD dataset), and a Vision Transformer for UAV imagery (the RescueNet dataset). Across architectures and datasets, VLCE consistently outperforms baseline models such as LLaVA and QwenVL. Our best configuration reaches 95.33% on InfoMetIC for UAV imagery while also performing strongly on satellite imagery. The proposed framework marks a shift from basic visual classification to the generation of comprehensive situational intelligence, with direct applicability to real-time disaster assessment systems.
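
The abstract describes the knowledge-enhancement step only at a high level. As a hedged illustration of how related terms could be pulled from the two named knowledge sources, the sketch below queries WordNet (via NLTK) and the public ConceptNet API for a detected visual concept; the `expand_concept` helper and the way its output would feed the captioner are assumptions, not the paper's implementation.

```python
# Illustrative only: fetch related terms for a visual concept from WordNet and
# ConceptNet, the two knowledge sources VLCE is described as drawing on.
# `expand_concept` is a hypothetical helper, not the paper's code.
import requests
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def expand_concept(term: str, limit: int = 5) -> list[str]:
    related = set()
    # WordNet: synonyms and hypernyms of the term
    for syn in wn.synsets(term)[:limit]:
        related.update(lem.name().replace("_", " ") for lem in syn.lemmas())
        for hyper in syn.hypernyms():
            related.update(lem.name().replace("_", " ") for lem in hyper.lemmas())
    # ConceptNet: labels of related English concepts from the public API
    resp = requests.get(f"https://api.conceptnet.io/c/en/{term}", timeout=10)
    for edge in resp.json().get("edges", [])[:limit]:
        end = edge.get("end", {})
        if end.get("language") == "en":
            related.add(end.get("label", ""))
    return sorted(t for t in related if t and t.lower() != term.lower())

# e.g. expand_concept("flood") might return terms such as "deluge" or "overflow",
# which a captioner could use to ground disaster-specific vocabulary.
```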


FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection

Chen, Yulin, Wang, Zeyuan, Yu, Tianyuan, Wei, Yingmei, Bai, Liang

arXiv.org Artificial Intelligence

The well-aligned embedding space of CLIP-based models enables applications such as CLIPScore, a widely adopted image quality assessment metric. However, such CLIP-based metrics are vulnerable precisely because of this delicate multimodal alignment. In this work, we propose \textbf{FoCLIP}, a feature-space misalignment framework for fooling CLIP-based image quality metrics. Built on stochastic gradient descent, FoCLIP combines three key components to construct fooling examples: a core feature-alignment module that reduces the image-text modality gap, a score-distribution-balance module, and pixel-guard regularization, which together trade off CLIPScore performance against image quality. This design can maximize CLIPScore predictions across diverse input prompts even when the resulting images are visually unrecognizable or semantically incongruent with the adversarial prompts from a human perceptual perspective. Experiments on ten artistic-masterpiece prompts and ImageNet subsets show that optimized images achieve substantial CLIPScore improvements while preserving high visual fidelity. In addition, we find that grayscale conversion induces significant feature degradation in fooling images, producing a noticeable CLIPScore drop while preserving statistical consistency with the original images. Inspired by this phenomenon, we propose a color-channel-sensitivity-driven tampering detection mechanism that achieves 91% accuracy on standard benchmarks. In conclusion, this work establishes a practical pathway to feature misalignment in CLIP-based multimodal systems, together with a corresponding defense method.
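
To make the attack surface concrete, here is a minimal gradient-ascent sketch in the spirit of the described pipeline: it perturbs an image's pixels to raise its CLIPScore against a target prompt, while an L2 penalty (standing in for pixel-guard regularization) keeps the image close to the original. It uses the Hugging Face CLIP implementation as the scorer and omits the score-distribution-balance module; the file path, prompt, learning rate, and penalty weight are placeholders, and this is an illustration rather than the authors' FoCLIP code.

```python
# Minimal sketch (not the authors' FoCLIP): gradient ascent on pixels to raise
# CLIPScore for a target prompt, with an L2 penalty keeping the image close to
# the original. Hugging Face CLIP is used as a stand-in scorer.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("input.jpg")                        # placeholder path
prompt = "a detailed oil painting of a starry night"   # placeholder prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt")
orig_pixels = inputs["pixel_values"].clone()
pixels = orig_pixels.clone().requires_grad_(True)
with torch.no_grad():                                  # text embedding stays fixed
    txt_emb = F.normalize(
        model.get_text_features(input_ids=inputs["input_ids"],
                                attention_mask=inputs["attention_mask"]), dim=-1)

opt = torch.optim.Adam([pixels], lr=1e-2)
for step in range(200):
    img_emb = F.normalize(model.get_image_features(pixel_values=pixels), dim=-1)
    clipscore = (img_emb * txt_emb).sum()              # similarity to the prompt
    pixel_guard = F.mse_loss(pixels, orig_pixels)      # stay near the original image
    loss = -clipscore + 10.0 * pixel_guard             # raise the score, limit the change
    opt.zero_grad()
    loss.backward()
    opt.step()
```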


DiagramEval: Evaluating LLM-Generated Diagrams via Graphs

Liang, Chumeng, You, Jiaxuan

arXiv.org Artificial Intelligence

Diagrams play a central role in research papers for conveying ideas, yet they are notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of their components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: https://github.com/ulab-uiuc/diagram-eval.
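
For intuition on the graph view, here is a small hedged sketch of a node-alignment-style score: both the generated and the reference diagram are represented as directed graphs whose nodes carry text labels, and each reference node counts as aligned if some generated node's text overlaps it sufficiently. The token-overlap matching rule and the threshold are our simplifications; DiagramEval's actual metrics are defined in the paper and the repository linked above.

```python
# Illustrative sketch of a node-alignment-style score: diagrams as directed
# graphs with text-labelled nodes, matched by token overlap.
# The matching rule and threshold are simplifications, not DiagramEval itself.
import networkx as nx

def node_alignment(generated: nx.DiGraph, reference: nx.DiGraph,
                   threshold: float = 0.5) -> float:
    """Fraction of reference nodes whose text is matched by some generated node."""
    def overlap(a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)

    gen_texts = [d["text"] for _, d in generated.nodes(data=True)]
    matched = sum(
        any(overlap(d["text"], g) >= threshold for g in gen_texts)
        for _, d in reference.nodes(data=True)
    )
    return matched / max(reference.number_of_nodes(), 1)

# Toy usage: two-node "Encoder -> Decoder" diagrams
ref = nx.DiGraph()
ref.add_node(0, text="Encoder"); ref.add_node(1, text="Decoder"); ref.add_edge(0, 1)
gen = nx.DiGraph()
gen.add_node(0, text="encoder block"); gen.add_node(1, text="decoder"); gen.add_edge(0, 1)
print(node_alignment(gen, ref))   # 1.0: both reference nodes find a match
```

A path-alignment analogue would compare directed paths between matched nodes in the same spirit.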