DIS-CO: Discovering Copyrighted Content in VLMs Training Data

Duarte, André V., Zhao, Xuandong, Oliveira, Arlindo L., Li, Lei

Feb-25-2025–arXiv.org Artificial Intelligence

How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Motivated by the hypothesis that a VLM is able to recognize images from its training corpus, we propose DIS-CO, a novel approach to infer the inclusion of copyrighted content during the model's development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content's identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model's training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models that expose logits. Our findings also highlight a broader concern: all tested models appear to have been exposed, to some extent, to copyrighted content. Our code and data are available at https://github.com/avduarte333/DIS-CO
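
The querying-and-scoring idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how completions are matched or aggregated, so the exact-match rule, the per-film hit rate, the hypothetical `query_model` callable, and the made-up frame identifiers below are all assumptions for illustration only.

```python
import re
from typing import Callable, List, Sequence


def normalize(title: str) -> str:
    """Lowercase and drop punctuation so 'The Matrix!' matches 'the matrix'."""
    return re.sub(r"[^a-z0-9 ]+", "", title.lower()).strip()


def movie_score(frames: Sequence[str], title: str,
                query_model: Callable[[str], str]) -> float:
    """Fraction of frames whose free-form completion names the true title.

    `query_model` is a hypothetical stand-in for a VLM call that takes a
    frame and returns the model's guess for the film title as text.
    """
    if not frames:
        return 0.0
    target = normalize(title)
    hits = sum(1 for frame in frames if normalize(query_model(frame)) == target)
    return hits / len(frames)


def auc(pos: List[float], neg: List[float]) -> float:
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy stand-in for a VLM: a lookup table of canned answers (assumed data).
fake_answers = {
    "old_f1": "The Matrix", "old_f2": "the matrix!", "old_f3": "Inception",
    "new_f1": "Unknown", "new_f2": "I am not sure",
}


def fake_model(frame: str) -> str:
    return fake_answers[frame]


# One film released before the (assumed) cutoff, one released after it.
seen_score = movie_score(["old_f1", "old_f2", "old_f3"], "The Matrix", fake_model)
unseen_score = movie_score(["new_f1", "new_f2"], "Some New Film", fake_model)
result = auc([seen_score], [unseen_score])
```

In this toy setup, films released before the cutoff act as suspected members and films released after it as known non-members, so a higher AUC means the per-film scores separate the two groups better. A real evaluation would replace `fake_model` with actual model queries and would likely need a more forgiving title-matching rule.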



Country:
- North America > United States
  - California (0.04)
  - Pennsylvania > Allegheny County
    - Pittsburgh (0.04)
  - Florida > Miami-Dade County
    - Miami (0.04)
- Europe
  - Slovenia (0.04)
  - Portugal (0.04)
  - Switzerland > Zürich
    - Zürich (0.14)
- Asia
  - Singapore (0.04)
  - Thailand > Bangkok
    - Bangkok (0.04)
  - Middle East > UAE
    - Abu Dhabi Emirate > Abu Dhabi (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
- Law (1.00)
- Information Technology (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.72)
  - Machine Learning
    - Neural Networks > Deep Learning (0.72)
    - Performance Analysis > Accuracy (0.46)
