CapsFake: A Multimodal Capsule Network for Detecting Instruction-Guided Deepfakes

Nguyen, Tuan; Khan, Naseem; Khalil, Issa

Apr-29-2025–arXiv.org Artificial Intelligence

Unlike traditional text-to-image generation, where the entire image is synthesized from scratch, instruction-guided editing targets real images and modifies specific semantic attributes (such as object identity, background context, or visual style) while preserving global visual coherence. These manipulations are particularly concerning from a cybersecurity standpoint because they maintain the illusion of authenticity while enabling adversaries to alter identity, fabricate visual evidence, or inject misinformation into trusted media pipelines. As illustrated in Figure 2, the instruction-guided image editing pipeline comprises three key AI components, each playing a distinct role in enabling semantically precise and visually coherent manipulations.

Figure 2: Malicious Image Manipulation Pipeline. A threat actor uses generative AI tools to manipulate specific elements of an image, leveraging image translation and understanding models to guide semantic edits. These capabilities facilitate identity obfuscation, impersonation, and disinformation.

First, an image translation model converts the raw source image into a descriptive textual caption that semantically captures its visual content. This step, commonly implemented with models like CLIP [22] or BLIP-2 [23], provides a language-based anchor that enables subsequent manipulation. For example, a facial image may be described as "a girl wearing a blue and white striped shirt", forming the basis for meaningful transformation prompts.
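The captioning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `Captioner`, `EditRequest`, `stub_captioner`, and `build_edit_request` are hypothetical names, and the stub returns the example caption from the text in place of a real captioning model such as BLIP-2. The sketch shows how a caption can act as the language anchor from which a transformation prompt is derived.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function that maps an image path to a caption.
Captioner = Callable[[str], str]


@dataclass(frozen=True)
class EditRequest:
    """Caption of the source image plus the edit derived from it."""
    source_caption: str
    target_caption: str
    instruction: str


def stub_captioner(image_path: str) -> str:
    # Assumption: stands in for a real captioning model; it ignores the
    # image and returns the example caption quoted in the text.
    return "a girl wearing a blue and white striped shirt"


def build_edit_request(
    image_path: str,
    captioner: Captioner,
    old_phrase: str,
    new_phrase: str,
) -> EditRequest:
    """Caption the image, then swap one attribute to form an edit prompt."""
    caption = captioner(image_path)
    if old_phrase not in caption:
        raise ValueError(f"{old_phrase!r} does not occur in caption {caption!r}")
    # Replace only the first occurrence so the rest of the caption is kept.
    target = caption.replace(old_phrase, new_phrase, 1)
    instruction = f"change {old_phrase} to {new_phrase}"
    return EditRequest(caption, target, instruction)


request = build_edit_request(
    "face.jpg",  # hypothetical path; the stub never opens the file
    stub_captioner,
    old_phrase="blue and white striped shirt",
    new_phrase="red jacket",
)
print(request.instruction)
```

Keeping the captioner behind a plain callable means a real model could be substituted for the stub without changing how the edit prompt is built.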
