CapsFake: A Multimodal Capsule Network for Detecting Instruction-Guided Deepfakes

Nguyen, Tuan, Khan, Naseem, Khalil, Issa

arXiv.org Artificial Intelligence 

Unlike traditional text-to-image generation, where the entire image is synthesized from scratch, instruction-guided editing targets real images and modifies specific semantic attributes (such as object identity, background context, or visual style) while preserving global visual coherence. These manipulations are particularly concerning from a cybersecurity standpoint because they maintain the illusion of authenticity while enabling adversaries to alter identity, fabricate visual evidence, or inject misinformation into trusted media pipelines. As illustrated in Figure 2, the instruction-guided image editing pipeline comprises three key AI components, each playing a distinct role in enabling semantically precise and visually coherent manipulations.

Figure 2: Malicious Image Manipulation Pipeline. A threat actor uses generative AI tools to manipulate specific elements of an image, leveraging image translation and understanding models to guide semantic edits. These capabilities facilitate identity obfuscation, impersonation, and disinformation.

First, an image translation model converts the raw source image into a descriptive textual caption that semantically captures its visual content. This step, commonly implemented with models like CLIP [22] or BLIP-2 [23], provides a language-based anchor that enables subsequent manipulation. For example, a facial image may be described as "a girl wearing a blue and white striped shirt", forming the basis for meaningful transformation prompts.
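The captioning step can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: `Captioner`, `stub_captioner`, `EditRequest`, and `build_edit_request` are hypothetical names, the stub returns the example caption quoted above instead of calling a real model such as BLIP-2, and deriving the target description by simple string substitution is an assumed simplification of how a transformation prompt might be formed.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function that maps raw image bytes to a caption.
Captioner = Callable[[bytes], str]


@dataclass(frozen=True)
class EditRequest:
    caption: str         # language anchor describing the source image
    target_caption: str  # caption with one attribute swapped (the edit goal)


def stub_captioner(image: bytes) -> str:
    """Stand-in for a real captioning model; always returns the example caption."""
    return "a girl wearing a blue and white striped shirt"


def build_edit_request(
    image: bytes, captioner: Captioner, old: str, new: str
) -> EditRequest:
    """Caption the image, then swap one attribute to describe the desired edit."""
    caption = captioner(image)
    if old not in caption:
        # The caption is the anchor: an attribute it does not mention cannot
        # be targeted by this simplified substitution scheme.
        raise ValueError(f"attribute {old!r} not found in caption")
    return EditRequest(caption=caption, target_caption=caption.replace(old, new, 1))


request = build_edit_request(
    b"", stub_captioner, "blue and white striped shirt", "red jacket"
)
print(request.caption)
print(request.target_caption)
```

Keeping the captioner behind a plain callable means the stub could be replaced by a real image-to-text model without changing the rest of the sketch.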
