AITopics

Country:

North America > Canada > Quebec > Montreal (0.14)
North America > United States (0.14)
South America > Colombia > Meta Department > Villavicencio (0.04)
(7 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Law (1.00)
Education (1.00)
Government (0.67)
(3 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Neural Information Processing Systems Dec-25-2025, 17:06:51 GMT

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

Text-guided image editing is widely needed in daily life, ranging from personal use to professional applications such as Photoshop. However, existing methods are either zero-shot or trained on an automatically synthesized dataset, which contains a high volume of noise. Thus, they still require lots of manual tuning to produce desirable outcomes in practice. To address this issue, we introduce MagicBrush, the first large-scale, manually annotated dataset for instruction-guided real image editing that covers diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises over 10K manually annotated triplets (source image, instruction, target image), which supports training large-scale text-guided image editing models. We fine-tune InstructPix2Pix on MagicBrush and show that the new model can produce much better images according to human evaluation. We further conduct extensive experiments to evaluate current image editing baselines from multiple dimensions including quantitative, qualitative, and human evaluations. The results reveal the challenging nature of our dataset and the gap between current baselines and real-world editing needs.

annotated dataset, instruction-guided image editing, magicbrush, (3 more...)

Industry: Media > Photography (1.00)

Technology: Information Technology > Artificial Intelligence (0.40)

Neural Information Processing Systems Oct-11-2025, 00:18:49 GMT

Learning Action and Reasoning-Centric Image Editing from Videos and Simulations

Object, attribute or stylistic changes can be learned from visually static datasets. On the other hand, high-quality data for action and reasoning-centric edits is scarce and has to come from entirely different sources that cover e.g.

dataset, editing, reasoning, (16 more...)

Country:

North America > Canada > Quebec > Montreal (0.14)
North America > United States (0.14)
South America > Colombia > Meta Department > Villavicencio (0.04)
(7 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Law (1.00)
Education (1.00)
Government (0.67)
(3 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Neural Information Processing Systems Oct-9-2025, 17:35:35 GMT

05a30a0fc9e6bacdd3abd4ca8508a9e6-Supplemental-Datasets_and_Benchmarks_Track.pdf

dataset, editing, instruction, (16 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Maryland > Baltimore (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(2 more...)

Industry:

Leisure & Entertainment > Sports (1.00)
Transportation (0.69)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems Oct-9-2025, 17:35:32 GMT

05a30a0fc9e6bacdd3abd4ca8508a9e6-Paper-Datasets_and_Benchmarks_Track.pdf

dataset, editing, instruction, (17 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > China (0.04)
(4 more...)

Genre: Research Report (0.68)

Industry:

Leisure & Entertainment > Sports (1.00)
Transportation (0.93)
Information Technology (0.92)
Government (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial Intelligence Jun-9-2025

Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models

Qiu, Yifu, Ziser, Yftah, Korhonen, Anna, Cohen, Shay B., Ponti, Edoardo M.

To what extent do vision-and-language foundation models possess a realistic world model (observation $\times$ action $\rightarrow$ observation) and a dynamics model (observation $\times$ observation $\rightarrow$ action), when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, the dynamics model can annotate actions for unlabelled pairs of video frame observations to expand the training data. We further propose a new objective, where image tokens in observation pairs are weighted by their importance, as predicted by a recognition model. Secondly, the dynamics models can assign rewards to multiple samples of the world model to score them, effectively guiding search at inference time. We evaluate the world models resulting from both strategies through the task of action-centric image editing on Aurora-Bench. Our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin of $15\%$ on real-world subsets according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
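The second strategy in the abstract above (inference-time verification) can be read as a best-of-N search: draw several candidate observations from the world model, let the dynamics model score each one, and keep the highest-scoring candidate. The following is a minimal sketch under that reading, not the authors' implementation. The names `best_of_n`, `sample_world_model`, and `score_dynamics` are hypothetical, and both models are passed in as plain callables so that the example stays self-contained.

```python
from typing import Callable, Tuple, TypeVar

Obs = TypeVar("Obs")  # an observation, e.g. an image; left abstract here


def best_of_n(
    source: Obs,
    action: str,
    sample_world_model: Callable[[Obs, str], Obs],
    score_dynamics: Callable[[Obs, Obs, str], float],
    n: int = 4,
) -> Tuple[Obs, float]:
    """Return the candidate target observation with the highest reward.

    sample_world_model(source, action) -> one sampled target observation
    (hypothetical stand-in for the world model).
    score_dynamics(source, target, action) -> reward saying how well the
    pair (source, target) matches the action (hypothetical stand-in for
    the dynamics model used as a verifier).
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # Draw n independent candidates from the world model.
    candidates = [sample_world_model(source, action) for _ in range(n)]

    # Score every candidate with the dynamics model.
    scored = [(score_dynamics(source, c, action), c) for c in candidates]

    # Keep the candidate with the highest reward.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

In practice the two callables would wrap model inference; the sketch only fixes the control flow of sampling, scoring, and selecting.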

large language model, machine learning, natural language, (20 more...)

2506.06006

Genre:

Research Report > New Finding (0.46)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
(2 more...)

Nguyen, Tuan, Khan, Naseem, Khalil, Issa

CapsFake: A Multimodal Capsule Network for Detecting Instruction-Guided Deepfakes

arXiv.org Artificial Intelligence Apr-29-2025

Unlike traditional text-to-image generation, where the entire image is synthesized from scratch, instruction-guided editing targets real images and modifies specific semantic attributes (such as object identity, background context, or visual style) while preserving global visual coherence. These manipulations are particularly concerning from a cybersecurity standpoint because they maintain the illusion of authenticity while enabling adversaries to alter identity, fabricate visual evidence, or inject misinformation into trusted media pipelines. As illustrated in Figure 2, the instruction-guided image editing pipeline comprises three key AI components, each playing a distinct role in enabling semantically precise and visually coherent manipulations. First, an image translation model is used to convert the raw source image into a descriptive textual caption that semantically captures its visual content. This step, commonly implemented with models like CLIP [22], or BLIP-2 [23], provides a language-based anchor that enables subsequent manipulation. For example, a facial image may be described as "a girl wearing a blue and white striped shirt", forming the basis for meaningful transformation prompts.

Figure 2: Malicious Image Manipulation Pipeline. A threat actor uses generative AI tools to manipulate specific elements of an image, leveraging image translation and understanding models to guide semantic edits. These capabilities facilitate identity obfuscation, impersonation, and disinformation.

artificial intelligence, machine learning, natural language, (18 more...)

2504.19212

Country: North America > United States (0.68)

Genre: Research Report > New Finding (0.93)

Industry:

Media (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.66)

Neural Information Processing Systems Jan-18-2025, 21:03:53 GMT

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

Text-guided image editing is widely needed in daily life, ranging from personal use to professional applications such as Photoshop. However, existing methods are either zero-shot or trained on an automatically synthesized dataset, which contains a high volume of noise. Thus, they still require lots of manual tuning to produce desirable outcomes in practice. To address this issue, we introduce MagicBrush, the first large-scale, manually annotated dataset for instruction-guided real image editing that covers diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises over 10K manually annotated triplets (source image, instruction, target image), which supports training large-scale text-guided image editing models. We fine-tune InstructPix2Pix on MagicBrush and show that the new model can produce much better images according to human evaluation. We further conduct extensive experiments to evaluate current image editing baselines from multiple dimensions including quantitative, qualitative, and human evaluations. The results reveal the challenging nature of our dataset and the gap between current baselines and real-world editing needs.

annotated dataset, instruction-guided image editing, magicbrush, (1 more...)

Industry: Media > Photography (1.00)

Technology: Information Technology > Artificial Intelligence (0.44)

Stefanache, Stefan, Pérez, Lluís Pastor, Watanabe, Julen Costa, Tejedor, Ernesto Sanchez, Hofmann, Thomas, Simsar, Enis

PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM

arXiv.org Artificial Intelligence Oct-8-2024

Evaluating diffusion-based image-editing models is a crucial task in the field of Generative AI. Specifically, it is imperative to assess their capacity to execute diverse editing tasks while preserving the image content and realism. While recent developments in generative models have opened up previously unheard-of possibilities for image editing, conducting a thorough evaluation of these models remains a challenging and open task. The absence of a standardized evaluation benchmark, primarily due to the inherent need for a post-edit reference image for evaluation, further complicates this issue. Currently, evaluations often rely on established models such as CLIP or require human intervention for a comprehensive understanding of the performance of these image editing models. Our benchmark, PixLens, provides a comprehensive evaluation of both edit quality and latent representation disentanglement, contributing to the advancement and refinement of existing methodologies in the field.

category, edit type, evaluation, (15 more...)

2410.0571

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (1.00)

Industry: Media > Photography (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(2 more...)

Regev, Omer, Avrahami, Omri, Lischinski, Dani

Click2Mask: Local Editing with Dynamic Mask Generation

arXiv.org Artificial Intelligence Sep-12-2024

Recent advancements in generative models have revolutionized image generation and editing, making these tasks accessible to non-experts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods often require a precise mask or a detailed description of the location, which can be cumbersome and prone to errors. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). A mask is dynamically grown around this point during a Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based semantic loss. Click2Mask surpasses the limitations of segmentation-based and fine-tuning dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that Click2Mask not only minimizes user effort but also delivers competitive or superior local image manipulation results compared to SoTA methods, according to both human judgement and automatic metrics. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the integration potential of our dynamic mask approach within other editing methods.

click2mask, emu edit, magicbrush, (13 more...)