FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation
–Neural Information Processing Systems
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat "what to see" and "how to edit" separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentationaware perception and controllable object-centric generation within an end-to-end framework. FOCUS employs a dual-branch visual encoder to simultaneously capture global semantic context and fine-grained spatial details. In addition, we leverage a MoVQGAN-based visual tokenizer to produce discrete visual tokens that enhance generation quality. To enable accurate and controllable image editing, we propose a progressive multi-stage training pipeline, where segmentation masks are jointly optimized and used as spatial condition prompts to guide the diffusion decoder.
Neural Information Processing Systems
Jun-16-2026, 01:36:12 GMT
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.93)
- Research Report
- Industry:
- Media > Photography (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning (1.00)
- Natural Language
- Text Processing (1.00)
- Large Language Model (1.00)
- Chatbot (1.00)
- Machine Learning > Neural Networks
- Deep Learning (0.67)
- Information Technology > Artificial Intelligence