Counterfactual Evolution of Multimodal Datasets via Visual Programming
–Neural Information Processing Systems
The rapid development of Multimodal Large Language Models (MLLMs) poses increasing demands on the diversity and complexity of multimodal datasets. Yet manual annotation pipelines can no longer keep pace. Existing augmentation methods often follow fixed rules and lack verifiable control over sample diversity and reasoning complexity. To address this, we introduce Scalable COunterfactual Program Evolution (SCOPE), a framework that uses symbolic Visual Programming to guide program evolution via counterfactual reasoning. SCOPE performs the three steps of counterfactual inference: (1) Abduction, by generating verifiable programs to model reasoning associations; (2) Action, by intervening on program structure along three axes--reasoning path, visual context, and cross-instance composition; and (3) Prediction, by categorizing evolved instances by difficulty, structure, and input multiplicity. Based on this process, we build SCOPE-Train and SCOPE-Test, evolving benchmarks with expert validation. To support training, we propose MAP, a curriculum learning strategy that aligns model capacity with sample difficulty. Experiments show that SCOPEimproves reasoning performance, exposes model blind spots, and enhances visual dialog capabilities.
Neural Information Processing Systems
Jun-18-2026, 14:02:18 GMT
- Country:
- Asia (0.28)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.87)
- Research Report
- Technology:
- Information Technology
- Software > Programming Languages (1.00)
- Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning (1.00)
- Machine Learning > Neural Networks (0.93)
- Natural Language > Large Language Model (0.89)
- Information Technology