
I feel like supplying the target image as a reference kind of defeats the purpose and makes the task significantly easier. The target image should be something more semantic, like a canonical microwave, not an image of the actual microwave that exists in the scene.