VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks