Graphical user interface agents optimization for visual instruction grounding using multi-modal artificial intelligence systems