CAESAR: An Embodied Simulator for Generating Multimodal Referring Expression Datasets