A Simple Aerial Detection Baseline of Multimodal Language Models