Rethinking the Reasonability of the Test Set for Simultaneous Machine Translation