How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation