Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks