Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark