Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations