Identifying and Benchmarking Natural Out-of-Context Prediction Problems