Towards Mitigating Spurious Correlations in the Wild: A Benchmark and a more Realistic Dataset