Do Contemporary CATE Models Capture Real-World Heterogeneity? Findings from a Large-Scale Benchmark