Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?