Overconfident Oracles: Limitations of In Silico Sequence Design Benchmarking