Holistic chemical evaluation reveals pitfalls in reaction prediction models