Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
