Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation