A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions