Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation