Bayesian Evaluation of Large Language Model Behavior