Aligning Black-box Language Models with Human Judgments