Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks