Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks

Open in new window