autograder


Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates

McCoy, Liam G., Haredasht, Fateme Nateghi, Chopra, Kanav, Wu, David, Wu, David JH, Conteh, Abass, Khemani, Sarita, Maharaj, Saloni Kumar, Ravi, Vishnu, Pahwa, Arth, Weng, Yingjie, Rosengaus, Leah, Giang, Lena, Li, Kelvin Zhenghao, Jee, Olivia, Shirvani, Daniel, Goh, Ethan, Chen, Jonathan H.

arXiv.org Artificial Intelligence

This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert-crafted templates developed and routinely used by Stanford's eConsult team, we assess frontier models -- including o3, GPT-4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro -- for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative-driven fields such as psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance structured clinical information exchange between physicians, while highlighting the need for more robust evaluation methods that capture a model's ability to prioritize clinically salient information within the time constraints of real-world physician communication.


Skewed Score: A statistical framework to assess autograders

Dubois, Magda, Coppock, Harry, Giulianelli, Mario, Flesch, Timo, Luettgau, Lennart, Ududec, Cozmin

arXiv.org Machine Learning

The evaluation of large language model (LLM) outputs is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge", or autograders. While autograders offer a scalable alternative to human evaluation, they have shown mixed reliability and may exhibit systematic biases, depending on response type, scoring methodology, domain specificity, or other factors. Here we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to simultaneously assess their autograders while addressing their primary research questions (e.g., LLM evaluation). Our approach models evaluation outcomes (e.g., scores or pairwise preferences) as a function of properties of the grader (e.g., human vs. autograder) and the evaluated item (e.g., response length or the LLM that generated it), allowing for explicit quantification of scoring differences and potential biases within a unified framework. In addition, our method can be used to augment traditional metrics such as inter-rater agreement, by providing uncertainty estimates and clarifying sources of disagreement. Overall, this approach contributes to more robust and interpretable use of autograders in LLM evaluation, enabling both performance analysis and bias detection.
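The modeling idea can be illustrated with a minimal sketch: regress evaluation scores on a grader-type covariate so that systematic autograder bias appears as a coefficient. This is a simplified frequentist OLS analogue of the paper's Bayesian GLMs, using simulated data; the variable names and effect sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Simulated evaluation data: each row is one graded response.
is_autograder = rng.integers(0, 2, n)    # 1 = LLM judge, 0 = human grader
resp_length = rng.normal(0.0, 1.0, n)    # standardized response length
true_quality = rng.normal(0.0, 1.0, n)   # latent response quality

# Suppose the autograder systematically inflates scores by 0.5
# and both grader types slightly reward longer responses.
score = (true_quality + 0.5 * is_autograder
         + 0.2 * resp_length + rng.normal(0, 0.3, n))

# Design matrix: intercept, grader type, response length, quality proxy.
X = np.column_stack([np.ones(n), is_autograder, resp_length, true_quality])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

print(f"estimated autograder bias: {beta[1]:.2f}")  # recovers ~0.5
print(f"estimated length effect:   {beta[2]:.2f}")  # recovers ~0.2
```

A Bayesian fit (as in the paper) would additionally yield posterior uncertainty for each coefficient rather than a point estimate.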


Generating Planning Feedback for Open-Ended Programming Exercises with LLMs

Demirtaş, Mehmet Arif, Zheng, Claire, Fowler, Max, Cunningham, Kathryn

arXiv.org Artificial Intelligence

To complete an open-ended programming exercise, students need to both plan a high-level solution and implement it using the appropriate syntax. However, these problems are often autograded on the correctness of the final submission through test cases, and students cannot get feedback on their planning process. Large language models (LLMs) may be able to generate this feedback by detecting the overall code structure even for submissions with syntax errors. To this end, we propose an approach that detects which high-level goals and patterns (i.e., programming plans) exist in a student program with LLMs. We show that both the full GPT-4o model and a small variant (GPT-4o-mini) can detect these plans with remarkable accuracy, outperforming baselines inspired by conventional approaches to code analysis. We further show that the smaller, cost-effective variant (GPT-4o-mini) achieves results on par with state-of-the-art (GPT-4o) after fine-tuning, creating promising implications for smaller models for real-time grading. These smaller models can be incorporated into autograders for open-ended code-writing exercises to provide feedback for students' implicit planning skills, even when their program is syntactically incorrect. Furthermore, LLMs may be useful in providing feedback for problems in other domains where students start with a set of high-level solution steps and iteratively compute the output, such as math and physics problems.
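To make the contrast concrete, here is a sketch of the kind of conventional code-analysis baseline the LLM approach is compared against: a static AST check for one hypothetical "accumulator" plan. The plan name and heuristic are invented for illustration; note that, unlike an LLM, this baseline fails outright on syntactically invalid submissions.

```python
import ast

def detects_accumulator_plan(source: str) -> bool:
    """Static check for an 'accumulator' plan: a for-loop whose body
    augments (+=) a variable that was initialized before the loop.
    Returns False on syntax errors, unlike an LLM-based detector."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    initialized = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    initialized.add(target.id)
        if isinstance(node, ast.For):
            for stmt in ast.walk(node):
                if (isinstance(stmt, ast.AugAssign)
                        and isinstance(stmt.target, ast.Name)
                        and stmt.target.id in initialized):
                    return True
    return False

student_code = "total = 0\nfor x in nums:\n    total += x\n"
print(detects_accumulator_plan(student_code))                  # True
print(detects_accumulator_plan("for x in nums total += x"))    # False: syntax error
```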


Autograding Mathematical Induction Proofs with Natural Language Processing

Zhao, Chenyan, Silva, Mariana, Poulsen, Seth

arXiv.org Artificial Intelligence

Writing mathematical proofs has been identified as an important [1-3] and yet challenging topic [4] in computing education and mathematics education. A large body of research has shown that timely feedback is crucial to student learning [5, 6]. However, students are largely unable to receive timely feedback on written proofs due to the need to have proofs collected and hand-graded by instructors or teaching assistants. The ability to grade student proofs fully automatically with natural language processing (NLP) alleviates this need by allowing us to give students instant feedback on their proofs to let students iteratively enhance the quality of their proofs. In this paper, we propose a novel set of training methods and models capable of autograding freeform mathematical proofs, a problem at the intersection of mathematical proof education and Automatic Short Answer Grading (ASAG), by using existing NLP models and other machine learning techniques. Our proof autograder enables the development of grading systems that provide instant feedback to students without needing attention from instructors. It can also be deployed in large-scale educational platforms, allowing for more access for students. The main contributions of this paper are:
- Introducing the first pipeline of machine learning models capable of autograding mathematical proofs with similar accuracy to human graders
- Quantifying the amount of training data needed to achieve satisfactory performance from the grading models
- Publishing an anonymized and labeled mathematical proof dataset that can be used in future model developments [7]
- Creating a set of autograded problems using the grading pipeline, and performing a user study that answers the following research questions: Are students able to write better proofs by interacting with the autograder and the feedback it generates?
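The ASAG framing can be illustrated with a toy sketch: score a student answer by bag-of-words cosine similarity to a reference answer. This is a deliberately simple stand-in, not the paper's trained models, and the example texts are invented.

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two short texts (0 to 1)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

reference = "base case holds for n equals 1 assume true for n show for n plus 1"
student = "the base case holds for n equals 1 and we assume it is true for n"
print(round(cosine_sim(reference, student), 2))
```

A real ASAG pipeline, as in the paper, replaces surface word overlap with learned semantic representations, precisely because proofs can be phrased in many equivalent ways.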


A StrongREJECT for Empty Jailbreaks

Souly, Alexandra, Lu, Qingyuan, Bowen, Dillon, Trinh, Tu, Hsieh, Elvis, Pandey, Sana, Abbeel, Pieter, Svegliato, Justin, Emmons, Scott, Watkins, Olivia, Toyer, Sam

arXiv.org Artificial Intelligence

The rise of large language models (LLMs) has drawn attention to the existence of "jailbreaks" that allow the models to be used maliciously. However, there is no standard benchmark for measuring the severity of a jailbreak, leaving authors of jailbreak papers to create their own. We show that these benchmarks often include vague or unanswerable questions and use grading criteria that are biased towards overestimating the misuse potential of low-quality model responses. Some jailbreak techniques make the problem worse by decreasing the quality of model responses even on benign questions: we show that several jailbreaking techniques substantially reduce the zero-shot performance of GPT-4 on MMLU. Jailbreaks can also make it harder to elicit harmful responses from an "uncensored" open-source model. We present a new benchmark, StrongREJECT, which better discriminates between effective and ineffective jailbreaks by using a higher-quality question set and a more accurate response grading algorithm. We show that our new grading scheme better accords with human judgment of response quality and overall jailbreak effectiveness, especially on the sort of low-quality responses that contribute the most to over-estimation of jailbreak performance on existing benchmarks. We release our code and data at https://github.com/alexandrasouly/strongreject.


An Analysis of Programming Course Evaluations Before and After the Introduction of an Autograder

Hagerer, Gerhard Johann, Lahesoo, Laura, Anschütz, Miriam, Krusche, Stephan, Groh, Georg

arXiv.org Artificial Intelligence

Commonly, introductory programming courses in higher education institutions have hundreds of participating students eager to learn to program. The manual effort for reviewing the submitted source code and for providing feedback can no longer be managed. Manually reviewing the submitted homework can be subjective and unfair, particularly if many tutors are responsible for grading. Different autograders can help in this situation; however, there is a lack of knowledge about how autograders can impact students' overall perception of programming classes and teaching. This is relevant for course organizers and institutions that want to keep their programming courses attractive while coping with growing enrollment. This paper studies the answers to the standardized university evaluation questionnaires of multiple large-scale foundational computer science courses which recently introduced autograding. The differences before and after this intervention are analyzed. By incorporating additional observations, we hypothesize how the autograder might have contributed to the significant changes in the data, such as improved interactions between tutors and students, improved overall course quality, improved learning success, increased time spent, and reduced difficulty. This qualitative study aims to provide hypotheses for future research to define and conduct quantitative surveys and data analysis. Such research could validate autograding as a teaching method that improves student satisfaction with programming courses.


Berkeley AI Materials

#artificialintelligence

In this project, you will implement value iteration and Q-learning. You will test your agents first on Gridworld (from class), then apply them to a simulated robot controller (Crawler) and Pacman. As in previous projects, this project includes an autograder for you to grade your solutions on your machine. See the autograder tutorial in Project 0 for more information about using the autograder. Files to Edit and Submit: You will fill in portions of valueIterationAgents.py, qlearningAgents.py,
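As a warm-up for the value iteration part, here is a standalone sketch on a toy one-dimensional gridworld. This is not the project's Gridworld API or file structure; the states, rewards, and discount factor are invented for illustration.

```python
# Minimal value iteration on a toy 1-D gridworld.
# States 0..4; reaching state 4 yields reward +1 and ends the episode.
# Actions are deterministic moves left (-1) or right (+1).
GAMMA = 0.9
N = 5

def step(s, a):
    """Transition function: returns (next_state, reward)."""
    s2 = max(0, min(N - 1, s + a))
    return s2, (1.0 if s2 == N - 1 else 0.0)

V = [0.0] * N
for _ in range(100):  # run Bellman optimality backups to convergence
    V = [0.0 if s == N - 1 else
         max(r + GAMMA * V[s2] for s2, r in (step(s, -1), step(s, +1)))
         for s in range(N)]

print([round(v, 3) for v in V])  # values increase toward the goal state
```

The project's `valueIterationAgents.py` implements the same Bellman backup, but over the MDP interface the course code provides rather than a hard-coded `step` function.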