grader
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Massachusetts (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
Human or AI? Comparing Design Thinking Assessments by Teaching Assistants and Bots
Khan, Sumbul, Liow, Wei Ting, Ang, Lay Kee
ORCID: 0000-0003-2811-1194
Abstract: As design thinking education grows in secondary and tertiary education, educators face a mounting challenge in evaluating creative artefacts that combine visual and textual elements. Traditional, rubric-based methods of assessment are laborious, time-consuming, and inconsistent, due to their reliance on Teaching Assistants (TAs) in large, multi-section cohorts. This paper presents an exploratory study investigating the reliability and perceived accuracy of AI-assisted assessment vis-à-vis TA-assisted assessment in evaluating student posters in design thinking education. Two activities were conducted with 33 Ministry of Education (MOE), Singapore school teachers, with the objectives of (1) comparing AI-generated scores with TA grading across three key dimensions: empathy and user understanding, identification of pain points and opportunities, and visual communication, and (2) understanding teacher preferences for AI-assigned, TA-assigned, and hybrid scores. Results showed low statistical agreement between instructor and AI scores for empathy and pain points, though slightly higher alignment for visual communication. Teachers generally preferred TA-assigned scores in six of ten samples. Qualitative feedback highlighted AI's potential for formative feedback, consistency, and student self-reflection, but raised concerns about its limitations in capturing contextual nuance and creative insight. The study underscores the need for hybrid assessment models that integrate computational efficiency with human insight. This research contributes to the evolving conversation around responsible AI adoption in creative disciplines, emphasizing the balance between automation and human judgment for scalable and pedagogically sound assessment practices. Design thinking is a human-centered approach to innovation that draws from the designer's toolkit to integrate the needs of people, the possibilities of technology, and the requirements for business success. It is a non-linear, iterative process that teams use to understand users, challenge assumptions, redefine problems, and create innovative solutions to prototype and test.
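A minimal sketch of how TA-vs-AI agreement on ordinal rubric scores could be quantified per dimension; the score arrays are illustrative placeholders, and the choice of weighted kappa plus Spearman correlation is an assumption, since the abstract does not name its statistic:

```python
# Quantifying TA-vs-AI score agreement per rubric dimension.
# Scores below are placeholders, not the study's data.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

dimensions = {
    "empathy":     (np.array([3, 4, 2, 5, 3]), np.array([4, 4, 3, 3, 2])),
    "pain_points": (np.array([2, 3, 4, 4, 5]), np.array([3, 2, 3, 5, 4])),
    "visual_comm": (np.array([4, 5, 3, 4, 2]), np.array([4, 5, 3, 3, 2])),
}

for name, (ta, ai) in dimensions.items():
    kappa = cohen_kappa_score(ta, ai, weights="quadratic")  # ordinal 1-5 scale
    rho, _ = spearmanr(ta, ai)
    print(f"{name}: weighted kappa={kappa:.2f}, Spearman rho={rho:.2f}")
```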
- Asia > Singapore (0.27)
- Europe > Netherlands > South Holland > Delft (0.04)
- Education > Educational Setting > Higher Education (1.00)
- Education > Assessment & Standards (1.00)
- Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.68)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.86)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
LLM-based Automated Grading with Human-in-the-Loop
Chu, Yucheng, Li, Hang, Yang, Kaiqi, Copur-Gencturk, Yasemin, Tang, Jiliang
The rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has brought significant advancements to the field of education. Among various applications, automatic short answer grading (ASAG), which focuses on evaluating open-ended textual responses, has seen remarkable progress with the introduction of LLMs. These models not only enhance grading performance compared to traditional ASAG approaches but also move beyond simple comparisons with predefined "golden" answers, enabling more sophisticated grading scenarios, such as rubric-based evaluation. However, existing LLM-powered methods still face challenges in achieving human-level grading performance in rubric-based assessments due to their reliance on fully automated approaches. In this work, we explore the potential of LLMs in ASAG tasks by leveraging their interactive capabilities through a human-in-the-loop (HITL) approach. Our proposed framework, GradeHITL, utilizes the generative properties of LLMs to pose questions to human experts, incorporating their insights to refine grading rubrics dynamically. This adaptive process significantly improves grading accuracy, outperforming existing methods and bringing ASAG closer to human-level evaluation.
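A minimal sketch of a GradeHITL-style refinement loop; `llm.grade`, `llm.pose_clarifying_question`, `expert.answer`, and `llm.apply_expert_insight` are hypothetical stand-ins for components the abstract does not specify:

```python
# Rubric refinement with a human expert in the loop.
def grade_with_hitl(llm, rubric, answers, expert, max_rounds=3):
    for _ in range(max_rounds):
        grades = [llm.grade(answer, rubric) for answer in answers]
        # The LLM generates a question about the rubric cases it is least sure of.
        question = llm.pose_clarifying_question(rubric, grades)
        if question is None:                       # model is confident; stop refining
            break
        insight = expert.answer(question)          # human-in-the-loop step
        rubric = llm.apply_expert_insight(rubric, insight)  # refine rubric dynamically
    return grades, rubric
```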
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Michigan (0.05)
LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation
Byun, Grace, Rajwal, Swati, Choi, Jinho D.
Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
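A minimal sketch of the two reported metrics, assuming Pearson correlation (the abstract says only "correlation") and using placeholder scores rather than the course data:

```python
# Correlation and exact-agreement rate between TA and LLM grades.
import numpy as np
from scipy.stats import pearsonr

ta_scores  = np.array([10, 8, 9, 7, 10, 6, 9, 8])   # placeholder TA grades
llm_scores = np.array([10, 8, 8, 7, 10, 7, 9, 8])   # placeholder GPT-4o grades

r, _ = pearsonr(ta_scores, llm_scores)
exact = np.mean(ta_scores == llm_scores)             # fraction of identical scores
print(f"correlation={r:.2f}, exact agreement={exact:.0%}")
```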
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Towards Robust Mathematical Reasoning
Luong, Thang, Hwang, Dawsen, Nguyen, Hoang H., Ghiasi, Golnaz, Chervonyi, Yuri, Seo, Insuk, Kim, Junsu, Bingham, Garrett, Lee, Jonathan, Mishra, Swaroop, Zhai, Alex, Hu, Clara Huiyi, Michalewski, Henryk, Kim, Jimin, Ahn, Jeonghyun, Bae, Junhwi, Song, Xingyou, Trinh, Trieu H., Le, Quoc V., Jung, Junehyuk
Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists, that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO-level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4%, respectively. We also show that autograders built with Gemini reasoning correlate well with human evaluations, and we construct IMO-GradingBench, with 1000 human gradings of proofs, to enable further progress in the automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning, and we release it at https://imobench.github.io/.
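A minimal sketch of the two grading modes the benchmark suite implies: normalized string matching for verifiable short answers, and an LLM autograder applying per-problem grading guidelines to proofs; `autograder` is a hypothetical callable, not the released Gemini-based grader:

```python
# Two grading modes: exact-match short answers vs. guideline-based proof grading.
def grade_short_answer(model_answer: str, gold_answer: str) -> bool:
    # Normalize whitespace and case before comparing verifiable short answers.
    norm = lambda s: s.strip().lower().replace(" ", "")
    return norm(model_answer) == norm(gold_answer)

def grade_proof(autograder, proof: str, guidelines: str) -> float:
    # Returns a score in [0, 1] following the problem's grading guidelines.
    return autograder(proof=proof, guidelines=guidelines)
```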
- North America > United States (0.04)
- Europe > Austria (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
- (2 more...)
Switchboard-Affect: Emotion Perception Labels from Conversational Speech
Romana, Amrit, Narain, Jaya, Tran, Tien Dung, Davis, Andrea, Fong, Jason, Rasipuram, Ramya, Mitra, Vikramjit
Abstract: Understanding the nuances of speech emotion dataset curation and labeling is essential for assessing speech emotion recognition (SER) model potential in real-world applications. Most training and evaluation datasets contain acted or pseudo-acted speech (e.g., podcast speech) in which emotion expressions may be exaggerated or otherwise intentionally modified. Furthermore, datasets labeled based on crowd perception often lack transparency regarding the guidelines given to annotators. These factors make it difficult to understand model performance and pinpoint necessary areas for improvement. To address this gap, we identified the Switchboard corpus as a promising source of naturalistic conversational speech, and we trained a crowd to label the dataset for categorical emotions (anger, contempt, disgust, fear, sadness, surprise, happiness, tenderness, calmness, and neutral) and dimensional attributes (activation, valence, and dominance). We refer to this label set as Switchboard-Affect (SWB-Affect). In this work, we present our approach in detail, including the definitions provided to annotators and an analysis of the lexical and paralinguistic cues that may have played a role in their perception. In addition, we evaluate state-of-the-art SER models, and we find variable performance across the emotion categories, with especially poor generalization for anger. These findings underscore the importance of evaluation with datasets that capture natural affective variations in speech. We release the labels for SWB-Affect to enable further analysis in this domain. Speech emotion recognition (SER) has the potential to enhance human-computer interaction, improve our ability to monitor mental health and well-being [1], [2], and better understand customer service, entertainment, and education experiences [3], [4].
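A minimal sketch of the per-category evaluation the abstract motivates, with placeholder labels and predictions rather than SWB-Affect data:

```python
# Per-class metrics expose emotion categories that generalize poorly (e.g., anger).
from sklearn.metrics import classification_report

labels = ["anger", "happiness", "neutral", "anger", "sadness", "neutral"]
preds  = ["neutral", "happiness", "neutral", "happiness", "sadness", "neutral"]

# Per-class precision/recall/F1 rather than a single aggregate accuracy.
print(classification_report(labels, preds, zero_division=0))
```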
- North America > United States (0.05)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > Canada > Quebec (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Emotion (1.00)
- Information Technology > Artificial Intelligence > Speech (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Online Rubrics Elicitation from Pairwise Comparisons
Rezaei, MohammadHossein, Vacareanu, Robert, Wang, Zihao, Wang, Clinton, Liu, Bing, He, Yunzhong, Akyürek, Afra Feyza
Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.
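A minimal sketch of an OnlineRubrics-style update step; `llm.elicit_criteria` and the policy objects are hypothetical stand-ins, since the abstract does not give the elicitation prompt or how criteria feed into the reward:

```python
# Dynamically grow the rubric set from pairwise response comparisons.
def update_rubrics(llm, rubrics, prompt, current_policy, reference_policy):
    resp_cur = current_policy.generate(prompt)
    resp_ref = reference_policy.generate(prompt)
    # Elicit criteria that distinguish the better response from the worse one,
    # surfacing emergent desiderata (and reward hacks) as training proceeds.
    new_criteria = llm.elicit_criteria(prompt, resp_cur, resp_ref)
    # Deduplicate so the rubric set grows only with genuinely new criteria.
    rubrics.extend(c for c in new_criteria if c not in rubrics)
    return rubrics
```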
- Europe > Austria > Vienna (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > United States > Arizona (0.04)
- Asia > China (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
Patwardhan, Tejal, Dias, Rachel, Proehl, Elizabeth, Kim, Grace, Wang, Michele, Watkins, Olivia, Fishman, Simón Posada, Aljubeh, Marwan, Thacker, Phoebe, Fauconnet, Laurance, Kim, Natalie S., Chao, Patrick, Miserendino, Samuel, Chabot, Gildas, Li, David, Sharman, Michael, Barr, Alexandra, Glaese, Amelia, Tworek, Jerry
We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improves model performance on GDPval. Finally, we open-source a gold subset of 220 tasks and provide a public automated grading service at evals.openai.com to facilitate future research in understanding real-world model capabilities.
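A minimal sketch of the cheaper-and-faster comparison the abstract describes (model plus human oversight versus an unaided expert); all rates and hours are illustrative assumptions, not GDPval results:

```python
# Expected-cost comparison: unaided expert vs. model output with human review.
expert_hours, expert_rate = 8.0, 100.0   # assumed unaided-expert time and hourly rate
review_hours, accept_rate = 1.0, 0.7     # assumed review time and acceptance rate

# If the reviewer rejects the model's deliverable, the expert redoes the task.
oversight_cost = (review_hours * expert_rate
                  + (1 - accept_rate) * expert_hours * expert_rate)
print(f"expert alone: ${expert_hours * expert_rate:.0f}, "
      f"model + oversight (expected): ${oversight_cost:.0f}")
```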
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts (0.04)
- North America > United States > California (0.04)
- (3 more...)
- Research Report (0.64)
- Workflow (0.46)
Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning
Ye, Zhiling, Yue, Yun, Wang, Haowen, Han, Xudong, Jiang, Jiadi, Wei, Cheng, Fan, Lei, Liang, Jiaxin, Zhang, Shuowen, Li, Ji, Guo, Chunxiao, Wang, Jian, Wei, Peng, Gu, Jinjie
Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader and generating rubric-based reward signals substantially improves reasoning performance. Remarkably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, a lightweight framework that enables faster and more resource-efficient training while surpassing baselines. Notably, on Qwen3-32B, training with just the 4000-sample HealthBench Easy subset is sufficient to obtain a model that exceeds GPT-5 on HealthBench Hard. Incorporating a small amount of teacher-graded data further enhances performance for less capable models.
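A minimal sketch of a self-rewarding rubric reward, assuming a hypothetical `model.judge` call: the policy model checks its own response against each rubric item, and the fraction satisfied becomes the scalar RL reward:

```python
# The model grades its own response against the rubric to produce a reward.
def rubric_reward(model, prompt, response, rubric_items):
    satisfied = sum(
        model.judge(prompt, response, criterion)   # hypothetical; returns True/False
        for criterion in rubric_items
    )
    return satisfied / len(rubric_items)           # scalar reward in [0, 1]
```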
- Europe > France (0.05)
- North America > United States > Massachusetts (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (6 more...)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Public Health (0.67)
- Government (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)
The NTNU System at the S&I Challenge 2025 SLA Open Track
Lin, Hong-Yun, Lo, Tien-Hong, Fang, Yu-Hsuan, Lin, Jhen-Ke, Wang, Chung-Chun, Lu, Hao-Chien, Chen, Berlin
A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
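A minimal sketch of late score fusion plus the challenge's RMSE metric; the fusion weight and the scores are illustrative assumptions, as the abstract does not give the exact strategy:

```python
# Late fusion of W2V and MLLM proficiency scores, evaluated by RMSE.
import numpy as np

def fuse_scores(w2v_scores, mllm_scores, alpha=0.5):
    # alpha is an assumed fusion weight, not the system's tuned value.
    return alpha * np.asarray(w2v_scores) + (1 - alpha) * np.asarray(mllm_scores)

def rmse(pred, gold):
    pred, gold = np.asarray(pred), np.asarray(gold)
    return float(np.sqrt(np.mean((pred - gold) ** 2)))

fused = fuse_scores([3.5, 4.0, 2.5], [3.0, 4.5, 3.0])  # placeholder scores
print(rmse(fused, [3.2, 4.2, 2.8]))                     # placeholder gold labels
```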
- Asia > Taiwan (0.05)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (2 more...)