rubric
The Human Skill That Eludes AI
Why can't language models write well?

In a certain, strange way, generative AI peaked with OpenAI's GPT-2 seven years ago. Little known to anyone outside of tech circles, GPT-2 excelled at producing unexpected answers. "You could be like, 'Continue this story:,' and GPT-2 would be like, ','" Katy Gero, a poet and computer scientist who has been experimenting with language models since 2017, told me. "The models won't do that anymore." AI leaders boast about their models' superhuman technical abilities.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.56)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > Japan (0.04)
- (7 more...)
Human or AI? Comparing Design Thinking Assessments by Teaching Assistants and Bots
Khan, Sumbul, Liow, Wei Ting, Ang, Lay Kee
ORCID: 0000-0003-2811-1194

Abstract -- As design thinking education grows in secondary and tertiary education, educators face a mounting challenge in evaluating creative artefacts that comprise visual and textual elements. Traditional, rubric-based methods of assessment are laborious, time-consuming, and inconsistent, due to their reliance on Teaching Assistants (TAs) in large, multi-section cohorts. This paper presents an exploratory study investigating the reliability and perceived accuracy of AI-assisted assessment vis-à-vis TA-assisted assessment in evaluating student posters in design thinking education. Two activities were conducted with 33 Ministry of Education (MOE), Singapore school teachers, with two objectives: (1) to compare AI-generated scores with TA grading across three key dimensions (empathy and user understanding, identification of pain points and opportunities, and visual communication), and (2) to understand teacher preferences for AI-assigned, TA-assigned, and hybrid scores. Results showed low statistical agreement between instructor and AI scores for empathy and pain points, though slightly higher alignment for visual communication. Teachers generally preferred TA-assigned scores in six of ten samples. Qualitative feedback highlighted AI's potential for formative feedback, consistency, and student self-reflection, but raised concerns about its limitations in capturing contextual nuance and creative insight. The study underscores the need for hybrid assessment models that integrate computational efficiency with human insights. This research contributes to the evolving conversation around responsible AI adoption in creative disciplines, emphasizing the balance between automation and human judgment for scalable and pedagogically sound assessment practices.

Design thinking is a human-centered approach to innovation that draws from the designer's toolkit to integrate the needs of people, the possibilities of technology, and the requirements for business success. It is a non-linear, iterative process that teams use to understand users, challenge assumptions, redefine problems, and create innovative solutions to prototype and test.
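For readers who want to reproduce this kind of comparison, below is a minimal sketch of an agreement analysis between TA-assigned and AI-assigned rubric scores for one dimension. The score arrays and the choice of Spearman correlation and weighted kappa are illustrative assumptions, not the study's exact procedure or data.

```python
# A minimal sketch of agreement analysis between TA-assigned and AI-assigned
# rubric scores for one dimension (e.g., "empathy and user understanding").
# The scores below are made-up placeholders, not data from the study.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

ta_scores = np.array([3, 4, 2, 5, 3, 4, 2, 3, 5, 4])   # hypothetical TA grades (1-5 rubric)
ai_scores = np.array([4, 4, 3, 3, 2, 4, 2, 4, 4, 3])   # hypothetical AI grades (1-5 rubric)

# Spearman rank correlation: do the two graders order the posters similarly?
rho, p_value = spearmanr(ta_scores, ai_scores)

# Quadratically weighted Cohen's kappa: chance-corrected agreement that
# penalises large score discrepancies more than adjacent ones.
kappa = cohen_kappa_score(ta_scores, ai_scores, weights="quadratic")

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f}), weighted kappa = {kappa:.2f}")
```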
- Asia > Singapore (0.27)
- Europe > Netherlands > South Holland > Delft (0.04)
- Education > Educational Setting > Higher Education (1.00)
- Education > Assessment & Standards (1.00)
- Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.68)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.86)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Examining Student Interactions with a Pedagogical AI-Assistant for Essay Writing and their Impact on Students Writing Quality
Febriantoro, Wicaksono, Zhou, Qi, Suraworachet, Wannapon, Bulathwela, Sahan, Gauthier, Andrea, Millan, Eva, Cukurova, Mutlu
The dynamic nature of interactions between students and GenAI, as well as their relationship to writing quality, remains underexplored. While most research has examined how general-purpose GenAI can support writing, fewer studies have investigated how students interact with pedagogically designed systems across different phases of the writing process. To address this gap, we evaluated a GenAI-driven essay-writing assistant (EWA) designed to support higher education students in argumentative writing. Drawing on 1,282 interaction logs from 32 undergraduates during a two-hour writing session, Sequential Pattern Mining and K-Means clustering were used to identify behavioral patterns. Two clusters emerged: Cluster 1 emphasized outline planning and essay structure, while Cluster 2 focused on content development. A Mann-Whitney U test revealed a moderate effect size (r = 0.36) in the essay Organization dimension, with Cluster 1 showing higher scores. Qualitative analysis indicated that students with better performance actively wrote and shared essay sections with EWA for feedback, rather than interacted passively by asking questions. These findings suggest implications for teaching and system design. Teachers can encourage active engagement, while future EWAs may integrate automatic labeling and monitoring to prompt students to move from questioning to writing, enabling fuller benefits from GenAI-supported learning.
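A minimal sketch of the clustering-and-comparison analysis described in the abstract, assuming a simple action-frequency featurisation of the interaction logs; the feature set and the synthetic data are assumptions, not the authors' pipeline.

```python
# A minimal sketch of the clustering-plus-comparison analysis described above:
# represent each student's interaction log as action-frequency features,
# cluster students with K-Means, then compare essay scores across clusters
# with a Mann-Whitney U test and an r = Z / sqrt(N) effect size.
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import mannwhitneyu, norm

rng = np.random.default_rng(0)
n_students = 32
# Hypothetical per-student counts of actions such as
# [ask_question, request_feedback, edit_outline, share_draft].
features = rng.poisson(lam=[5, 3, 4, 2], size=(n_students, 4)).astype(float)
organization_scores = rng.normal(loc=70, scale=10, size=n_students)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

group_a = organization_scores[labels == 0]
group_b = organization_scores[labels == 1]

u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")

# Convert the two-sided p-value back to a Z score and then to r = Z / sqrt(N),
# the effect-size convention reported in the abstract.
z = norm.isf(p_value / 2)
r_effect = z / np.sqrt(n_students)
print(f"U = {u_stat:.1f}, p = {p_value:.3f}, r = {r_effect:.2f}")
```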
- North America > United States > Texas > Brazos County > College Station (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Middle East > Republic of Türkiye (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Education > Educational Setting (1.00)
- Education > Curriculum > Subject-Specific Education (1.00)
- Education > Educational Technology > Educational Software > Computer Based Training (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.87)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
PPTArena: A Benchmark for Agentic PowerPoint Editing
Ofengenden, Michael, Man, Yunze, Pang, Ziqi, Wang, Yu-Xiong
We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2,125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.
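A rough sketch of a plan-edit-check loop for in-place deck editing, in the spirit of the agent described above. It is not PPTPilot's implementation: plan_edits and check_constraints are hypothetical stand-ins for the LLM planner and verifier, and only a simple deterministic text replacement via python-pptx is shown.

```python
# A minimal plan-edit-check loop over a PowerPoint deck. plan_edits() and
# check_constraints() are placeholders for the LLM planner and verifier.
from pptx import Presentation

def apply_text_edit(prs, slide_index, old, new):
    """Deterministically replace text on one slide."""
    slide = prs.slides[slide_index]
    for shape in slide.shapes:
        if shape.has_text_frame and old in shape.text_frame.text:
            for para in shape.text_frame.paragraphs:
                for run in para.runs:
                    run.text = run.text.replace(old, new)

def plan_edits(instruction):
    # Placeholder planner: a real agent would map the natural-language
    # instruction to a sequence of structured edit operations.
    return [{"slide": 0, "old": "Q3 Results", "new": "Q4 Results"}]

def check_constraints(prs, edits):
    # Placeholder verifier: confirm the intended text now appears on each slide.
    for e in edits:
        texts = [s.text_frame.text for s in prs.slides[e["slide"]].shapes
                 if s.has_text_frame]
        if not any(e["new"] in t for t in texts):
            return False
    return True

def edit_deck(path, instruction, max_rounds=3):
    prs = Presentation(path)
    for _ in range(max_rounds):          # plan-edit-check loop
        edits = plan_edits(instruction)
        for e in edits:
            apply_text_edit(prs, e["slide"], e["old"], e["new"])
        if check_constraints(prs, edits):
            break
    prs.save(path)
```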
- Europe > Austria > Vienna (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- South America > Peru > Loreto Department (0.04)
- (4 more...)
ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment
Masters, Charlie, Grześkiewicz, Marta, Albrecht, Stefano V.
As agents based on large language models are increasingly deployed to long-horizon tasks, maintaining their alignment with stakeholder preferences becomes critical. Effective alignment in such settings requires reward models that are interpretable so that stakeholders can understand and audit model objectives. Moreover, reward models must be capable of steering agents at interaction time, allowing preference shifts to be incorporated without retraining. We introduce ARCANE, a framework that frames alignment as a multi-agent collaboration problem that dynamically represents stakeholder preferences as natural-language rubrics: weighted sets of verifiable criteria that can be generated on-the-fly from task context. Inspired by utility theory, we formulate rubric learning as a reconstruction problem and apply a regularized Group-Sequence Policy Optimization (GSPO) procedure that balances interpretability, faithfulness, and computational efficiency. Using a corpus of 219 labeled rubrics derived from the GDPVal benchmark, we evaluate ARCANE on challenging tasks requiring multi-step reasoning and tool use. The learned rubrics produce compact, legible evaluations and enable configurable trade-offs (e.g., correctness vs. conciseness) without retraining. Our results show that rubric-based reward models offer a promising path toward interpretable, test-time adaptive alignment for complex, long-horizon AI systems.
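A minimal sketch of the rubric representation the abstract describes, a weighted set of verifiable criteria reduced to a scalar reward; the example criteria and checker functions are illustrative assumptions, not ARCANE's learned rubrics.

```python
# A rubric as a weighted set of verifiable criteria, scored as a weighted
# fraction of satisfied checks. Criteria here are toy examples.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    description: str                      # natural-language, human-auditable
    weight: float                         # stakeholder-configurable importance
    check: Callable[[str], bool]          # verifier over the agent's output

def rubric_reward(output: str, rubric: List[Criterion]) -> float:
    """Weighted fraction of satisfied criteria, normalised to [0, 1]."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total if total else 0.0

# Example: trade off correctness vs. conciseness by re-weighting at test time,
# without retraining the underlying reward model.
rubric = [
    Criterion("Mentions the requested figure", 2.0, lambda o: "figure" in o.lower()),
    Criterion("Stays under 50 words", 1.0, lambda o: len(o.split()) <= 50),
]
print(rubric_reward("The figure shows a steady rise in Q4 revenue.", rubric))
```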
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
LLM-based Automated Grading with Human-in-the-Loop
Chu, Yucheng, Li, Hang, Yang, Kaiqi, Copur-Gencturk, Yasemin, Tang, Jiliang
The rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has brought significant advancements to the field of education. Among various applications, automatic short answer grading (ASAG), which focuses on evaluating open-ended textual responses, has seen remarkable progress with the introduction of LLMs. These models not only enhance grading performance compared to traditional ASAG approaches but also move beyond simple comparisons with predefined "golden" answers, enabling more sophisticated grading scenarios, such as rubric-based evaluation. However, existing LLM-powered methods still face challenges in achieving human-level grading performance in rubric-based assessments due to their reliance on fully automated approaches. In this work, we explore the potential of LLMs in ASAG tasks by leveraging their interactive capabilities through a human-in-the-loop (HITL) approach. Our proposed framework, GradeHITL, utilizes the generative properties of LLMs to pose questions to human experts, incorporating their insights to refine grading rubrics dynamically. This adaptive process significantly improves grading accuracy, outperforming existing methods and bringing ASAG closer to human-level evaluation.
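A minimal sketch of a human-in-the-loop grading loop of the kind described above, assuming placeholder llm() and ask_expert() interfaces; it illustrates the question-then-refine pattern, not GradeHITL's actual prompts or implementation.

```python
# Grade with the current rubric, let the model pose a clarifying question to a
# human expert, and fold the answer back into the rubric. llm() and
# ask_expert() are hypothetical placeholders.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def ask_expert(question: str) -> str:
    return input(f"[expert] {question}\n> ")   # stand-in for expert annotation

def grade_with_hitl(answer: str, rubric: str, rounds: int = 2) -> str:
    for _ in range(rounds):
        grade = llm(f"Rubric:\n{rubric}\n\nStudent answer:\n{answer}\n\n"
                    "Assign a score and flag any rubric ambiguity.")
        question = llm(f"Given this grading attempt:\n{grade}\n"
                       "Ask the human expert one question that would most "
                       "reduce ambiguity in the rubric, or reply DONE.")
        if question.strip() == "DONE":
            return grade
        clarification = ask_expert(question)
        rubric = llm(f"Revise this rubric:\n{rubric}\n"
                     f"using the expert clarification:\n{clarification}")
    return llm(f"Rubric:\n{rubric}\n\nStudent answer:\n{answer}\n\nFinal score:")
```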
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Michigan (0.05)
Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
Lee, Jinu, On, Kyoung-Woon, Han, Simeng, Cohan, Arman, Hockenmaier, Julia
Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. We verify the reliability of these rubrics via human expert annotations and comparison with coarse, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs' legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and RL with rubrics bring complementary benefits for legal reasoning abilities, where RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.
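A minimal sketch of an issue-tree rubric and a coverage/correctness score over a reasoning trace; the substring-matching heuristic and toy tree structure are assumptions, far simpler than the dataset's expert-built rubrics and judging setup.

```python
# An issue tree as a rubric: each node pairs a legal issue with the court's
# conclusion, and a trace is scored for how many issues it reaches (coverage)
# and how many of those it resolves correctly (correctness).
from dataclasses import dataclass, field
from typing import List

@dataclass
class IssueNode:
    issue: str                  # legal issue raised by a party
    conclusion: str             # court's conclusion on that issue
    children: List["IssueNode"] = field(default_factory=list)

def flatten(node: IssueNode) -> List[IssueNode]:
    return [node] + [n for c in node.children for n in flatten(c)]

def score_trace(trace: str, root: IssueNode) -> dict:
    nodes = flatten(root)
    covered = [n for n in nodes if n.issue.lower() in trace.lower()]
    correct = [n for n in covered if n.conclusion.lower() in trace.lower()]
    return {
        "coverage": len(covered) / len(nodes),                # issues the trace reaches
        "correctness": len(correct) / max(len(covered), 1),   # conclusions it gets right
    }
```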
- Europe > Austria > Vienna (0.14)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Asia > Singapore (0.04)
- (8 more...)