Generative AI
RATAS framework: A comprehensive GenAI-based approach to rubric-based marking of real-world textual exams
Safilian, Masoud, Beheshti, Amin, Elbourn, Stephen
Automated answer grading is a critical challenge in educational technology, with the potential to streamline assessment processes, ensure grading consistency, and provide timely feedback to students. However, existing approaches are often constrained to specific exam formats, lack interpretability in score assignment, and struggle with real-world applicability across diverse subjects and assessment types. To address these limitations, we introduce RATAS (Rubric Automated Tree-based Answer Scoring), a novel framework that leverages state-of-the-art generative AI models for rubric-based grading of textual responses. RATAS is designed to support a wide range of grading rubrics, enable subject-agnostic evaluation, and generate structured, explainable rationales for assigned scores. We formalize the automatic grading task through a mathematical framework tailored to rubric-based assessment and present an architecture capable of handling complex, real-world exam structures. To rigorously evaluate our approach, we construct a unique, contextualized dataset derived from real-world project-based courses, encompassing diverse response formats and varying levels of complexity. Empirical results demonstrate that RATAS achieves high reliability and accuracy in automated grading while providing interpretable feedback that enhances transparency for both students and instructors.
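As a rough illustration of what tree-based rubric scoring could look like, the sketch below models a rubric as a tree whose leaves carry a maximum mark, an awarded mark, and a rationale, and whose internal nodes sum their children. The class name `RubricNode`, its fields, and the sample rubric are hypothetical and are not taken from the RATAS paper; in the framework itself, leaf scores and rationales would presumably come from a generative model rather than being hard-coded.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RubricNode:
    """Hypothetical rubric node: a leaf holds marks, an internal node holds children."""
    name: str
    max_marks: float = 0.0            # used only by leaves
    awarded: Optional[float] = None   # leaf score (would come from a model)
    rationale: str = ""               # short explanation attached to a leaf
    children: List["RubricNode"] = field(default_factory=list)

    def score(self) -> Tuple[float, float]:
        """Return (awarded, maximum) for the subtree rooted at this node."""
        if not self.children:
            if self.awarded is None:
                raise ValueError(f"leaf '{self.name}' has not been scored")
            # Clamp so a leaf can never go below zero or above its maximum.
            return min(max(self.awarded, 0.0), self.max_marks), self.max_marks
        totals = [child.score() for child in self.children]
        return sum(a for a, _ in totals), sum(m for _, m in totals)

    def report(self, depth: int = 0) -> List[str]:
        """Return an indented, human-readable breakdown of the subtree."""
        awarded, maximum = self.score()
        line = f"{'  ' * depth}{self.name}: {awarded:g}/{maximum:g}"
        if self.rationale:
            line += f" ({self.rationale})"
        lines = [line]
        for child in self.children:
            lines.extend(child.report(depth + 1))
        return lines


# Invented example rubric; marks and rationales are placeholders.
rubric = RubricNode("Project report", children=[
    RubricNode("Problem statement", max_marks=4, awarded=3,
               rationale="scope is clear, motivation is thin"),
    RubricNode("Method", children=[
        RubricNode("Design", max_marks=6, awarded=5,
                   rationale="sound, one step unjustified"),
        RubricNode("Evaluation plan", max_marks=5, awarded=2,
                   rationale="no baseline named"),
    ]),
])

total_awarded, total_max = rubric.score()
print("\n".join(rubric.report()))
```

Keeping the rationale on each leaf means the total can be traced back to individual rubric criteria, which is the kind of explainability the abstract emphasizes.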
More-than-Human Storytelling: Designing Longitudinal Narrative Engagements with Generative AI
Fabre, Émilie, Seaborn, Katie, Koiwai, Shuta, Watanabe, Mizuki, Riesch, Paul
Longitudinal engagement with generative AI (GenAI) storytelling agents is a timely but less charted domain. We explored multi-generational experiences with "Dreamsmithy," a daily dream-crafting app, where participants (N = 28) co-created stories with AI narrator "Makoto" every day. Reflections and interactions were captured through a two-week diary study. Reflexive thematic analysis revealed themes such as "oscillating ambivalence" and "socio-chronological bonding," highlighting the complex dynamics that emerged between individuals and the AI narrator over time. Findings suggest that while people appreciated the personal notes, opportunities for reflection, and AI creativity, limitations in narrative coherence and control occasionally caused frustration. The results underscore the potential of GenAI for longitudinal storytelling, but also raise critical questions about user agency and ethics. We contribute initial empirical insights and design considerations for developing adaptive, more-than-human storytelling systems.
A Mathematical Framework for AI-Human Integration in Work
Celis, L. Elisa, Huang, Lingxiao, Vishnoi, Nisheeth K.
The rapid rise of Generative AI (GenAI) tools has sparked debate over their role in complementing or replacing human workers across job contexts. We present a mathematical framework that models jobs, workers, and worker-job fit, introducing a novel decomposition of skills into decision-level and action-level subskills to reflect the complementary strengths of humans and GenAI. We analyze how changes in subskill abilities affect job success, identifying conditions for sharp transitions in success probability. We also establish sufficient conditions under which combining workers with complementary subskills significantly outperforms relying on a single worker. This explains phenomena such as productivity compression, where GenAI assistance yields larger gains for lower-skilled workers. We demonstrate the framework's practicality using data from O*NET and Big-Bench Lite, aligning real-world data with our model via subskill-division methods. Our results highlight when and how GenAI complements human skills, rather than replacing them.
DeepSeek-R1 vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?
Larionov, Daniil, Takeshita, Sotaro, Zhang, Ran, Chen, Yanran, Leiter, Christoph, Wang, Zhipin, Greisinger, Christian, Eger, Steffen
Reasoning-enabled large language models (LLMs) excel in logical tasks, yet their utility for evaluating natural language generation remains unexplored. This study systematically compares reasoning LLMs with non-reasoning counterparts across machine translation and text summarization evaluation tasks. We evaluate eight models spanning state-of-the-art reasoning models (DeepSeek-R1, OpenAI o3), their distilled variants (8B-70B parameters), and equivalent non-reasoning LLMs. Experiments on WMT23 and SummEval benchmarks reveal architecture- and task-dependent benefits: OpenAI o3-mini models show improved performance with increased reasoning on MT, while DeepSeek-R1 generally underperforms compared to its non-reasoning variant except in summarization consistency evaluation. Correlation analysis demonstrates that reasoning token usage correlates with evaluation quality only in specific models, while almost all models generally allocate more reasoning tokens when identifying more quality issues. Distillation maintains reasonable performance up to 32B parameter models but degrades substantially at 8B scale. This work provides the first assessment of reasoning LLMs for NLG evaluation and comparison to non-reasoning models. We share our code to facilitate further research: https://github.com/NL2G/reasoning-eval.
Generative Knowledge Production Pipeline Driven by Academic Influencers
Feher, Katalin, Demeter, Marton
ABSTRACT Generative AI transforms knowledge production, validation, and dissemination, raising academic integrity and credibility concerns. This study examines 53 academic influencer videos that reached 5.3 million viewers to identify an emerging, structured, implementation-ready pipeline balancing originality, ethical compliance, and human-AI collaboration despite the disruptive impacts. Findings highlight generative AI's potential to automate publication workflows and democratize participation in knowledge production while challenging traditional scientific norms. Academic influencers emerge as key intermediaries in this paradigm shift, connecting bottom-up practices with institutional policies to improve adaptability. Accordingly, the study proposes a generative publication production pipeline and a policy framework for co-intelligence adaptation and reinforcing credibility-centered standards in AI-powered research. These insights support scholars, educators, and policymakers in understanding AI's transformative impact by advocating responsible and innovation-driven knowledge production. Additionally, they reveal pathways for automating best practices, optimizing scholarly workflows, and fostering creativity in academic research and publication. Keywords: generative AI, ChatGPT, academic integrity, influencers, knowledge production, social media, policy implications, academic policy 1. INTRODUCTION The advent of generative AI (GenAI) transforms knowledge production, increasingly supporting and partially automating the academic workflow (Bolanos et al. 2024). This trend suggests a paradigm shift where researchers use generative AI tools effectively and productively, potentially leading to more automated scientific workflows. However, we have also identified a human component in this process: the impact of academic influencers who promote hands-on knowledge about GenAI in academic projects via social media.
Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors
Pedrotti, Andrea, Papucci, Michele, Ciaccio, Cristiano, Miaschi, Alessio, Puccetti, Giovanni, Dell'Orletta, Felice, Esuli, Andrea
Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
Evaluating Gemini in an arena for learning
LearnLM Team, Modi, Abhinit, Veerubhotla, Aditya Srikanth, Rysbek, Aliya, Huber, Andrea, Anand, Ankit, Bhoopchand, Avishkar, Wiltshire, Brett, Gillick, Daniel, Kasenberg, Daniel, Sgouritsa, Eleni, Elidan, Gal, Liu, Hengrui, Winnemoeller, Holger, Jurenka, Irina, Cohan, James, She, Jennifer, Wilkowski, Julia, Alarakyia, Kaiz, McKee, Kevin R., Singh, Komal, Wang, Lisa, Kunesch, Markus, Pîslar, Miruna, Efron, Niv, Mahmoudieh, Parsa, Kamienny, Pierre-Alexandre, Wiltberger, Sara, Mohamed, Shakir, Agarwal, Shashank, Phal, Shubham Milind, Lee, Sun Jae, Strinopoulos, Theofilos, Ko, Wei-Jen, Gold-Zamir, Yael, Haramaty, Yael, Assael, Yannis
Artificial intelligence (AI) is poised to transform education, but the research community lacks a robust, general benchmark to evaluate AI models for learning. To assess state-of-the-art support for educational use cases, we ran an "arena for learning" where educators and pedagogy experts conduct blind, head-to-head, multi-turn comparisons of leading AI models. In particular, $N = 189$ educators drew from their experience to role-play realistic learning use cases, interacting with two models sequentially, after which $N = 206$ experts judged which model better supported the user's learning goals. The arena evaluated a slate of state-of-the-art models: Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4o, and OpenAI o3. Excluding ties, experts preferred Gemini 2.5 Pro in 73.2% of these match-ups -- ranking it first overall in the arena. Gemini 2.5 Pro also demonstrated markedly higher performance across key principles of good pedagogy. Altogether, these results position Gemini 2.5 Pro as a leading model for learning.
Searching Clinical Data Using Generative AI
Hanswadkar, Karan, Kanchi, Anika, Tripathi, Shivani, Qiao, Shi, Chatterjee, Rony, Jindal, Alekh
Artificial Intelligence (AI) is making a major impact on healthcare, particularly through its application in natural language processing (NLP) and predictive analytics. The healthcare sector has increasingly adopted AI for tasks such as clinical data analysis and medical code assignment. However, searching for clinical information in large and often unorganized datasets remains a manual and error-prone process. Automating parts of this process can significantly improve physicians' operational productivity. In this paper, we present a generative AI approach, coined SearchAI, to enhance the accuracy and efficiency of searching clinical data. Unlike traditional code assignment, which is a one-to-one problem, clinical data search is a one-to-many problem, i.e., a given search query can map to a family of codes. Healthcare professionals typically search for groups of related diseases, drugs, or conditions that map to many codes, and therefore, they need search tools that can handle keyword synonyms, semantic variants, and broad open-ended queries. SearchAI employs a hierarchical model that respects the coding hierarchy and improves the traversal of relationships from parent to child nodes. SearchAI navigates these hierarchies predictively and ensures that all paths are reachable without losing any relevant nodes. To evaluate the effectiveness of SearchAI, we conducted a series of experiments using both public and production datasets. Our results show that SearchAI outperforms default hierarchical traversals across several metrics, including accuracy, robustness, performance, and scalability. SearchAI can help make clinical data more accessible, leading to streamlined workflows, reduced administrative burden, and enhanced coding and diagnostic accuracy.
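To make the one-to-many nature of clinical search concrete, the sketch below runs a query against a toy code hierarchy and returns the whole family of codes under every matching node, with a small synonym table for query expansion. The codes, labels, synonym table, and function names are invented for illustration. Note that this is a plain exhaustive traversal, closer to the default hierarchical traversal the abstract uses as a baseline than to SearchAI's predictive navigation, which the abstract does not describe in detail.

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchy: codes and labels are invented, not real clinical codes.
CHILDREN: Dict[str, List[str]] = {
    "ROOT": ["D", "H"],
    "D": ["D1", "D2"],
    "D1": ["D1.0", "D1.1"],
    "D2": [],
    "D1.0": [],
    "D1.1": [],
    "H": ["H1"],
    "H1": [],
}
LABELS: Dict[str, str] = {
    "ROOT": "all conditions",
    "D": "diabetes mellitus",
    "D1": "type 2 diabetes mellitus",
    "D2": "type 1 diabetes mellitus",
    "D1.0": "type 2 diabetes with kidney complications",
    "D1.1": "type 2 diabetes without complications",
    "H": "hypertensive diseases",
    "H1": "essential hypertension",
}
# Hypothetical synonym table used to expand the query.
SYNONYMS: Dict[str, Set[str]] = {"t2dm": {"type 2 diabetes"}}


def expand(query: str) -> Set[str]:
    """Return the normalized query plus any known synonyms."""
    q = query.lower().strip()
    return {q} | SYNONYMS.get(q, set())


def subtree(code: str) -> List[str]:
    """Return `code` and all of its descendants in pre-order."""
    out = [code]
    for child in CHILDREN.get(code, []):
        out.extend(subtree(child))
    return out


def search(query: str, root: str = "ROOT") -> List[str]:
    """Return every code in the family of each topmost node whose label matches."""
    terms = expand(query)
    results: List[str] = []
    stack = [root]
    while stack:
        code = stack.pop()
        if any(term in LABELS[code] for term in terms):
            # A match at a parent pulls in the whole family beneath it.
            results.extend(subtree(code))
        else:
            # No match here, so keep descending; reversed() preserves left-to-right order.
            stack.extend(reversed(CHILDREN.get(code, [])))
    return results


print(search("t2dm"))
print(search("diabetes"))
```

Because an unmatched node always has its children queued, no path is pruned and no relevant descendant is lost, at the cost of visiting the full hierarchy on broad misses.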
Redefining Research Crowdsourcing: Incorporating Human Feedback with LLM-Powered Digital Twins
Chan, Amanda, Di, Catherine, Rupertus, Joseph, Smith, Gary, Rao, Varun Nagaraj, Ribeiro, Manoel Horta, Monroy-Hernández, Andrés
Crowd work platforms like Amazon Mechanical Turk and Prolific are vital for research, yet workers' growing use of generative AI tools poses challenges. Researchers face compromised data validity as AI responses replace authentic human behavior, while workers risk diminished roles as AI automates tasks. To address this, we propose a hybrid framework using digital twins, personalized AI models that emulate workers' behaviors and preferences while keeping humans in the loop. We evaluate our system with an experiment (n=88 crowd workers) and in-depth interviews with crowd workers (n=5) and social science researchers (n=4). Our results suggest that digital twins may enhance productivity and reduce decision fatigue while maintaining response quality. Both researchers and workers emphasized the importance of transparency, ethical data use, and worker agency. By automating repetitive tasks and preserving human engagement for nuanced ones, digital twins may help balance scalability with authenticity.
Emergent LLM behaviors are observationally equivalent to data leakage
Barrie, Christopher, Törnberg, Petter
Global convergence: Rapid convergence to a single, repeated action (a convention), maximizing joint and individual payoffs. Put simply, while the model does not explicitly identify this as a "naming game" setup, it does understand the basic structure of the scenario, the optimal moves after success, and what global convergence will look like. We conducted this analysis across a range of LLMs. We then used the OpenAI model gpt-4.1 to annotate three dimensions of each LLM's output: whether the LLM identified the setup as a coordination game; whether it correctly identified the optimal move; and whether it correctly predicted how the scenario would converge globally. We also asked gpt-4.1 to output the text snippet from the given LLM's output that it used to justify its decision.