Instructional Material
Grade Like a Human: Rethinking Automated Assessment with Large Language Models
Xie, Wenjing, Niu, Juxin, Xue, Chun Jason, Guan, Nan
While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading process. In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs.
Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions
Moore, Steven, Schmucker, Robin, Mitchell, Tom, Stamper, John
Knowledge Components (KCs) linked to assessments enhance the measurement of student learning, enrich analytics, and facilitate adaptivity. However, generating and linking KCs to assessment items requires significant effort and domain-specific knowledge. To streamline this process for higher-education courses, we employed GPT-4 to generate KCs for multiple-choice questions (MCQs) in Chemistry and E-Learning. We analyzed discrepancies between the KCs generated by the Large Language Model (LLM) and those made by humans through evaluation from three domain experts in each subject area. This evaluation aimed to determine whether, in instances of non-matching KCs, evaluators showed a preference for the LLM-generated KCs over their human-created counterparts. We also developed an ontology induction algorithm to cluster questions that assess similar KCs based on their content. Our most effective LLM strategy accurately matched KCs for 56% of Chemistry and 35% of E-Learning MCQs, with even higher success when considering the top five KC suggestions. Human evaluators favored LLM-generated KCs, choosing them over human-assigned ones approximately two-thirds of the time, a preference that was statistically significant across both domains. Our clustering algorithm successfully grouped questions by their underlying KCs without needing explicit labels or contextual information. This research advances the automation of KC generation and classification for assessment items, alleviating the need for student data or predefined KC labels.
OpenAI Forms Safety Committee as It Starts Training Latest AI Model
OpenAI says it's setting up a safety and security committee and has begun training a new AI model to supplant the GPT-4 system that underpins its ChatGPT chatbot. The San Francisco startup said in a blog post Tuesday that the committee will advise the full board on "critical safety and security decisions" for its projects and operations. The safety committee arrives as debate swirls around AI safety at the company, which was thrust into the spotlight after a researcher, Jan Leike, resigned and leveled criticism at OpenAI for letting safety "take a backseat to shiny products." OpenAI co-founder and chief scientist Ilya Sutskever also resigned, and the company disbanded the "superalignment" team focused on AI risks that they jointly led. Leike said Tuesday he's joining rival AI company Anthropic, founded by ex-OpenAI leaders, to "continue the superalignment mission" there.
Artificial Intelligence Index Report 2024
Maslej, Nestor, Fattorini, Loredana, Perrault, Raymond, Parli, Vanessa, Reuel, Anka, Brynjolfsson, Erik, Etchemendy, John, Ligett, Katrina, Lyons, Terah, Manyika, James, Niebles, Juan Carlos, Shoham, Yoav, Wald, Russell, Clark, Jack
The 2024 Index is our most comprehensive to date and arrives at an important moment when AI's influence on society has never been more pronounced. This year, we have broadened our scope to more extensively cover essential trends such as technical advancements in AI, public perceptions of the technology, and the geopolitical dynamics surrounding its development. Featuring more original data than ever before, this edition introduces new estimates on AI training costs, detailed analyses of the responsible AI landscape, and an entirely new chapter dedicated to AI's impact on science and medicine. The AI Index report tracks, collates, distills, and visualizes data related to artificial intelligence (AI). Our mission is to provide unbiased, rigorously vetted, broadly sourced data in order for policymakers, researchers, executives, journalists, and the general public to develop a more thorough and nuanced understanding of the complex field of AI. The AI Index is recognized globally as one of the most credible and authoritative sources for data and insights on artificial intelligence. Previous editions have been cited in major newspapers, including The New York Times, Bloomberg, and The Guardian, have amassed hundreds of academic citations, and have been referenced by high-level policymakers in the United States, the United Kingdom, and the European Union, among other places. This year's edition surpasses all previous ones in size, scale, and scope, reflecting the growing significance that AI is coming to hold in all of our lives.
cryoSPHERE: Single-particle heterogeneous reconstruction from cryo EM
Ducrocq, Gabriel, Grunewald, Lukas, Westenhoff, Sebastian, Lindsten, Fredrik
The three-dimensional structure of a protein plays a key role in determining its function. Methods like AlphaFold have revolutionized protein structure prediction based only on the amino-acid sequence. However, proteins often appear in multiple different conformations, and it is highly relevant to resolve the full conformational distribution. Single-particle cryo-electron microscopy (cryo EM) is a powerful tool for capturing a large number of images of a given protein, frequently in different conformations (referred to as particles). The images are, however, very noisy projections of the protein, and traditional methods for cryo EM reconstruction are limited to recovering a single, or a few, conformations. In this paper, we introduce cryoSPHERE, a deep learning method that takes as input a nominal protein structure, e.g. from AlphaFold, learns how to divide it into segments, and how to move these as approximately rigid bodies to fit the different conformations present in the cryo EM dataset. This formulation is shown to provide enough constraints to recover meaningful reconstructions of single protein structures. This is illustrated in three examples where we show consistent improvements over the current state-of-the-art for heterogeneous reconstruction.
Exploring Fairness in Educational Data Mining in the Context of the Right to be Forgotten
Qian, Wei, Chen, Aobo, Zhao, Chenxu, Li, Yangyi, Huai, Mengdi
Student data, which is a critical component in EDM research, can contain personal information, such as age and gender, as well as academic performance and activity data from online learning systems [24]. By offering valuable insights into student learning, EDM supports the development of more effective educational practices and policies, ultimately improving student outcomes. One of the most popular approaches in previous work is to incorporate machine learning techniques, which have achieved remarkable success in discovering intricate structures within educational datasets. However, in recent years, concerns about the fairness of deploying algorithmic decision-making in the educational context have emerged [2, 22, 27, 49]. Particularly, machine learning models can produce biased and unfair outcomes for certain student groups, significantly affecting their educational opportunities and achievements. Given that the data empowering EDM research can often contain personally identifiable and other sensitive information, there has been increased attention to privacy protection in recent years [37, 43]. Additionally, privacy legislation such as the California Consumer Privacy Act [39] and the former Right to be Forgotten [17] has granted users the right to erase the impact of their sensitive information from the trained models to protect their privacy. One approach to protecting users' privacy involves enabling the trained machine learning model to entirely forget … [Footnote: Both authors contributed equally to this research.]
Analyzing Chat Protocols of Novice Programmers Solving Introductory Programming Tasks with ChatGPT
Scholl, Andreas, Schiffner, Daniel, Kiesler, Natalie
The increasing need for competent computing graduates proficient in programming, software development, and related technical competencies [Ca17] is one of the factors exacerbating pressure on higher education institutions to offer high quality, competency-based education [Ra21]. However, the latter requires extensive resources, mentoring, and, for example, formative feedback for learners, especially in introductory programming classes [Je22; Lo24]. This is due to the fact that novices experience a number of challenges in the process, which have been subject to extensive research in the past decades [Du86; Lu18; SS86]. Among them are cognitively demanding competencies [Ki20; Ki24], such as problem understanding, designing and writing algorithms, debugging, and understanding error messages [Du86; ER16; Ki20; Lu18; SS86]. Educators' expectations towards novice learners and what they can achieve in their first semester(s) seem to be too high and unrealistic [Lu16; Lu18; WCL07]. Moreover, the student-educator ratio in introductory programming classes keeps increasing in German higher education institutions, thereby limiting resources to provide feedback and hints, and adequately address heterogeneous prior knowledge and diverse educational biographies [Pe16; SB22].
Safety through Permissibility: Shield Construction for Fast and Safe Reinforcement Learning
Politowicz, Alexander, Mazumder, Sahisnu, Liu, Bing
Designing Reinforcement Learning (RL) solutions for real-life problems remains a significant challenge. A major area of concern is safety. "Shielding" is a popular technique to enforce safety in RL by turning user-defined safety specifications into safe agent behavior. However, these methods either suffer from extreme learning delays, demand extensive human effort in designing models and safe domains in the problem, or require pre-computation. In this paper, we propose a new permissibility-based framework to deal with safety and shield construction. Permissibility was originally designed for eliminating (non-permissible) actions that will not lead to an optimal solution to improve RL training efficiency. This paper shows that safety can be naturally incorporated into this framework, i.e. extending permissibility to include safety, and thereby we can achieve both safety and improved efficiency. Experimental evaluation using three standard RL applications shows the effectiveness of the approach.
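The general idea of a shield that removes disallowed actions before the agent chooses can be illustrated with a minimal sketch. This is not the paper's permissibility framework; it assumes a tabular agent, an epsilon-greedy policy, and a hand-written predicate. The names `shielded_epsilon_greedy` and `corridor_shield`, and the toy corridor environment, are hypothetical and introduced only for illustration.

```python
import random


def shielded_epsilon_greedy(q_values, state, is_permissible, epsilon=0.1, rng=random):
    """Choose an epsilon-greedy action among those the shield allows.

    q_values: list of action values for `state`, indexed by action id.
    is_permissible: hypothetical predicate (state, action) -> bool that
        encodes the user-defined safety specification.
    """
    allowed = [a for a in range(len(q_values)) if is_permissible(state, a)]
    if not allowed:
        raise ValueError("shield rejected every action in this state")
    if rng.random() < epsilon:
        # Explore, but only inside the permitted action set.
        return rng.choice(allowed)
    # Exploit: best permitted action, even if a rejected action scores higher.
    return max(allowed, key=lambda a: q_values[a])


def corridor_shield(state, action):
    """Toy specification (assumed): positions 1..4 are safe, everything else is not.

    Action 0 moves left, action 1 moves right.
    """
    next_state = state - 1 if action == 0 else state + 1
    return 1 <= next_state <= 4
```

With `epsilon=0.0` and values `[10.0, 0.0]` at position 1, the agent is forced to move right because moving left would leave the safe region, regardless of the higher value assigned to it.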
Auxiliary Knowledge-Induced Learning for Automatic Multi-Label Medical Document Classification
Wang, Xindi, Mercer, Robert E., Rudzicz, Frank
The International Classification of Diseases (ICD) is an authoritative medical classification system of different diseases and conditions for clinical and management purposes. ICD indexing assigns a subset of ICD codes to a medical record. Since human coding is labour-intensive and error-prone, many studies employ machine learning to automate the coding process. ICD coding is a challenging task, as it needs to assign multiple codes to each medical document from an extremely large hierarchically organized collection. In this paper, we propose a novel approach for ICD indexing that adopts three ideas: (1) we use a multi-level deep dilated residual convolution encoder to aggregate the information from the clinical notes and learn document representations across different lengths of the texts; (2) we formalize the task of ICD classification with auxiliary knowledge of the medical records, which incorporates not only the clinical texts but also different clinical code terminologies and drug prescriptions for better inferring the ICD codes; and (3) we introduce a graph convolutional network to leverage the co-occurrence patterns among ICD codes, aiming to enhance the quality of label representations. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures.
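As a rough illustration of what "co-occurrence patterns among ICD codes" could mean in practice, the sketch below builds directed edge weights from multi-label records. It is an assumption-laden example, not the authors' model: the conditional-frequency weighting, the function name `cooccurrence_edges`, and the placeholder labels "A", "B", "C" (standing in for real ICD codes) are all hypothetical. Weights of this kind are one common way to form the adjacency matrix that a graph convolutional network consumes.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(records):
    """Return {(a, b): P(b | a)} estimated from multi-label records.

    records: iterable of label collections, one per document.
    P(b | a) is the fraction of documents carrying label a that also carry b.
    """
    label_counts = Counter()
    pair_counts = Counter()
    for labels in records:
        unique = sorted(set(labels))  # ignore duplicate labels within a record
        label_counts.update(unique)
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1

    edges = {}
    for (a, b), n in pair_counts.items():
        edges[(a, b)] = n / label_counts[a]  # weight of edge a -> b
        edges[(b, a)] = n / label_counts[b]  # weight of edge b -> a
    return edges


# Placeholder labels, not real ICD codes.
example_records = [["A", "B"], ["A", "B", "C"], ["A"]]
example_edges = cooccurrence_edges(example_records)
```

The weights are asymmetric by design: in the example, every record labelled "B" is also labelled "A", but only two of the three "A" records carry "B".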
Challenge-Device-Synthesis: A multi-disciplinary approach for the development of social innovation competences for students of Artificial Intelligence
Bilkis, Matías, Kohler, Joan Moya, Vilariño, Fernando
The advent of Artificial Intelligence is expected to bring profound changes in the short term. It is therefore imperative for Academia, and particularly for the field of Computer Science, to develop cross-disciplinary tools that bond AI developments to their social dimension. To this aim, we introduce the Challenge-Device-Synthesis methodology (CDS), in which a specific challenge is presented to the students of AI, who are required to develop a device as a solution for the challenge. The device becomes the object of study for the different dimensions of social transformation, and the conclusions reached by the students during the discussion around the device are presented in a synthesis piece in the shape of a 10-page scientific paper. The latter is evaluated taking into account both the depth of analysis and the level to which it genuinely reflects the social transformations associated with the proposed AI-based device. We provide data obtained during the pilot for the implementation phase of CDS within the subject of Social Innovation, a 6-ECTS subject from the 6th semester of the Degree of Artificial Intelligence, UAB-Barcelona. We provide details on scheduling, task distribution, methodological tools used and assessment delivery procedure, as well as qualitative analysis of the results obtained.