Instructional Material
CSEPrompts: A Benchmark of Introductory Computer Science Prompts
Raihan, Nishat, Goswami, Dhiman, Puspo, Sadiya Sayara Chowdhury, Newman, Christian, Ranasinghe, Tharindu, Zampieri, Marcos
Recent advances in AI, machine learning, and NLP have led to the development of a new generation of Large Language Models (LLMs) that are trained on massive amounts of data and often have trillions of parameters. Commercial applications (e.g., ChatGPT) have made this technology available to the general public, thus making it possible to use LLMs to produce high-quality texts for academic and professional purposes. Schools and universities are aware of the increasing use of AI-generated content by students and they have been researching the impact of this new technology and its potential misuse. Educational programs in Computer Science (CS) and related fields are particularly affected because LLMs are also capable of generating programming code in various programming languages. To help understand the potential impact of publicly available LLMs in CS education, we introduce CSEPrompts, a framework with hundreds of programming exercise prompts and multiple-choice questions retrieved from introductory CS and programming courses. We also provide experimental results on CSEPrompts to evaluate the performance of several LLMs with respect to generating Python code and answering basic computer science and programming questions.
A Systems Theoretic Approach to Online Machine Learning
Preez, Anli du, Beling, Peter A., Cody, Tyler
The machine learning formulation of online learning is incomplete from a systems theoretic perspective. Typically, machine learning research emphasizes domains and tasks, and a problem solving worldview. It focuses on algorithm parameters, features, and samples, and neglects the perspective offered by considering system structure and system behavior or dynamics. Online learning is an active field of research and has been widely explored in terms of statistical theory and computational algorithms, however, in general, the literature still lacks formal system theoretical frameworks for modeling online learning systems and resolving systems-related concept drift issues. Furthermore, while the machine learning formulation serves to classify methods and literature, the systems theoretic formulation presented herein serves to provide a framework for the top-down design of online learning systems, including a novel definition of online learning and the identification of key design parameters. The framework is formulated in terms of input-output systems and is further divided into system structure and system behavior. Concept drift is a critical challenge faced in online learning, and this work formally approaches it as part of the system behavior characteristics. Healthcare provider fraud detection using machine learning is used as a case study throughout the paper to ground the discussion in a real-world online learning challenge.
Survey of Computerized Adaptive Testing: A Machine Learning Perspective
Liu, Qi, Zhuang, Yan, Bi, Haoyang, Huang, Zhenya, Huang, Weizhe, Li, Jiatong, Yu, Junhao, Liu, Zirui, Hu, Zirui, Hong, Yuting, Pardos, Zachary A., Ma, Haiping, Zhu, Mengxiao, Wang, Shijin, Chen, Enhong
Computerized Adaptive Testing (CAT) provides an efficient and tailored method for assessing the proficiency of examinees, by dynamically adjusting test questions based on their performance. Widely adopted across diverse fields like education, healthcare, sports, and sociology, CAT has revolutionized testing practices. While traditional methods rely on psychometrics and statistics, the increasing complexity of large-scale testing has spurred the integration of machine learning techniques. This paper aims to provide a machine learning-focused survey on CAT, presenting a fresh perspective on this adaptive testing method. By examining the test question selection algorithm at the heart of CAT's adaptivity, we shed light on its functionality. Furthermore, we delve into cognitive diagnosis models, question bank construction, and test control within CAT, exploring how machine learning can optimize these components. Through an analysis of current methods, strengths, limitations, and challenges, we strive to develop robust, fair, and efficient CAT systems. By bridging psychometric-driven CAT research with machine learning, this survey advocates for a more inclusive and interdisciplinary approach to the future of adaptive testing.
Pushing Buttons: When we don't know the true sales figures for consoles, players lose out
The outgoing boss of Sony's games division, Jim Ryan, who joined the company a few months before the launch of the original PlayStation, was interviewed by the official PlayStation podcast last week to mark his retirement. He talked up the PlayStation 5 as potentially Sony's "most successful ever console across multiple vectors" – interestingly, he did not specify what those vectors actually were. It would have to go some to beat the PlayStation 2's total of 160m – so far it's sold about 55m. As for that PlayStation 2 total: that's actually the first time we've heard it, in this podcast, in 2024, despite the fact that the PS2 was discontinued in 2013. The last official number we had for the PS2 was "more than 155m" as of March 2012, a number that's still quoted on Sony's own website.
Automated Inference of Graph Transformation Rules
Andersen, Jakob L., Davoodi, Akbar, Fagerberg, Rolf, Flamm, Christoph, Fontana, Walter, Kolčák, Juri, Laurent, Christophe V. F. P., Merkle, Daniel, Nøjgaard, Nikolai
The explosion of data available in life sciences is fueling an increasing demand for expressive models and computational methods. Graph transformation is a model for dynamic systems with a large variety of applications. We introduce a novel method of the graph transformation model construction, combining generative and dynamical viewpoints to give a fully automated data-driven model inference method. The method takes the input dynamical properties, given as a "snapshot" of the dynamics encoded by explicit transitions, and constructs a compatible model. The obtained model is guaranteed to be minimal, thus framing the approach as model compression (from a set of transitions into a set of rules). The compression is permissive to a lossy case, where the constructed model is allowed to exhibit behavior outside of the input transitions, thus suggesting a completion of the input dynamics. The task of graph transformation model inference is naturally highly challenging due to the combinatorics involved. We tackle the exponential explosion by proposing a heuristically minimal translation of the task into a well-established problem, set cover, for which highly optimized solutions exist. We further showcase how our results relate to Kolmogorov complexity expressed in terms of graph transformation.
The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education
Xu, Paiheng, Liu, Jing, Jones, Nathan, Cohen, Julie, Ai, Wei
Assessing instruction quality is a fundamental component of any improvement efforts in the education system. However, traditional manual assessments are expensive, subjective, and heavily dependent on observers' expertise and idiosyncratic factors, preventing teachers from getting timely and frequent feedback. Different from prior research that mostly focuses on low-inference instructional practices on a singular basis, this paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices in two distinct educational settings: in-person K-12 classrooms and simulated performance tasks for pre-service teachers. This is also the first study that applies NLP to measure a teaching practice that is widely acknowledged to be particularly effective for students with special needs. We confront two challenges inherent in NLP-based instructional analysis, including noisy and long input data and highly skewed distributions of human ratings. Our results suggest that pretrained Language Models (PLMs) demonstrate performances comparable to the agreement level of human raters for variables that are more discrete and require lower inference, but their efficacy diminishes with more complex teaching practices. Interestingly, using only teachers' utterances as input yields strong results for student-centered variables, alleviating common concerns over the difficulty of collecting and transcribing high-quality student speech data in in-person teaching settings. Our findings highlight both the potential and the limitations of current NLP techniques in the education domain, opening avenues for further exploration.
DSGNN: A Dual-View Supergrid-Aware Graph Neural Network for Regional Air Quality Estimation
Zhang, Xin, Chen, Ling, Tang, Xing, Shi, Hongyu
Air quality estimation can provide air quality for target regions without air quality stations, which is useful for the public. Existing air quality estimation methods divide the study area into disjointed grid regions, and apply 2D convolution to model the spatial dependencies of adjacent grid regions based on the first law of geography, failing to model the spatial dependencies of distant grid regions. To this end, we propose a Dual-view Supergrid-aware Graph Neural Network (DSGNN) for regional air quality estimation, which can model the spatial dependencies of distant grid regions from dual views (i.e., satellite-derived aerosol optical depth (AOD) and meteorology). Specifically, images are utilized to represent the regional data (i.e., AOD data and meteorology data). The dual-view supergrid learning module is introduced to generate supergrids in a parameterized way. Based on the dual-view supergrids, the dual-view implicit correlation encoding module is introduced to learn the correlations between pairwise supergrids. In addition, the dual-view message passing network is introduced to implement the information interaction on the supergrid graphs and images. Extensive experiments on two real-world datasets demonstrate that DSGNN achieves the state-of-the-art performances on the air quality estimation task, outperforming the best baseline by an average of 19.64% in MAE.
Learning Equi-angular Representations for Online Continual Learning
Seo, Minhyuk, Koh, Hyunseo, Jeung, Wonje, Lee, Minjae, Kim, San, Lee, Hankook, Cho, Sungjun, Choi, Sungik, Kim, Hyunwoo, Choi, Jonghyun
Online continual learning suffers from an underfitted solution due to insufficient training for prompt model update (e.g., single-epoch training). To address the challenge, we propose an efficient online continual learning method using the neural collapse phenomenon. In particular, we induce neural collapse to form a simplex equiangular tight frame (ETF) structure in the representation space so that the continuously learned model with a single epoch can better fit to the streamed data by proposing preparatory data training and residual correction in the representation space. With an extensive set of empirical validations using CIFAR-10/100, TinyImageNet, ImageNet-200, and ImageNet-1K, we show that our proposed method outperforms state-of-the-art methods by a noticeable margin in various online continual learning scenarios such as disjoint and Gaussian scheduled continuous (i.e., boundary-free) data setups.
Peer-aided Repairer: Empowering Large Language Models to Repair Advanced Student Assignments
Zhao, Qianhui, Liu, Fang, Zhang, Li, Liu, Yang, Yan, Zhen, Chen, Zhenghao, Zhou, Yufei, Jiang, Jing, Li, Ge
Automated generation of feedback on programming assignments holds significant benefits for programming education, especially when it comes to advanced assignments. Automated Program Repair techniques, especially Large Language Model based approaches, have gained notable recognition for their potential to fix introductory assignments. However, the programs used for evaluation are relatively simple. It remains unclear how existing approaches perform in repairing programs from higher-level programming courses. To address these limitations, we curate a new advanced student assignment dataset named Defects4DS from a higher-level programming course. Subsequently, we identify the challenges related to fixing bugs in advanced assignments. Based on the analysis, we develop a framework called PaR that is powered by the LLM. PaR works in three phases: Peer Solution Selection, Multi-Source Prompt Generation, and Program Repair. Peer Solution Selection identifies the closely related peer programs based on lexical, semantic, and syntactic criteria. Then Multi-Source Prompt Generation adeptly combines multiple sources of information to create a comprehensive and informative prompt for the last Program Repair stage. The evaluation on Defects4DS and another well-investigated ITSP dataset reveals that PaR achieves a new state-of-the-art performance, demonstrating impressive improvements of 19.94% and 15.2% in repair rate compared to prior state-of-the-art LLM- and symbolic-based approaches, respectively
Automated Assessment of Encouragement and Warmth in Classrooms Leveraging Multimodal Emotional Features and ChatGPT
Hou, Ruikun, Fütterer, Tim, Bühler, Babette, Bozkir, Efe, Gerjets, Peter, Trautwein, Ulrich, Kasneci, Enkelejda
Classroom observation protocols standardize the assessment of teaching effectiveness and facilitate comprehension of classroom interactions. Whereas these protocols offer teachers specific feedback on their teaching practices, the manual coding by human raters is resource-intensive and often unreliable. This has sparked interest in developing AI-driven, cost-effective methods for automating such holistic coding. Our work explores a multimodal approach to automatically estimating encouragement and warmth in classrooms, a key component of the Global Teaching Insights (GTI) study's observation protocol. To this end, we employed facial and speech emotion recognition with sentiment analysis to extract interpretable features from video, audio, and transcript data. The prediction task involved both classification and regression methods. Additionally, in light of recent large language models' remarkable text annotation capabilities, we evaluated ChatGPT's zero-shot performance on this scoring task based on transcripts. We demonstrated our approach on the GTI dataset, comprising 367 16-minute video segments from 92 authentic lesson recordings. The inferences of GPT-4 and the best-trained model yielded correlations of r = .341 and r = .441 with human ratings, respectively. Combining estimates from both models through averaging, an ensemble approach achieved a correlation of r = .513, comparable to human inter-rater reliability. Our model explanation analysis indicated that text sentiment features were the primary contributors to the trained model's decisions. Moreover, GPT-4 could deliver logical and concrete reasoning as potential teacher guidelines. Our findings provide insights into using advanced, multimodal techniques for automated classroom observation, aiming to foster teacher training through frequent and valuable feedback.