Instructional Material
EvalYaks: Instruction Tuning Datasets and LoRA Fine-tuned Models for Automated Scoring of CEFR B2 Speaking Assessment Transcripts
Scaria, Nicy, Kennedy, Silvester John Joseph, Latinovich, Thomas, Subramani, Deepak
Relying on human experts to evaluate CEFR speaking assessments in an e-learning environment creates scalability challenges, as it limits how quickly and widely assessments can be conducted. We aim to automate the evaluation of CEFR B2 English speaking assessments in e-learning environments from conversation transcripts. First, we evaluate the capability of leading open source and commercial Large Language Models (LLMs) to score a candidate's performance across various criteria in the CEFR B2 speaking exam in both global and India-specific contexts. Next, we create a new expert-validated, CEFR-aligned synthetic conversational dataset with transcripts that are rated at different assessment scores. In addition, new instruction-tuned datasets are developed from the English Vocabulary Profile (up to CEFR B2 level) and the CEFR-SP WikiAuto datasets. Finally, using these new datasets, we perform parameter efficient instruction tuning of Mistral Instruct 7B v0.2 to develop a family of models called EvalYaks. Four models in this family are for assessing the four sections of the CEFR B2 speaking exam, one for identifying the CEFR level of vocabulary and generating level-specific vocabulary, and another for detecting the CEFR level of text and generating level-specific text. EvalYaks achieved an average acceptable accuracy of 96%, a degree of variation of 0.35 levels, and performed 3 times better than the next best model. This demonstrates that a 7B parameter LLM instruction tuned with high-quality CEFR-aligned assessment data can effectively evaluate and score CEFR B2 English speaking assessments, offering a promising solution for scalable, automated language proficiency evaluation.
Variance reduction of diffusion model's gradients with Taylor approximation-based control variate
Jeha, Paul, Grathwohl, Will, Andersen, Michael Riis, Ek, Carl Henrik, Frellsen, Jes
Score-based models, trained with denoising score matching, are remarkably effective in generating high-dimensional data. However, the high variance of their training objective hinders optimisation. We attempt to reduce it with a control variate, derived via a $k$-th order Taylor expansion on the training objective and its gradient. We prove an equivalence between the two and demonstrate empirically the effectiveness of our approach on a low-dimensional problem setting, and study its effect on larger problems.
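To make the control-variate idea concrete, here is a minimal sketch on a toy scalar problem. It is not the paper's estimator: the function `f`, the Gaussian input, and the first-order expansion point are all assumptions chosen so that the mean of the Taylor term is known in closed form.

```python
import math
import random
import statistics


def f(x):
    # Toy stand-in for a noisy objective (an assumption, not the paper's loss).
    return math.exp(x)


def taylor_control_variate(x):
    # First-order Taylor expansion of f around 0: f(0) + f'(0) * x = 1 + x.
    return 1.0 + x


# For X ~ N(0, 1), E[1 + X] = 1 is known exactly.
CONTROL_VARIATE_MEAN = 1.0


def estimate(n_samples, seed=0):
    """Return per-sample values of the plain and control-variate estimators."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    plain = [f(x) for x in xs]
    # Subtract a zero-mean term (control variate minus its known mean),
    # which leaves the expectation unchanged but can lower the variance.
    corrected = [
        f(x) - (taylor_control_variate(x) - CONTROL_VARIATE_MEAN) for x in xs
    ]
    return plain, corrected


plain, corrected = estimate(20_000)
print("plain     mean/var:", statistics.fmean(plain), statistics.variance(plain))
print("corrected mean/var:", statistics.fmean(corrected), statistics.variance(corrected))
```

Both estimators target E[exp(X)] = exp(1/2). In this toy setting the per-sample variance drops from e(e - 1), about 4.67, to about 2.37 in expectation, because the linear Taylor term is correlated with `f`.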
MedDiT: A Knowledge-Controlled Diffusion Transformer Framework for Dynamic Medical Image Generation in Virtual Simulated Patient
Li, Yanzeng, Zeng, Cheng, Zhang, Jinchao, Zhou, Jie, Zou, Lei
Medical education relies heavily on Simulated Patients (SPs) to provide a safe environment for students to practice clinical skills, including medical image analysis. However, the high cost of recruiting qualified SPs and the lack of diverse medical imaging datasets have presented significant challenges. To address these issues, this paper introduces MedDiT, a novel knowledge-controlled conversational framework that can dynamically generate plausible medical images aligned with simulated patient symptoms, enabling diverse diagnostic skill training. Specifically, MedDiT integrates various patient Knowledge Graphs (KGs), which describe the attributes and symptoms of patients, to dynamically prompt Large Language Models' (LLMs) behavior and control the patient characteristics, mitigating hallucination during medical conversation. Additionally, a well-tuned Diffusion Transformer (DiT) model is incorporated to generate medical images according to the specified patient attributes in the KG. In this paper, we present the capabilities of MedDiT through a practical demonstration, showcasing its ability to act in diverse simulated patient cases and generate the corresponding medical images. This can provide an abundant and interactive learning experience for students, advancing medical education by offering an immersive simulation platform for future healthcare professionals. The work sheds light on the feasibility of incorporating advanced technologies like LLM, KG, and DiT in education applications, highlighting their potential to address the challenges faced in simulated patient-based medical education.
Phrasing for UX: Enhancing Information Engagement through Computational Linguistics and Creative Analytics
This underscores the critical role of information as a precursor to knowledge, rather than knowledge itself (Zins, 2007; Frické, 2009). In digital environments, symbols, letters, words, and phrases have the potential to contribute to knowledge formation, necessitating effective communication and optimal information presentation for Information Systems (IS) success (Delone & McLean, 2003; Venkatesh & Bala, 2008; ISO, 2019). Engagement, defined as the emotional, cognitive, and behavioral connection between users and technological resources, has emerged as a key metric for evaluating user experience (UX), reflecting user interaction depth with a system (O'Brien et al., 2020; Attfield et al., 2011; O'Brien & Cairns, 2016). The digitization of communication through Information and Communication Technologies (ICT) has revolutionized information conveyance, demanding engaging and effective digital content to ensure successful knowledge transmission and user retention (Beaudry, 2005; Dvir, 2018). Information Engagement (IE) has gained prominence, focusing on the quality of user-system interactions and the impact of digital content design on user decision-making and UX (ISO, 2019; O'Brien, 2020). IE is crucial in enhancing user interactions across domains such as education, government, and industry, aiming to foster meaningful user engagement with digital text (Choi et al., 2018; Feng et al., 2020; Han et al., 2022). Failure to achieve IE with digital text hinders content producers, yet overcoming this challenge is complicated by a lack of engaging information experience guidelines (Blythe, 2005; Overbeeke et al., 2003). Limited research on IE development has resulted in a scarcity of systematic approaches for its initiation, sustainment, and improvement (O'Brien, 2017; O'Brien & Toms, 2016).
SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging
Pourreza, Mohammadreza, Sun, Ruoxi, Li, Hailong, Miculicich, Lesly, Pfister, Tomas, Arik, Sercan O.
Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress primarily for the SQLite dialect. However, adapting these systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by dialect-specific tutorials, and demonstrate its effectiveness in creating training datasets for multiple dialects. Our approach significantly improves performance, by up to 20%, over previous methods and reduces the gap with large-scale human-annotated datasets. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of 3.3% to 5.6%. We also introduce a novel Mixture of Experts (MoE) initialization method that integrates dialect-specific models into a unified system by merging self-attention layers and initializing the gates with dialect-specific keywords, further enhancing performance across different SQL dialects.
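The dialect gap can be illustrated with a small sketch. The queries and the `orders` table below are hand-written, hypothetical examples (not taken from SQL-GEN), and the PostgreSQL and BigQuery variants assume `order_date` is a DATE column in those systems. Only the SQLite variant is executed here.

```python
import sqlite3

# The same intent ("year of each order") written per dialect.
# These strings are illustrative assumptions, not output of SQL-GEN.
YEAR_OF_ORDER = {
    "sqlite": "SELECT strftime('%Y', order_date) FROM orders",
    "postgresql": "SELECT EXTRACT(YEAR FROM order_date) FROM orders",
    "bigquery": "SELECT EXTRACT(YEAR FROM order_date) FROM orders",
}


def run_on_sqlite(query):
    """Run a query against a throwaway in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (order_date TEXT)")
        conn.execute("INSERT INTO orders VALUES ('2024-03-15')")
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# The SQLite-dialect query runs as expected.
sqlite_rows = run_on_sqlite(YEAR_OF_ORDER["sqlite"])

# The PostgreSQL-style query is not valid SQLite syntax.
try:
    run_on_sqlite(YEAR_OF_ORDER["postgresql"])
    postgres_query_ran = True
except sqlite3.OperationalError:
    postgres_query_ran = False

print(sqlite_rows, postgres_query_ran)
```

A model trained mostly on SQLite-style queries would tend to emit `strftime` here, which is one reason dialect-specific training data matters.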
Mixstyle-Entropy: Domain Generalization with Causal Intervention and Perturbation
Tang, Luyao, Yuan, Yuxuan, Chen, Chaoqi, Ding, Xinghao, Huang, Yue
Despite the considerable advancements achieved by deep neural networks, their performance tends to degenerate when the test environment diverges from the training environments. Domain generalization (DG) aims to address this issue by learning representations independent of domain-related information, thus facilitating extrapolation to unseen environments. Existing approaches typically focus on formulating tailored training objectives to extract shared features from the source data. However, the disjointed training and testing procedures may compromise robustness, particularly in the face of unforeseen variations during deployment. In this paper, we propose a novel and holistic framework based on causality, named InPer, designed to enhance model generalization by incorporating causal intervention during training and causal perturbation during testing. Specifically, during the training phase, we employ entropy-based causal intervention (EnIn) to refine the selection of causal variables. To identify samples with anti-interference causal variables from the target domain, we propose a novel metric, the homeostatic score, computed through causal perturbation (HoPer), and use it to construct a prototype classifier at test time. Experimental results across multiple cross-domain tasks confirm the efficacy of InPer.
A Constraint Programming Approach to Fair High School Course Scheduling
Kiyohara, Mitsuka, Ishihata, Masakazu
Issues of inequity in U.S. high schools' course scheduling did not previously exist. However, in recent years, with the increase in student population and course variety, students perceive that the course scheduling method is unfair. Current integer programming (IP) methods for the high school scheduling problem (HSSP) fall short in addressing these fairness concerns. The purpose of this research is to develop a solution methodology that generates feasible and fair course schedules using student preferences. Utilizing principles of fairness, which have been well studied in market design, we define the fair high school scheduling problem (FHSSP), a novel extension to the HSSP, and devise a corresponding algorithm based on integer programming to solve the FHSSP. We test our approach on a real course request dataset from a high school in California, USA. Results show that our algorithm can generate schedules that are both feasible and fair. In this paper, we demonstrate that our IP algorithm not only solves the HSSP and FHSSP in the United States but also has the potential to be applied to various real-world scheduling problems. Additionally, we show the feasibility of integrating human emotions into mathematical modeling.
Can AI be used ethically for school work? Here's what teachers say
Can AI be used ethically for school work? It depends upon who you ask -- quite literally. That's because less than two years after ChatGPT was originally released in November 2022, the attitudes towards AI in the classroom still vary widely. High schools have viewed AI as a crutch at best, and at worst as a tool for cheating. But several universities leave generative AI use entirely up to the discretion of the person teaching the course.
Accelerated training of deep learning surrogate models for surface displacement and flow, with application to MCMC-based history matching of CO2 storage operations
Han, Yifu, Hamon, Francois P., Durlofsky, Louis J.
Deep learning surrogate modeling shows great promise for subsurface flow applications, but the training demands can be substantial. Here we introduce a new surrogate modeling framework to predict CO2 saturation, pressure and surface displacement for use in the history matching of carbon storage operations. Rather than train using a large number of expensive coupled flow-geomechanics simulation runs, training here involves a large number of inexpensive flow-only simulations combined with a much smaller number of coupled runs. The flow-only runs use an effective rock compressibility, which is shown to provide accurate predictions for saturation and pressure for our system. A recurrent residual U-Net architecture is applied for the saturation and pressure surrogate models, while a new residual U-Net model is introduced to predict surface displacement. The surface displacement surrogate accepts, as inputs, geomodel quantities along with saturation and pressure surrogate predictions. Median relative error for a diverse test set is less than 4% for all variables. The surrogate models are incorporated into a hierarchical Markov chain Monte Carlo history matching workflow. Surrogate error is included using a new treatment involving the full model error covariance matrix. A high degree of prior uncertainty, with geomodels characterized by uncertain geological scenario parameters (metaparameters) and associated realizations, is considered. History matching results for a synthetic true model are generated using in-situ monitoring-well data only, surface displacement data only, and both data types. The enhanced uncertainty reduction achieved with both data types is quantified. Posterior saturation and surface displacement fields are shown to correspond well with the true solution.
Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models
Chen, Yuyan, Wu, Chenwei, Yan, Songzhou, Liu, Panjun, Zhou, Haoyu, Xiao, Yanghua
Teachers are central to imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs' capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Therefore, our research introduces a benchmark to evaluate the questioning capability of LLMs as teachers in education by assessing the educational questions they generate, utilizing Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability by guiding them to generate questions. We apply four metrics (relevance, coverage, representativeness, and consistency) to evaluate the educational quality of LLMs' outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human perspectives.