Higher Education
LLMs' Reshaping of People, Processes, Products, and Society in Software Development: A Comprehensive Exploration with Early Adopters
Tabarsi, Benyamin, Reichert, Heidi, Limke, Ally, Kuttal, Sandeep, Barnes, Tiffany
Large language models (LLMs) like OpenAI ChatGPT, Google Gemini, and GitHub Copilot are rapidly gaining traction in the software industry, but their full impact on software engineering remains insufficiently explored. Despite their growing adoption, there is a notable lack of formal, qualitative assessments of how LLMs are applied in real-world software development contexts. To fill this gap, we conducted semi-structured interviews with sixteen early-adopter professional developers to explore their use of LLMs throughout various stages of the software development life cycle. Our investigation examines four dimensions: people - how LLMs affect individual developers and teams; process - how LLMs alter software engineering workflows; product - LLM impact on software quality and innovation; and society - the broader socioeconomic and ethical implications of LLM adoption. Thematic analysis of our data reveals that while LLMs have not fundamentally revolutionized the development process, they have substantially enhanced routine coding tasks, including code generation, refactoring, and debugging. Developers reported the most effective outcomes when providing LLMs with clear, well-defined problem statements, indicating that LLMs excel with decomposed problems and specific requirements. Furthermore, these early adopters found that LLMs offer significant value for personal and professional development, aiding in learning new languages and concepts. Being highly skilled in software engineering and knowledgeable about how LLMs work, they also identified early and persistent challenges, such as inaccuracies in generated content and the need for careful manual review before integrating LLM outputs into production environments. Our study provides a nuanced understanding of how LLMs are shaping the landscape of software development, highlighting their benefits, limitations, and ongoing implications.
WIP: Assessing the Effectiveness of ChatGPT in Preparatory Testing Activities
Haldar, Susmita, Pierce, Mary, Capretz, Luiz Fernando
This innovative practice WIP paper describes a research study that explores the integration of ChatGPT into the software testing curriculum and evaluates its effectiveness compared to human-generated testing artifacts. In a Capstone Project course, students were tasked with using ChatGPT prompts to generate preparatory testing artifacts that they had previously created manually. Their understanding and the effectiveness of the AI-generated artifacts were assessed through targeted questions. The results, drawn from this in-class assignment at a North American community college, indicate that while ChatGPT can automate many testing preparation tasks, it cannot fully replace human expertise. However, the students, already familiar with Information Technology at the postgraduate level, found the integration of ChatGPT into their workflow to be straightforward. The study suggests that AI can be gradually introduced into software testing education to keep pace with technological advancements.
OIPR: Evaluation for Time-series Anomaly Detection Inspired by Operator Interest
Jing, Yuhan, Wang, Jingyu, Zhang, Lei, Sun, Haifeng, He, Bo, Zhuang, Zirui, Wang, Chengsen, Qi, Qi, Liao, Jianxin
With the growing adoption of time-series anomaly detection (TAD) technology, numerous studies have employed deep learning-based detectors for analyzing time-series data in the fields of Internet services, industrial systems, and sensors. The selection and optimization of anomaly detectors strongly rely on the availability of an effective performance evaluation method for TAD. Since anomalies in time-series data often manifest as a sequence of points, conventional metrics that solely consider the detection of individual points are inadequate. Existing evaluation methods for TAD typically employ point-based or event-based metrics to capture the temporal context. However, point-based metrics tend to overestimate detectors that excel only in detecting long anomalies, while event-based metrics are susceptible to being misled by fragmented detection results. To address these limitations, we propose OIPR, a novel set of TAD evaluation metrics. It models the process of operators receiving detector alarms and handling faults, using the area under the operator interest curve to evaluate the performance of TAD algorithms. Furthermore, we build a special-scenario dataset to compare the characteristics of different evaluation methods. Through experiments conducted on the special-scenario dataset and five real-world datasets, we demonstrate the remarkable performance of OIPR in extreme and complex scenarios. It achieves a balance between the point and event perspectives, overcoming their primary limitations and offering applicability to broader situations.
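To make the tension between the point and event perspectives concrete, the toy sketch below (our illustration, not the OIPR metric itself) computes point-wise and event-wise recall for two detectors on a labelled series with one long and one short anomaly: a detector that covers only the long event scores highly under the point view, while a detector that merely touches each event scores highly under the event view.

```python
# Minimal sketch (not the OIPR metric): contrasts point-wise and event-wise
# recall on a toy labelled series, illustrating why point-based scores favour
# detectors that only catch long anomalies.

def point_recall(labels, preds):
    """Fraction of anomalous points that are flagged."""
    hits = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    total = sum(labels)
    return hits / total if total else 0.0

def events(labels):
    """Return (start, end) index pairs of contiguous anomalous segments."""
    segs, start = [], None
    for i, y in enumerate(labels):
        if y == 1 and start is None:
            start = i
        elif y == 0 and start is not None:
            segs.append((start, i - 1))
            start = None
    if start is not None:
        segs.append((start, len(labels) - 1))
    return segs

def event_recall(labels, preds):
    """Fraction of anomalous segments with at least one flagged point."""
    segs = events(labels)
    hit = sum(1 for s, e in segs if any(preds[s:e + 1]))
    return hit / len(segs) if segs else 0.0

# One long anomaly (10 points) and one short anomaly (1 point).
labels  = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
# Detector A catches only the long event; detector B touches each event once.
preds_a = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds_b = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

print(point_recall(labels, preds_a), event_recall(labels, preds_a))  # ~0.91, 0.5
print(point_recall(labels, preds_b), event_recall(labels, preds_b))  # ~0.18, 1.0
```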
A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information
Susanto, Lucky, Wijanarko, Musa, Pratama, Prasetia, Tang, Zilu, Akyas, Fariz, Hong, Traci, Idris, Ika, Aji, Alham, Wijaya, Derry
Polarization is defined as divisive opinions held by two or more groups on substantive issues. As the world's third-largest democracy, Indonesia faces growing concerns about the interplay between political polarization and online toxicity, which is often directed at vulnerable minority groups. Despite the importance of this issue, previous NLP research has not fully explored the relationship between toxicity and polarization. To bridge this gap, we present a novel multi-label Indonesian dataset that incorporates toxicity, polarization, and annotator demographic information. Benchmarking this dataset using BERT-base models and large language models (LLMs) shows that polarization information enhances toxicity classification, and vice versa. Furthermore, providing demographic information significantly improves the performance of polarization classification.
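One simple way the reported finding could be exploited, letting polarization information condition a toxicity classifier, is sketched below. This is our illustration, not the paper's setup: the IndoBERT checkpoint name and the tag-prefix conditioning scheme are assumptions.

```python
# Minimal sketch (assumed setup): condition a toxicity classifier on a known
# polarization label by prefixing the text with a tag before encoding it with
# an Indonesian BERT-style model. The checkpoint name is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
encoder = AutoModel.from_pretrained("indobenchmark/indobert-base-p1")

class ToxicityWithPolarization(torch.nn.Module):
    """Toxicity classifier that sees the polarization label as extra input."""
    def __init__(self, encoder, hidden=768):
        super().__init__()
        self.encoder = encoder
        self.head = torch.nn.Linear(hidden, 2)  # toxic / non-toxic

    def forward(self, texts, polarized_flags):
        # Inject polarization as a textual prefix, one simple conditioning scheme.
        tagged = [f"[{'POLARIZED' if p else 'NON-POLARIZED'}] {t}"
                  for t, p in zip(texts, polarized_flags)]
        batch = tokenizer(tagged, padding=True, truncation=True, return_tensors="pt")
        hidden = self.encoder(**batch).last_hidden_state[:, 0]  # [CLS] embedding
        return self.head(hidden)

model = ToxicityWithPolarization(encoder)
logits = model(["contoh komentar"], polarized_flags=[True])
print(logits.shape)  # torch.Size([1, 2])
```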
Generative Artificial Intelligence for Academic Research: Evidence from Guidance Issued for Researchers by Higher Education Institutions in the United States
Ganguly, Amrita, Johri, Aditya, Ali, Areej, McDonald, Nora
Many Higher Education Institutions (HEIs) have released institutional guidance for researchers on the use of generative artificial intelligence (GenAI). To better understand the guidance that is being provided, we report findings from a thematic analysis of guidelines from thirty HEIs in the United States that are classified as R1, or "very high research activity." We found that the guidance provided to researchers: 1) asks them to refer to external sources of information, such as funding agencies and publishers, to keep updated, and to use institutional resources for training and education; 2) asks them to understand and learn about specific GenAI attributes that shape research, such as predictive modeling, knowledge cutoff dates, data provenance, and model limitations, and about ethical concerns such as authorship, attribution, privacy, and intellectual property; and 3) includes instructions on how to acknowledge sources and disclose the use of GenAI and how to communicate effectively about GenAI use, and alerts researchers to long-term implications such as overreliance on GenAI, legal consequences, and risks to their institutions. Overall, the guidance places the onus of compliance on individual researchers, making them accountable for any lapses and thereby increasing their responsibility.
Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals' Subjective Text Perceptions
Orlikowski, Matthias, Pei, Jiaxin, Röttger, Paul, Cimiano, Philipp, Jurgens, David, Hovy, Dirk
People naturally vary in their annotations for subjective questions and some of this variation is thought to be due to the person's sociodemographic characteristics. LLMs have also been used to label data, but recent work has shown that models perform poorly when prompted with sociodemographic attributes, suggesting limited inherent sociodemographic knowledge. Here, we ask whether LLMs can be trained to be accurate sociodemographic models of annotator variation. Using a curated dataset of five tasks with standardized sociodemographics, we show that models do improve in sociodemographic prompting when trained but that this performance gain is largely due to models learning annotator-specific behaviour rather than sociodemographic patterns. Across all tasks, our results suggest that models learn little meaningful connection between sociodemographics and annotation, raising doubts about the current use of LLMs for simulating sociodemographic variation and behaviour.
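Sociodemographic prompting, as evaluated here, amounts to asking the model to annotate as if it were a person with a given profile. The sketch below is a generic illustration rather than the paper's exact protocol: the attribute names, the rating scale, and the call_llm placeholder are our assumptions.

```python
# Minimal sketch of sociodemographic prompting: the prompt asks the model to
# answer as an annotator with a given profile. call_llm is a hypothetical
# placeholder for whatever chat API is used; attribute names are illustrative.

def sociodemographic_prompt(text, profile):
    attrs = ", ".join(f"{k}: {v}" for k, v in profile.items())
    return (
        f"You are annotating data as a person with this profile: {attrs}.\n"
        "Rate how offensive the following text is on a scale from 1 (not at "
        "all) to 5 (extremely). Answer with a single number.\n\n"
        f"Text: {text}"
    )

profile = {"age": "25-34", "gender": "woman", "education": "college degree"}
prompt = sociodemographic_prompt("Example post to be rated.", profile)
print(prompt)
# label = call_llm(prompt)  # hypothetical API call
```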
Transformers with Joint Tokens and Local-Global Attention for Efficient Human Pose Estimation
Kinfu, Kaleab A., Vidal, René
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have led to significant progress in 2D body pose estimation. However, achieving a good balance between accuracy, efficiency, and robustness remains a challenge. For instance, CNNs are computationally efficient but struggle with long-range dependencies, while ViTs excel in capturing such dependencies but suffer from quadratic computational complexity. This paper proposes two ViT-based models for accurate, efficient, and robust 2D pose estimation. The first one, EViTPose, operates in a computationally efficient manner without sacrificing accuracy by utilizing learnable joint tokens to select and process a subset of the most important body patches, enabling us to control the trade-off between accuracy and efficiency by changing the number of patches to be processed. The second one, UniTransPose, while not allowing for the same level of direct control over the trade-off, efficiently handles multiple scales by combining (1) an efficient multi-scale transformer encoder that uses both local and global attention with (2) an efficient sub-pixel CNN decoder for better speed and accuracy. Moreover, by incorporating all joints from different benchmarks into a unified skeletal representation, we train robust methods that learn from multiple datasets simultaneously and perform well across a range of scenarios -- including pose variations, lighting conditions, and occlusions. Experiments on six benchmarks demonstrate that the proposed methods significantly outperform state-of-the-art methods while improving computational efficiency. EViTPose exhibits a significant decrease in computational complexity (30% to 44% fewer GFLOPs) with a minimal drop in accuracy (0% to 3.5%), and UniTransPose achieves accuracy improvements ranging from 0.9% to 43.8% across these benchmarks.
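The joint-token-guided patch selection can be pictured at a high level: learnable joint tokens score the image patches, and only the top-k highest-scoring patches are processed further, with k controlling the accuracy/compute trade-off. The sketch below illustrates this general idea under assumed tensor shapes; it is not the authors' EViTPose implementation.

```python
# Minimal sketch of joint-token-guided patch selection (general idea only,
# not the EViTPose implementation): learnable joint tokens attend over patch
# embeddings and only the top-k most "interesting" patches are kept.
import torch

num_patches, dim, num_joints, k = 196, 256, 17, 64
patches = torch.randn(1, num_patches, dim)              # patch embeddings
joint_tokens = torch.nn.Parameter(torch.randn(1, num_joints, dim))

# Attention scores of each joint token over all patches.
scores = torch.softmax(joint_tokens @ patches.transpose(1, 2) / dim ** 0.5, dim=-1)
# Importance of a patch = strongest interest any joint takes in it.
importance = scores.max(dim=1).values                    # shape (1, num_patches)
topk = importance.topk(k, dim=-1).indices                # indices of kept patches
selected = patches.gather(1, topk.unsqueeze(-1).expand(-1, -1, dim))
print(selected.shape)  # torch.Size([1, 64, 256])
```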
Learner and Instructor Needs in AI-Supported Programming Learning Tools: Design Implications for Features and Adaptive Control
Wu, Zihan, Tang, Yicheng, Ericson, Barbara
AI-supported tools can help learners overcome challenges in programming education by providing adaptive assistance. However, existing research often focuses on individual tools rather than deriving broader design recommendations. A key challenge in designing these systems is balancing learner control with system-driven guidance. To explore user preferences for AI-supported programming learning tools, we conducted a participatory design study with 15 undergraduate novice programmers and 10 instructors to gather insights on their desired help features and control preferences, as well as a follow-up survey with 172 introductory programming students. Our qualitative findings show that learners prefer help that is encouraging, incorporates visual aids, and includes peer-related insights, whereas instructors prioritize scaffolding that reflects learners' progress and reinforces best practices. Both groups favor shared control, though learners generally prefer more autonomy, while instructors lean toward greater system guidance to prevent cognitive overload. Additionally, our interviews revealed individual differences in control preferences. Based on our findings, we propose design guidelines for AI-supported programming tools, particularly regarding user-centered help features and adaptive control mechanisms. Our work contributes to the human-centered design of AI-supported learning environments by informing the development of systems that effectively balance autonomy and guidance, enhancing AI-supported educational tools for programming and beyond.
Experiences with Content Development and Assessment Design in the Era of GenAI
Sharma, Aakanksha, Shailendra, Samar, Kadel, Rajan
Generative Artificial Intelligence (GenAI) has the potential to transform higher education by generating human-like content. Advances in GenAI have revolutionised several aspects of education, especially subject and assessment design. In this era, it is crucial to design assessments that challenge students and cannot be solved using GenAI tools, which makes it necessary to keep educational content up to date with rapidly evolving technology. Assessment plays a significant role in ensuring student learning, as it encourages students to engage actively, leading to the achievement of learning outcomes. This paper examines how effectively GenAI can design a subject, including lectures, labs, and assessments, using prompts and custom-based training, and aims to give educators direction on leveraging GenAI to create subject content. Additionally, we share our experiential learning to help educators develop content, highlighting the importance of prompts and fine-tuning in ensuring output quality. It has also been observed that expert evaluation is essential for assessing the quality of GenAI-generated materials throughout the content generation process.
MedSimAI: Simulation and Formative Feedback Generation to Enhance Deliberate Practice in Medical Education
Hicke, Yann, Geathers, Jadon, Rajashekar, Niroop, Chan, Colleen, Jack, Anyanate Gwendolyne, Sewell, Justin, Preston, Mackenzi, Cornes, Susannah, Shung, Dennis, Kizilcec, Rene
Medical education faces challenges in scalability, accessibility, and consistency, particularly in clinical skills training for physician-patient communication. Traditional simulation-based learning, while effective, is resource-intensive, difficult to schedule, and often highly variable in feedback quality. Through a collaboration between AI, learning science, and medical education experts, we co-developed MedSimAI, an AI-powered simulation platform that enables deliberate practice, self-regulated learning (SRL), and automated assessment through interactive patient encounters. Leveraging large language models (LLMs), MedSimAI generates realistic clinical interactions and provides immediate, structured feedback using established medical evaluation frameworks such as the Master Interview Rating Scale (MIRS). In a pilot study with 104 first-year medical students, we examined engagement, conversation patterns, and user perceptions. Students found MedSimAI beneficial for repeated, realistic patient-history practice. Conversation analysis revealed that certain higher-order skills were often overlooked, though students generally performed systematic histories and empathic listening. By integrating unlimited practice opportunities, real-time AI assessment, and SRL principles, MedSimAI addresses key limitations of traditional simulation-based training, making high-quality clinical education more accessible and scalable.
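The structured-feedback step can be pictured as prompting an LLM with a rubric and the encounter transcript. The sketch below is a hypothetical illustration, not the MedSimAI implementation: the rubric items are paraphrased MIRS-style criteria, and call_llm stands in for whatever chat-completion API is actually used.

```python
# Hypothetical sketch of rubric-based feedback generation (not the MedSimAI
# implementation). Rubric items are paraphrased, MIRS-style criteria; call_llm
# is a placeholder for an actual chat-completion API.

RUBRIC = [
    "Elicits the patient's chief concern with open-ended questions",
    "Explores the history of the present illness systematically",
    "Responds to the patient's emotions with empathic statements",
    "Summarizes and checks understanding before closing",
]

def feedback_prompt(transcript: str) -> str:
    items = "\n".join(f"- {item}" for item in RUBRIC)
    return (
        "You are a clinical communication coach. Rate the student on each "
        "rubric item from 1 (poor) to 5 (excellent) and give one concrete "
        "suggestion per item. Reply as a JSON list of "
        '{"item", "score", "suggestion"} objects.\n\n'
        f"Rubric:\n{items}\n\nTranscript:\n{transcript}"
    )

transcript = "Student: What brings you in today? Patient: Chest pain since..."
prompt = feedback_prompt(transcript)
print(prompt[:200])
# feedback = call_llm(prompt)  # hypothetical API call returning JSON text
```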