 Education


Exploring the Utilities of the Rationales from Large Language Models to Enhance Automated Essay Scoring

arXiv.org Artificial Intelligence

Hong Jiao (University of Maryland, College Park), Hanna Choi (University of Maryland, College Park), Haowei Hua (Princeton University)

Abstract: This study explored the utilities of rationales generated by GPT-4.1 and GPT-5 in automated scoring using Prompt 6 essays from the 2012 Kaggle ASAP data. Essay-based scoring was compared with rationale-based scoring. In general, essay-based scoring performed better than rationale-based scoring, with higher Quadratic Weighted Kappa (QWK). However, rationale-based scoring led to higher scoring accuracy in terms of F1 scores for score 0, which was underrepresented because of class imbalance. The ensemble modeling of essay-based scoring models increased the scoring accuracy at both specific score levels and across all score levels. The ensemble modeling of essay-based scoring and each of the rationale-based scoring models performed about the same. A further ensemble of essay-based scoring and both rationale-based scoring models yielded the best scoring accuracy, with a QWK of 0.870 compared with 0.848 reported in the literature.

Introduction: Automated essay scoring methodology develops along with advances in AI technology. From the early supervised machine learning models based on engineered features (e.g., Mahana et al., 2012) to the recent use of large language models (LLMs), the methods for automated essay scoring, as shown in Appendix A, have evolved with advances in machine learning, deep learning, language models, and LLMs. Using automated scoring of Prompt 6 in the Automated Student Assessment Prize (ASAP) dataset from Kaggle, this study intends to explore the utility of rationales generated by LLMs in enhancing automated essay scoring. For ASAP Prompt 6, automated scoring models have been developed since 2012, after the Kaggle competition.
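The Quadratic Weighted Kappa reported above measures agreement between two sets of integer scores, penalizing each disagreement by the squared distance between the scores. The following is a minimal sketch of the standard formula, assuming integer scores within a known range; the function name, the score range, and the sample data are illustrative and do not come from the paper.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Standard QWK: 1 - (weighted observed disagreement / weighted expected disagreement)."""
    n = max_rating - min_rating + 1
    total = len(y_true)

    # Confusion matrix of observed (true, predicted) score pairs.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    # Marginal histograms of the two raters.
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2      # quadratic penalty
            expected = hist_true[i] * hist_pred[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # If every score is identical, the expected disagreement is zero.
    if denominator == 0:
        return 1.0
    return 1.0 - numerator / denominator


# Hypothetical human and model scores on a 0-3 scale.
human = [0, 1, 2, 3]
print(quadratic_weighted_kappa(human, human, 0, 3))         # perfect agreement
print(quadratic_weighted_kappa(human, [3, 2, 1, 0], 0, 3))  # fully reversed scores
```

Because the penalty grows with the square of the score difference, a prediction that is off by two levels costs four times as much as one that is off by one level, which is why QWK is commonly used for ordinal scoring tasks.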


Characterizing Selective Refusal Bias in Large Language Models

arXiv.org Artificial Intelligence

Safety guardrails in large language models (LLMs) are developed to prevent malicious users from generating toxic content at a large scale. However, these measures can inadvertently introduce or reflect new biases, as LLMs may refuse to generate harmful content targeting some demographic groups and not others. We explore this selective refusal bias in LLM guardrails through the lens of refusal rates of targeted individual and intersectional demographic groups, types of LLM responses, and length of generated refusals. Our results show evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes. This leads us to investigate additional safety implications via an indirect attack, where we target previously refused groups. Our findings emphasize the need for more equitable and robust performance in safety guardrails across demographic groups.


Detecting Data Contamination in LLMs via In-Context Learning

arXiv.org Artificial Intelligence

We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
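One plausible reading of this idea, assuming per-sample confidence is available as a log-likelihood measured with and without in-context examples drawn from the same dataset, is to report the fraction of samples whose confidence drops once context is added. The function name, the inputs, and this particular scoring rule are assumptions for illustration; the abstract does not specify how the CoDeC score is defined.

```python
def context_drop_fraction(baseline_loglik, in_context_loglik):
    """Fraction of samples whose confidence decreases when in-context examples are added.

    Hypothetical scoring rule: a high value suggests the added context disrupts
    memorized patterns, while a low value suggests ordinary in-context learning.
    """
    if len(baseline_loglik) != len(in_context_loglik):
        raise ValueError("Both inputs must cover the same samples.")
    if not baseline_loglik:
        raise ValueError("At least one sample is required.")

    drops = sum(
        1 for base, ctx in zip(baseline_loglik, in_context_loglik) if ctx < base
    )
    return drops / len(baseline_loglik)


# Made-up log-likelihoods for two datasets (illustrative values only).
unseen_base, unseen_ctx = [-3.0, -2.5, -4.0, -3.2], [-2.1, -2.0, -3.1, -3.3]
seen_base, seen_ctx = [-0.4, -0.3, -0.5, -0.2], [-0.9, -0.6, -0.4, -0.7]

print(context_drop_fraction(unseen_base, unseen_ctx))  # context mostly helps
print(context_drop_fraction(seen_base, seen_ctx))      # context mostly hurts
```

In this toy example the second dataset yields the higher fraction, which matches the pattern the abstract describes for data seen during training.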


AIOT based Smart Education System: A Dual Layer Authentication and Context-Aware Tutoring Framework for Learning Environments

arXiv.org Artificial Intelligence

The AIoT-Based Smart Education System integrates Artificial Intelligence and IoT to address persistent challenges in contemporary classrooms: attendance fraud, lack of personalization, student disengagement, and inefficient resource use. The unified platform combines four core modules: (1) a dual-factor authentication system leveraging RFID-based ID scans and WiFi verification for secure, fraud-resistant attendance; (2) an AI-powered assistant that provides real-time, context-aware support and dynamic quiz generation based on instructor-supplied materials; (3) automated test generators to streamline adaptive assessment and reduce administrative overhead; and (4) the EcoSmart Campus module, which autonomously regulates classroom lighting, air quality, and temperature using IoT sensors and actuators. Simulated evaluations demonstrate the system's effectiveness in delivering robust real-time monitoring, fostering inclusive engagement, preventing fraudulent practices, and supporting operational scalability. Collectively, the AIoT-Based Smart Education System offers a secure, adaptive, and efficient learning environment, providing a scalable blueprint for future educational innovation and improved student outcomes through the synergistic application of artificial intelligence and IoT technologies.


Semantic Frame Aggregation-based Transformer for Live Video Comment Generation

arXiv.org Artificial Intelligence

Abstract--Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect of prioritizing video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to ongoing viewer conversation. It employs an efficient weighted sum of frames technique to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder uses a cross-attention mechanism to attend to each modality, ensuring that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large-scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.

Live commenting on videos has become a popular feature in live streaming platforms such as Twitch, YouTube, Bilibili, Facebook and Instagram. Also known as "bullet screen" or "danmaku", it offers a dynamic and interactive experience, promoting engagement and conversations among viewers [1]-[3]. In contrast to traditional video comments, which neither reference specific moments in the video nor interact with one another, danmaku comments enable rich multimodal information interactions [4].

Yu is with the Graduate School of Advanced Science and Engineering at Hiroshima University.
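The weighted sum of frames mentioned in the abstract above can be illustrated with a small sketch. It assumes each frame and the ongoing chat are already embedded as vectors in a shared space, and it uses a softmax over cosine similarity as the weighting rule. That rule, the `temperature` parameter, and all names here are assumptions for illustration, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def aggregate_frames(frame_embeddings, chat_embedding, temperature=1.0):
    """Weight frames by similarity to the chat, then return (weights, pooled vector)."""
    sims = [cosine(f, chat_embedding) / temperature for f in frame_embeddings]

    # Numerically stable softmax over the similarity scores.
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of frame embeddings, one dimension at a time.
    dim = len(frame_embeddings[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frame_embeddings)) for d in range(dim)
    ]
    return weights, pooled


# Toy 2-D embeddings: the first frame points the same way as the chat.
frames = [[1.0, 0.0], [0.0, 1.0]]
chat = [1.0, 0.0]
weights, pooled = aggregate_frames(frames, chat)
print(weights, pooled)
```

A lower `temperature` concentrates the weight on the most relevant frames, while a higher one moves the result toward a plain average.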


Predicting Household Water Consumption Using Satellite and Street View Images in Two Indian Cities

arXiv.org Artificial Intelligence

Monitoring household water use in rapidly urbanizing regions is hampered by costly, time-intensive enumeration methods and surveys. We investigate whether publicly available imagery (satellite tiles, Google Street View (GSV) segmentation) and simple geospatial covariates (nightlight intensity, population density) can be utilized to predict household water consumption in Hubballi-Dharwad, India. We compare four approaches: survey features (benchmark), CNN embeddings (satellite, GSV, combined), and GSV semantic maps with auxiliary data. Under an ordinal classification framework, GSV segmentation plus remote-sensing covariates achieves 0.55 accuracy for water use, approaching survey-based models (0.59 accuracy). Error analysis shows high precision at the extremes of the household water consumption distribution, but confusion among the middle classes due to overlapping visual proxies. We also compare and contrast our estimates for household water consumption with those for household subjective income. Our findings demonstrate that open-access imagery, coupled with minimal geospatial data, offers a promising alternative to obtaining reliable household water consumption estimates using surveys in urban analytics.


Leveraging Foundation Models for Enhancing Robot Perception and Action

arXiv.org Artificial Intelligence

This thesis investigates how foundation models can be systematically leveraged to enhance robotic capabilities, enabling more effective localization, interaction, and manipulation in unstructured environments. The work is structured around four core lines of inquiry, each addressing a fundamental challenge in robotics while collectively contributing to a cohesive framework for semantics-aware robotic intelligence.


See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement

arXiv.org Artificial Intelligence

Abstract--Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking face. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.

Audio-driven talking face generation aims to animate a target portrait image to create realistic talking videos given a driving audio speech. This technique finds wide application in various practical scenarios, including high-quality film and animation production, virtual assistants, interactive educational content creation, and realistic character animation. Recently, significant advancements have been made in this field with the development of generative models. Existing talking face generation methods mainly focus on creating animated videos from a reference portrait [1]-[5]. Still, there is a dilemma: users are concerned about privacy breaches when using real portrait images [6]. FaceChain [6] made the first attempt to liberate the source face and directly infer the synchronized portrait using disentangled identity features from speech. However, the generated virtual face fails to preserve identity consistency.


Offline Clustering of Preference Learning with Active-data Augmentation

arXiv.org Artificial Intelligence

Preference learning from pairwise feedback is a widely adopted framework in applications such as reinforcement learning with human feedback and recommendations. In many practical settings, however, user interactions are limited or costly, making offline preference learning necessary. Moreover, real-world preference learning often involves users with different preferences. For example, annotators from different backgrounds may rank the same responses differently. This setting presents two central challenges: (1) identifying similarity across users to effectively aggregate data, especially under scenarios where offline data is imbalanced across dimensions, and (2) handling the imbalanced offline data where some preference dimensions are underrepresented. To address these challenges, we study the Offline Clustering of Preference Learning problem, where the learner has access to fixed datasets from multiple users with potentially different preferences and aims to maximize utility for a test user. To tackle the first challenge, we first propose Off-C$^2$PL for the pure offline setting, where the learner relies solely on offline data. Our theoretical analysis provides a suboptimality bound that explicitly captures the tradeoff between sample noise and bias. To address the second challenge of imbalanced data, we extend our framework to the setting with active-data augmentation, where the learner is allowed to select a limited number of additional active-data samples for the test user based on the cluster structure learned by Off-C$^2$PL. In this setting, our second algorithm, A$^2$-Off-C$^2$PL, actively selects samples that target the least-informative dimensions of the test user's preference. We prove that these actively collected samples contribute more effectively than offline ones. Finally, we validate our theoretical results through simulations on synthetic and real-world datasets.


Eliciting Secret Knowledge from Language Models

arXiv.org Artificial Intelligence

Model Organisms (MOs) research involves intentionally training models to exhibit specific failure modes, to serve as a testbed for study and development of mitigations (Hubinger et al., 2024; Denison et al., 2024; Marks et al., 2025). Prior work has introduced several types of MOs, including models that conceal capabilities unless a specific trigger is present in the input (Greenblatt et al., 2024b; van der Weij et al., 2025), fake alignment to evade safety measures (Greenblatt et al., 2024a), and display broad misalignment after being fine-tuned on a narrow distribution of harmful data (Betley et al., 2025). The secret-keeping models trained in this work represent a novel class of MOs that refrain from revealing that they have certain factual knowledge.

Auditing Language Models: Our work contributes to the growing field of alignment auditing, which aims to systematically investigate whether a model pursues undesired or hidden objectives, rather than merely evaluating its surface-level behavior (Casper et al., 2024). A central methodology for validating such audits is to construct a testbed with a known ground truth, a principle applied in prior work (Schwettmann et al., 2023; Rager et al., 2025).