test session


Configurable multi-agent framework for scalable and realistic testing of LLM-based agents

Wang, Sai, Subramanian, Senthilnathan, Sahni, Mudit, Gone, Praneeth, Meng, Lingjie, Wang, Xiaochen, Bertoli, Nicolas Ferradas, Cheng, Tingxian, Xu, Jun

arXiv.org Artificial Intelligence

Large-language-model (LLM) agents exhibit complex, context-sensitive behaviour that quickly renders static benchmarks and ad-hoc manual testing obsolete. We present Neo, a configurable, multi-agent framework that automates realistic, multi-turn evaluation of LLM-based systems. Neo couples a Question Generation Agent and an Evaluation Agent through a shared context hub, allowing domain prompts, scenario controls and dynamic feedback to be composed modularly. Test inputs are sampled from a probabilistic state model spanning dialogue flow, user intent and emotional tone, enabling diverse, human-like conversations that adapt after every turn. Applied to a production-grade Seller Financial Assistant chatbot, Neo (i) uncovered edge-case failures across five attack categories with a 3.3% break rate, close to the 5.8% achieved by expert human red-teamers, and (ii) delivered 10-12x higher throughput, generating 180 coherent test questions in about 45 minutes versus 16 hours of human effort. Beyond security probing, Neo's stochastic policies balanced topic coverage and conversational depth, yielding broader behavioural exploration than manually crafted scripts. Neo therefore lays a foundation for scalable, self-evolving LLM QA: its agent interfaces, state controller and feedback loops are model-agnostic and extensible to richer factual-grounding and policy-compliance checks. We release the framework to facilitate reproducible, high-fidelity testing of emerging agentic systems.
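The per-turn sampling described above can be sketched as a small probabilistic state model. This is an illustrative reconstruction, not Neo's actual implementation: all state names, intents, tones and transition weights below are hypothetical.

```python
import random

# Hypothetical sketch of sampling test inputs from a probabilistic state
# model over dialogue flow, user intent and emotional tone. States and
# weights are invented for illustration; Neo's real model is richer.

FLOW = {  # dialogue state -> {next state: transition probability}
    "greet": {"ask": 0.7, "probe": 0.3},
    "ask":   {"probe": 0.4, "ask": 0.3, "end": 0.3},
    "probe": {"ask": 0.5, "end": 0.5},
}
INTENTS = ["account_info", "loan_terms", "fee_dispute"]
TONES = {"neutral": 0.6, "frustrated": 0.3, "urgent": 0.1}

def sample_dialogue(max_turns=10, rng=random):
    """Walk the flow model, emitting one (state, intent, tone) per turn."""
    state, turns = "greet", []
    while state != "end" and len(turns) < max_turns:
        tone = rng.choices(list(TONES), weights=TONES.values())[0]
        turns.append((state, rng.choice(INTENTS), tone))
        nxt = FLOW[state]
        state = rng.choices(list(nxt), weights=nxt.values())[0]
    return turns

turns = sample_dialogue()
```

Each sampled (state, intent, tone) triple would then condition the Question Generation Agent's next test question, so consecutive runs explore different conversational trajectories.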


AI-assisted Gaze Detection for Proctoring Online Exams

Shih, Yong-Siang, Zhao, Zach, Niu, Chenhao, Iberg, Bruce, Sharpnack, James, Baig, Mirza Basim

arXiv.org Artificial Intelligence

For high-stakes online exams, it is important to detect potential rule violations to ensure the security of the test. In this study, we investigate the task of detecting whether test takers are looking away from the screen, as such behavior could indicate that the test taker is consulting external resources. In asynchronous proctoring, exam videos are recorded and reviewed by proctors. However, when exams are long, it is tedious for proctors to watch entire videos to determine the exact moments when test takers look away. We present an AI-assisted gaze detection system that allows proctors to navigate between different video frames and discover frames where the test taker is looking in similar directions. The system enables proctors to work more effectively to identify suspicious moments in videos. An evaluation framework is proposed to compare the system against human-only and ML-only proctoring, and a user study is conducted to gather feedback from proctors, aiming to demonstrate the effectiveness of the system.
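The "frames where the test taker is looking in similar directions" navigation could be implemented as nearest-neighbor search over per-frame gaze vectors. A minimal sketch, assuming a gaze-estimation model has already produced a 2D direction vector per frame (the vectors and function names here are hypothetical):

```python
import math

# Hypothetical sketch: rank video frames by cosine similarity of their
# estimated gaze-direction vectors to a proctor-selected query frame.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similar_frames(frames, query_idx, top_k=3):
    """Return indices of frames whose gaze direction best matches the query."""
    q = frames[query_idx]
    scored = [(i, cosine(q, v)) for i, v in enumerate(frames) if i != query_idx]
    scored.sort(key=lambda s: s[1], reverse=True)
    return [i for i, _ in scored[:top_k]]

# Toy gaze vectors: frames 0 and 2 look left; frames 1 and 3 face the screen.
frames = [(-1.0, 0.1), (0.0, 1.0), (-0.9, 0.2), (0.1, 0.9)]
```

Clicking a suspicious frame would then jump the proctor straight to other moments with a matching gaze direction, instead of scrubbing through the whole recording.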


Evaluating the capability of large language models to personalize science texts for diverse middle-school-age learners

Vaccaro, Michael Jr, Friday, Mikayla, Zaghi, Arash

arXiv.org Artificial Intelligence

Large language models (LLMs), including OpenAI's GPT-series, have made significant advancements in recent years. Known for their expertise across diverse subject areas and quick adaptability to user-provided prompts, LLMs hold unique potential as Personalized Learning (PL) tools. Despite this potential, their application in K-12 education remains largely unexplored. This paper presents one of the first randomized controlled trials (n = 23) to evaluate the effectiveness of GPT-4 in personalizing educational science texts for middle school students. In this study, GPT-4 was used to profile student learning preferences based on choices made during a training session. For the experimental group, GPT-4 rewrote science texts to align with each student's predicted profile, while texts for the control group were rewritten to contradict their learning preferences. The results of a Mann-Whitney U test showed that students significantly preferred (at the .10 level) the rewritten texts when they were aligned with their profile (p = .059). These findings suggest that GPT-4 can effectively interpret and tailor educational content to diverse learner preferences, marking a significant advancement in PL technology. The limitations of this study and ethical considerations for using artificial intelligence in education are also discussed.

Keywords: Large Language Models (LLMs), GPT-4, Personalized Learning, AI Generated Content (AIGC), Randomized Controlled Trial (RCT), K-12 Education

1 Introduction

In 2008, the National Academy of Engineering named advancements in Personalized Learning (PL) one of the fourteen grand challenges for the twenty-first century (National Academy of Engineering, 2008). Since then, PL has emerged as a prominent area of education research.
Through this work, PL has evolved into a broad term which now encompasses a vast number of interventions and programs (Shemshack and Spector, 2020; Walkington and Bernacki, 2020). The work presented in this paper aims to build on this existing research by investigating the potential of novel Large Language Models (LLMs) to foster highly adaptive PL environments.
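The study's one-sided Mann-Whitney U comparison can be reproduced in a few lines. The sketch below uses a normal approximation with midranks for ties and no tie correction; the ratings are fabricated for illustration, and only the test procedure mirrors the paper's analysis.

```python
import math

# Pure-Python Mann-Whitney U test (one-sided, normal approximation).
# Ties receive midranks; no tie correction is applied, so p-values with
# heavy ties are approximate.

def mann_whitney_u(x, y):
    combined = sorted((v, g) for g, vals in ((0, x), (1, y)) for v in vals)
    vals = [v for v, _ in combined]
    ranks, i = {}, 0
    while i < len(vals):  # assign midranks to tied blocks
        j = i
        while j < len(vals) and vals[j] == vals[i]:
            j += 1
        mid = (i + j + 1) / 2  # average of 1-based ranks i+1..j
        for k in range(i, j):
            ranks[k] = mid
        i = j
    r_x = sum(ranks[k] for k, (_, g) in enumerate(combined) if g == 0)
    n1, n2 = len(x), len(y)
    u = r_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p_one_sided = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # P(U > u)
    return u, p_one_sided

aligned      = [5, 4, 4, 5, 3, 4, 5, 4]  # fabricated preference ratings
contradicted = [3, 2, 4, 3, 2, 3, 3, 4]
u, p = mann_whitney_u(aligned, contradicted)
```

With n = 23 and a .10 significance threshold, as in the study, a one-sided p of .059 would count as significant under this test.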


Human-in-the-Loop AI for Cheating Ring Detection

Shih, Yong-Siang, Liao, Manqian, Liu, Ruidong, Baig, Mirza Basim

arXiv.org Artificial Intelligence

Online exams have become popular in recent years due to their accessibility. However, concerns have been raised about the security of online exams, particularly regarding professional cheating services that help malicious test takers pass, forming so-called "cheating rings". In this paper, we introduce a human-in-the-loop AI cheating-ring detection system designed to detect and deter these rings. We outline the underlying logic of this human-in-the-loop AI system, exploring the design principles tailored to its objective of detecting cheaters. Moreover, we illustrate the methodologies used to evaluate its performance and fairness, aiming to mitigate the unintended risks associated with the AI system. The design and development of the system adhere to Responsible AI (RAI) standards, ensuring that ethical considerations are integrated throughout the entire development process.


Dynamic In-Context Learning from Nearest Neighbors for Bundle Generation

Sun, Zhu, Feng, Kaidong, Yang, Jie, Qu, Xinghua, Fang, Hui, Ong, Yew-Soon, Liu, Wenyuan

arXiv.org Artificial Intelligence

Product bundling has evolved into a crucial marketing strategy in e-commerce. However, current studies are limited to generating (1) fixed-size or single bundles, and most importantly, (2) bundles that do not reflect consistent user intents, thus being less intelligible or useful to users. This paper explores two interrelated tasks, i.e., personalized bundle generation and the underlying intent inference based on users' interactions in a session, leveraging the logical reasoning capability of large language models. We introduce a dynamic in-context learning paradigm, which enables ChatGPT to seek tailored and dynamic lessons from closely related sessions as demonstrations while performing tasks in the target session. Specifically, it first harnesses retrieval augmented generation to identify nearest neighbor sessions for each target session. Then, proper prompts are designed to guide ChatGPT to perform the two tasks on neighbor sessions. To enhance reliability and mitigate the hallucination issue, we develop (1) a self-correction strategy to foster mutual improvement in both tasks without supervision signals; and (2) an auto-feedback mechanism to recurrently offer dynamic supervision based on the distinct mistakes made by ChatGPT on various neighbor sessions. Thus, the target session can receive customized and dynamic lessons for improved performance by observing the demonstrations of its neighbor sessions. Finally, experimental results on three real-world datasets verify the effectiveness of our methods on both tasks. Additionally, the inferred intents can prove beneficial for other intriguing downstream tasks, such as crafting appealing bundle names.
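The retrieval step of this dynamic in-context learning paradigm can be sketched as nearest-neighbor search over session embeddings, with the retrieved sessions supplying demonstrations for the prompt. Everything below is hypothetical: the catalog, the bag-of-items embedding, and the prompt template stand in for the paper's retrieval-augmented setup.

```python
import math

# Illustrative retrieval for dynamic in-context learning: embed each
# session (here, a toy bag-of-items vector), find the k nearest neighbor
# sessions, and splice them into the prompt as demonstrations.

CATALOG = ["tent", "stove", "lantern", "novel", "bookmark", "mug"]

def embed(session):
    return [1.0 if item in session else 0.0 for item in CATALOG]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_sessions(target, pool, k=2):
    """Return the k pool sessions most similar to the target session."""
    t = embed(target)
    return sorted(pool, key=lambda s: cosine(t, embed(s)), reverse=True)[:k]

pool = [["tent", "stove"], ["novel", "bookmark"], ["stove", "lantern"], ["mug"]]
neighbors = nearest_sessions(["tent", "lantern"], pool)
prompt = ("Demonstrations:\n"
          + "\n".join(str(s) for s in neighbors)
          + "\nTarget session: ['tent', 'lantern'] -> infer intent and bundle:")
```

In the paper's full pipeline, the LLM first solves the two tasks on these neighbor sessions, and its corrected outputs (via self-correction and auto-feedback) become the demonstrations for the target session.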