heuristic evaluation
Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage
Platt, Nolan, Luchs, Ethan, Nizamani, Sehrish
Usability evaluations are essential for ensuring that modern interfaces meet user needs, yet traditional heuristic evaluations by human experts can be time-consuming and subjective, especially early in development. This paper investigates whether large language models (LLMs) can provide reliable and consistent heuristic assessments at the development stage. By applying Jakob Nielsen's ten usability heuristics to thirty open-source websites, we generated over 850 heuristic evaluations in three independent evaluations per site using a pipeline of OpenAI's GPT-4o. For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen's Kappa of 0.50 and an exact agreement of 84%. Severity judgments showed more variability: weighted Cohen's Kappa averaged 0.63, but exact agreement was just 56%, and Krippendorff's Alpha was near zero. These results suggest that while GPT-4o can produce internally consistent evaluations, especially for identifying the presence of usability issues, its ability to judge severity varies and requires human oversight in practice. Our findings highlight the feasibility and limitations of using LLMs for early-stage, automated usability testing, and offer a foundation for improving consistency in automated User Experience (UX) evaluation. To the best of our knowledge, our work provides one of the first quantitative inter-rater reliability analyses of automated heuristic evaluation and highlights methods for improving model consistency.
Synthetic Heuristic Evaluation: A Comparison between AI- and Human-Powered Usability Evaluation
Zhong, Ruican, McDonald, David W., Hsieh, Gary
Usability evaluation is crucial in human-centered design but can be costly, requiring expert time and user compensation. In this work, we developed a method for synthetic heuristic evaluation using multimodal LLMs' ability to analyze images and provide design feedback. Comparing our synthetic evaluations to those by experienced UX practitioners across two apps, we found our evaluation identified 73% and 77% of usability issues, which exceeded the performance of 5 experienced human evaluators (57% and 63%). Compared to human evaluators, the synthetic evaluation's performance maintained consistent performance across tasks and excelled in detecting layout issues, highlighting potential attentional and perceptual strengths of synthetic evaluation. However, synthetic evaluation struggled with recognizing some UI components and design conventions, as well as identifying across screen violations. Additionally, testing synthetic evaluations over time and accounts revealed stable performance. Overall, our work highlights the performance differences between human and LLM-driven evaluations, informing the design of synthetic heuristic evaluations.
Generative Students: Using LLM-Simulated Student Profiles to Support Question Item Evaluation
Evaluating the quality of automatically generated question items has been a long standing challenge. In this paper, we leverage LLMs to simulate student profiles and generate responses to multiple-choice questions (MCQs). The generative students' responses to MCQs can further support question item evaluation. We propose Generative Students, a prompt architecture designed based on the KLI framework. A generative student profile is a function of the list of knowledge components the student has mastered, has confusion about or has no evidence of knowledge of. We instantiate the Generative Students concept on the subject domain of heuristic evaluation. We created 45 generative students using GPT-4 and had them respond to 20 MCQs. We found that the generative students produced logical and believable responses that were aligned with their profiles. We then compared the generative students' responses to real students' responses on the same set of MCQs and found a high correlation. Moreover, there was considerable overlap in the difficult questions identified by generative students and real students. A subsequent case study demonstrated that an instructor could improve question quality based on the signals provided by Generative Students.
UX Heuristics and Checklist for Deep Learning powered Mobile Applications with Image Classification
von Wangenheim, Christiane Gresse, Dirschnabel, Gustavo
Advances in mobile applications providing image classification enabled by Deep Learning require innovative User Experience solutions in order to assure their adequate use by users. To aid the design process, usability heuristics are typically customized for a specific kind of application. Therefore, based on a literature review and analyzing existing mobile applications with image classification, we propose an initial set of AIX heuristics for Deep Learning powered mobile applications with image classification decomposed into a checklist. In order to facilitate the usage of the checklist we also developed an online course presenting the concepts and heuristics as well as a web-based tool in order to support an evaluation using these heuristics. These results of this research can be used to guide the design of the interfaces of such applications as well as support the conduction of heuristic evaluations supporting practitioners to develop image classification apps that people can understand, trust, and can engage with effectively.
Boosting Search Guidance in Problems with Semantic Attachments
Bernardini, Sara (Royal Holloway, University of London) | Fox, Maria (King's College London) | Long, Derek (King's College London) | Piacentini, Chiara (University of Toronto)
Most applications of planning to real problems involve complex and often non-linear equations, including matrix operations. PDDL is ill-suited to express such calculations since it only allows basic operations between numeric fluents. To remedy this restriction, a generic PDDL planner can be connected to a specialised advisor, which equips the planner with the ability to carry out sophisticated mathematical operations. Unlike related techniques based on semantic attachment, our planner is able to exploit an approximation of the numeric information calculated by the advisor to compute informative heuristic estimators. Guided by both causal and numeric information, our planning framework outperforms traditional approaches, especially against problems with numeric goals. We provide evidence of the power of our solution by successfully solving four completely different problems.
Planning with Numeric Timed Initial Fluents
Piacentini, Chiara (King's College London) | Fox, Maria (King's College London) | Long, Derek (King's College London)
Numeric Timed Initial Fluents represent a new feature in PDDL that extends the concept of Timed Initial Literals to numeric fluents. They are particularly useful to model independent functions that change through time and influence the actions to be applied. Although they are very useful to model real world problems, they are not systematically defined in the family of PDDL languages and they are not implemented in any generic PDDL planner, except for POPF2 and UPMurphi. In this paper we present an extension of the planner POPF2 (POPF-TIF) to handle problems with numeric Timed Initial Fluents. We propose and evaluate two contributions: the first is based on improvements of the heuristic evaluation, while the second considers alternative search algorithms based on a mixture of Enforced Hill Climbing and Best First Search.
Improving Delete Relaxation Heuristics Through Explicitly Represented Conjunctions
Keyder, E., Hoffmann, J., Haslum, P.
Heuristic functions based on the delete relaxation compute upper and lower bounds on the optimal delete-relaxation heuristic h+, and are of paramount importance in both optimal and satisficing planning. Here we introduce a principled and flexible technique for improving h+, by augmenting delete-relaxed planning tasks with a limited amount of delete information. This is done by introducing special fluents that explicitly represent conjunctions of fluents in the original planning task, rendering h+ the perfect heuristic h* in the limit. Previous work has introduced a method in which the growth of the task is potentially exponential in the number of conjunctions introduced. We formulate an alternative technique relying on conditional effects, limiting the growth of the task to be linear in this number. We show that this method still renders h+ the perfect heuristic h* in the limit. We propose techniques to find an informative set of conjunctions to be introduced in different settings, and analyze and extend existing methods for lower-bounding and upper-bounding h+ in the presence of conditional effects. We evaluate the resulting heuristic functions empirically on a set of IPC benchmarks, and show that they are sometimes much more informative than standard delete-relaxation heuristics.
Monte Carlo Tree Search with Heuristic Evaluations using Implicit Minimax Backups
Lanctot, Marc, Winands, Mark H. M., Pepels, Tom, Sturtevant, Nathan R.
Monte Carlo Tree Search (MCTS) has improved the performance of game engines in domains such as Go, Hex, and general game playing. MCTS has been shown to outperform classic alpha-beta search in games where good heuristic evaluations are difficult to obtain. In recent years, combining ideas from traditional minimax search in MCTS has been shown to be advantageous in some domains, such as Lines of Action, Amazons, and Breakthrough. In this paper, we propose a new way to use heuristic evaluations to guide the MCTS search by storing the two sources of information, estimated win rates and heuristic evaluations, separately. Rather than using the heuristic evaluations to replace the playouts, our technique backs them up implicitly during the MCTS simulations. These minimax values are then used to guide future simulations. We show that using implicit minimax backups leads to stronger play performance in Kalah, Breakthrough, and Lines of Action.
Heuristic Evaluation Based on Lifted Relaxed Planning Graphs
Ridder, Bram (King's College London) | Fox, Maria (King's College London)
In previous work we have shown that grounding, while used by most (if not all) modern state-of-the-art planners, is not necessary and is sometimes even undesirable. In this paper we extend this work and present a novel forward-chaining planner that does not require grounding and can solve problem instances that are too large for current planners to handle. We achieve this by exploiting equivalence relationships between objects whist constructing a lifted version of the relaxed planning graph (RPG) and extracting a relaxed plan. We compare our planner to FF and show that our approach consumes far less memory whist still being competitive. In addition we show that by not having to ground the domain we can solve much larger problem instances.
Defeasible Decisions: What the Proposal is and isn't
In two recent papers, I have proposed a description of decision analysis that differs from the Bayesian picture painted by Savage, Jeffrey and other classic authors. Response to this view has been either overly enthusiastic or unduly pessimistic. In this paper I try to place the idea in its proper place, which must be somewhere in between. Looking at decision analysis as defeasible reasoning produces a framework in which planning and decision theory can be integrated, but work on the details has barely begun. It also produces a framework in which the meta-decision regress can be stopped in a reasonable way, but it does not allow us to ignore meta-level decisions. The heuristics for producing arguments that I have presented are only supposed to be suggestive; but they are not open to the egregious errors about which some have worried. And though the idea is familiar to those who have studied heuristic search, it is somewhat richer because the control of dialectic is more interesting than the deepening of search.