content validity
Human and AI Trust: Trust Attitude Measurement Instrument
With the current progress of Artificial Intelligence (AI) technology and its increasingly broad applications, trust is seen as a required criterion for AI usage, acceptance, and deployment. A robust measurement instrument is essential to correctly evaluate trust from a human-centered perspective. This paper describes the development and validation process of a trust measurement instrument that follows psychometric principles and consists of a 16-item trust scale. The instrument was built explicitly for research in human-AI interaction, to measure trust attitudes towards AI systems from a layperson's (non-expert) perspective. The use case for developing the scale was AI medical support systems (specifically cancer/health prediction). The scale development (Measurement Item Development) and validation (Measurement Item Evaluation) involved six research stages: item development, item evaluation, survey administration, test of dimensionality, test of reliability, and test of validity. The results of the six-stage evaluation show that the proposed trust measurement instrument is empirically reliable and valid for systematically measuring and comparing non-experts' trust in AI medical support systems.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Asia > China (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (4 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)
- Overview (1.00)
- Information Technology (1.00)
- Health & Medicine > Health Care Providers & Services (1.00)
- Education (1.00)
- (5 more...)
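The "test of reliability" stage mentioned in the abstract above is typically reported with an internal-consistency statistic such as Cronbach's alpha. A minimal sketch of how alpha is computed for a multi-item Likert scale — the formula is the standard one, not the authors' code, and the ratings below are invented for demonstration:

```python
# Cronbach's alpha for internal-consistency reliability of a Likert scale.
# Illustrative sketch only; the data are made up for demonstration.

def cronbach_alpha(items):
    """items: one inner list of respondent scores per scale item."""
    k = len(items)                      # number of items
    n = len(items[0])                   # number of respondents
    def var(xs):                        # sample variance (ddof=1)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = sum(var(col) for col in items)
    totals = [sum(items[i][r] for i in range(k)) for r in range(n)]
    return (k / (k - 1)) * (1 - item_vars / var(totals))

# Three hypothetical 5-point Likert items answered by five respondents:
scores = [
    [4, 5, 3, 4, 2],
    [4, 4, 3, 5, 2],
    [5, 5, 2, 4, 3],
]
alpha = cronbach_alpha(scores)   # ~0.886 for this toy data
```

Values above roughly 0.7-0.8 are conventionally read as acceptable internal consistency for a new scale.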
Comparing Human Expertise and Large Language Models Embeddings in Content Validity Assessment of Personality Tests
Milano, Nicola, Ponticorvo, Michela, Marocco, Davide
In this article we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio (CVR) to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of the human and AI approaches. Human validators excelled at aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. We highlight the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.
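The embedding-based item-construct mapping described above can be sketched as a nearest-neighbor assignment by cosine similarity: each item embedding is assigned to the construct whose label embedding it is closest to. The 3-dimensional vectors and construct names below are invented placeholders standing in for real LLM embeddings of Big Five labels:

```python
# Toy sketch of embedding-based item-construct alignment via cosine
# similarity. Vectors are invented placeholders, not real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

construct_vecs = {
    "Extraversion":      [0.9, 0.1, 0.0],
    "Conscientiousness": [0.1, 0.9, 0.1],
}

def predict_construct(item_vec):
    """Return the construct whose embedding is nearest to the item's."""
    return max(construct_vecs, key=lambda c: cosine(item_vec, construct_vecs[c]))

# A hypothetical item embedding leaning toward Extraversion:
pred = predict_construct([0.8, 0.2, 0.1])
```

Accuracy of such a mapping against the test's intended item-construct keys is then directly comparable to human CVR-based judgments.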
The Use of Artificial Intelligence Tools in Assessing Content Validity: A Comparative Study with Human Experts
Gurdil, Hatice, Anadol, Hatice Ozlem, Soguksu, Yesim Beril
This study investigated whether AI evaluators assess the content validity of B1-level English reading comprehension test items in a manner similar to human evaluators. A 25-item multiple-choice test was developed, and its items were evaluated by four human and four AI evaluators. No statistically significant difference was found between the scores given by human and AI evaluators, and similar evaluation trends were observed. The Content Validity Ratio (CVR) and the Item Content Validity Index (I-CVI) were calculated and analyzed using the Wilcoxon Signed-Rank Test, which likewise showed no statistically significant difference. The findings revealed that in some cases AI evaluators could replace human evaluators. However, differences on specific items were thought to arise from varying interpretations of the evaluation criteria. Ensuring linguistic clarity and clearly defining the criteria could contribute to more consistent evaluations. In this regard, the development of hybrid evaluation systems, in which AI technologies are used alongside human experts, is recommended.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > Republic of Türkiye > Ankara Province > Ankara (0.04)
- Asia > Middle East > Iraq > Kurdistan Region (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.86)
- Education > Educational Setting (1.00)
- Education > Focused Education > Gifted Children (0.68)
- Health & Medicine > Consumer Health (0.68)
- Education > Assessment & Standards > Student Performance (0.48)
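The two indices named in the abstract above have standard definitions: Lawshe's CVR compares the number of experts rating an item "essential" against half the panel, and the I-CVI is the proportion of experts rating the item relevant (3 or 4 on a 4-point scale). A sketch with made-up ratings — the formulas are the conventional ones, not code from the study:

```python
# Content Validity Ratio (Lawshe) and Item-level Content Validity Index.
# Ratings below are invented for demonstration.

def cvr(n_essential, n_total):
    """Lawshe's CVR = (n_e - N/2) / (N/2); ranges from -1 to +1."""
    half = n_total / 2
    return (n_essential - half) / half

def i_cvi(relevance_ratings, threshold=3):
    """Proportion of experts rating the item >= threshold on a 4-point scale."""
    return sum(1 for r in relevance_ratings if r >= threshold) / len(relevance_ratings)

# Eight hypothetical experts: six call the item "essential" ...
ratio = cvr(n_essential=6, n_total=8)        # -> 0.5
# ... and their 4-point relevance ratings:
index = i_cvi([4, 4, 3, 3, 4, 2, 3, 4])      # -> 0.875
```

Per-item values like these, from human and AI panels in turn, are what a paired test such as the Wilcoxon Signed-Rank Test then compares.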
Automatic Item Generation for Personality Situational Judgment Tests with Large Language Models
Li, Chang-Jin, Zhang, Jiyuan, Tang, Yun, Li, Jian
Personality assessment, particularly through situational judgment tests (SJTs), is a vital tool for psychological research, talent selection, and educational evaluation. This study explores the potential of GPT-4, a state-of-the-art large language model (LLM), to automate the generation of personality situational judgment tests (PSJTs) in Chinese. Traditional SJT development is labor-intensive and prone to biases, while GPT-4 offers a scalable, efficient alternative. Two studies were conducted: Study 1 evaluated the impact of prompt design and temperature settings on content validity, finding that optimized prompts with a temperature of 1.0 produced creative and accurate items. Study 2 assessed the psychometric properties of GPT-4-generated PSJTs, revealing that they demonstrated satisfactory reliability and validity, surpassing the performance of manually developed tests in measuring the Big Five personality traits. This research highlights GPT-4's effectiveness in developing high-quality PSJTs, providing a scalable and innovative method for psychometric test development. These findings expand the possibilities of automatic item generation and the application of LLMs in psychology, and offer practical implications for streamlining test development processes in resource-limited settings.
- Asia > China > Beijing > Beijing (0.05)
- Asia > China > Hubei Province > Wuhan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Singapore (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Research Report > Promising Solution (0.66)
- Law (0.92)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.68)
- Education > Assessment & Standards (0.67)
- Education > Educational Setting > Higher Education (0.46)
Measuring Trust for Exoskeleton Systems
Stirling, Leia, Wu, Man I, Peng, Xiangyu
Wearable robotic systems are a class of robots that have a tight coupling between human and robot movements. Similar to non-wearable robots, it is important to measure the trust a person has that the robot can support achieving the desired goals. While some measures of trust may apply to all potential robotic roles, there are key distinctions between wearable and non-wearable robotic systems. In this paper, we considered the dimensions and sub-dimensions of trust, with example attributes defined for exoskeleton applications. As the research community comes together to discuss measures of trust, it will be important to consider how the selected measures support interpreting trust along different dimensions for the variety of robotic systems that are emerging in the field in a way that leads to actionable outcomes.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Colorado > Boulder County > Boulder (0.05)
- North America > United States > Michigan (0.05)
Teranga Go!: Carpooling Collaborative Consumption Community with multi-criteria hesitant fuzzy linguistic term set opinions to build confidence and trust
Montes, Rosana, Sanchez, Ana M., Villar, Pedro, Herrera, Francisco
Classic Delphi and Fuzzy Delphi methods are used to test the content validity of data collection tools such as questionnaires. Fuzzy Delphi takes the opinions issued by judges from a linguistic perspective, reducing ambiguity in opinions by using fuzzy numbers. We propose an extension named the 2-Tuple Fuzzy Linguistic Delphi method to deal with scenarios in which judges show different degrees of expertise, by using fuzzy multigranular semantics of the linguistic terms, and to obtain intermediate and final results expressed by 2-tuple linguistic values. The key idea of our proposal is to validate the full questionnaire by means of the evaluation of its parts, defining the validity of each item as a Decision Making problem. Taking the opinions of experts, we measure the degree of consensus, the degree of consistency, and the linguistic score of each item, in order to detect those items that affect, positively or negatively, the quality of the instrument. Considering the real need to evaluate a b-learning educational experience with a consensual questionnaire, we present a Decision Making model for questionnaire validation that solves this problem. Additionally, we contribute to this consensus-reaching problem by developing an online tool under the GPL v3 license. The software visualizes the collective valuations for each iteration and assists in determining which parts of the questionnaire should be modified to reach a consensual solution.
- South America > Argentina > Patagonia > Río Negro Province > Viedma (0.04)
- Europe > Spain > Andalusia > Granada Province > Granada (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
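The 2-tuple linguistic values mentioned above (due to Herrera and Martínez) represent an aggregated opinion as a pair (s_i, alpha): the nearest linguistic term plus a symbolic translation in [-0.5, 0.5). A minimal sketch with an invented 5-term scale — this is the standard representation, not the authors' implementation:

```python
# Herrera & Martinez's 2-tuple linguistic representation, the building
# block of a 2-tuple linguistic Delphi. The 5-term scale is invented.

TERMS = ["none", "low", "medium", "high", "total"]   # s_0 .. s_4

def to_two_tuple(beta):
    """Map beta in [0, g] to (nearest term, symbolic translation alpha)."""
    i = int(round(beta))
    return TERMS[i], round(beta - i, 4)

def to_beta(term, alpha):
    """Inverse mapping: recover the numeric value from a 2-tuple."""
    return TERMS.index(term) + alpha

# Aggregating three judges' opinions by arithmetic mean of their betas:
betas = [2.0, 3.0, 3.5]                # e.g. medium, high, just above high
mean_beta = sum(betas) / len(betas)
tuple_result = to_two_tuple(mean_beta) # -> ("high", -0.1667)
```

The gain over plain fuzzy aggregation is that the result stays interpretable ("high, slightly less") while losing no numeric precision, since `to_beta` inverts the mapping.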
Using Analytics on Student Created Data to Content Validate Pedagogical Tools
Kos, John, Eaton, Kenneth, Zhang, Sareen, Dass, Rahul, Buckley, Stephen, An, Sungeun, Goel, Ashok
Conceptual and simulation models can function as useful pedagogical tools; however, it is important to categorize different outcomes when evaluating them in order to interpret results more meaningfully. VERA is an ecology-based conceptual modeling software that enables users to simulate interactions between biotic and abiotic elements in an ecosystem, allowing users to form and then verify hypotheses by observing a time series of the species populations. In this paper, we classify these time series into common patterns found in the domain of ecological modeling through two methods, hierarchical clustering and curve fitting, illustrating a general methodology for showing content validity when combining different pedagogical tools. When applied to a diverse sample of 263 models containing 971 time series collected from three different VERA user categories — Georgia Tech (GATECH), North Georgia Technical College (NGTC), and "Self-Directed Learners" — the results showed agreement between the two classification methods on 89.38% of the sample curves in the test set. This serves as a good indication that our methodology for determining content validity was successful.
- North America > United States > Georgia > Fulton County > Atlanta (0.05)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Education (1.00)
- Health & Medicine (0.93)
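The curve-fitting half of the two-method classification above can be illustrated by fitting a least-squares slope to log-populations and labeling the series by that slope. The categories and threshold here are invented for illustration; the paper's actual pattern taxonomy and its hierarchical-clustering half are omitted:

```python
# Minimal sketch: classify a population time series by the OLS slope of
# its log-values. Labels and threshold are illustrative only.
import math

def fit_slope(ys):
    """Ordinary least-squares slope of log(y) against the time index."""
    xs = list(range(len(ys)))
    logs = [math.log(y) for y in ys]
    n = len(ys)
    mx, my = sum(xs) / n, sum(logs) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, logs))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def classify(ys, eps=0.01):
    s = fit_slope(ys)
    if s > eps:
        return "exponential growth"
    if s < -eps:
        return "exponential decay"
    return "equilibrium"

label = classify([10, 20, 40, 80, 160])   # doubles each step -> growth
```

Agreement between such a fit-based label and a cluster-based label for the same series is the kind of cross-method check the paper uses as evidence of content validity.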
Design and Assessment of a Bimanual Haptic Epidural Needle Insertion Simulator
Davidor, Nitsan, Binyamin, Yair, Hayuni, Tamar, Nisky, Ilana
A lack of case experience among anesthesiologists is one of the leading causes of accidental dural punctures and failed epidurals - the most common complications of epidural analgesia used for pain relief during delivery. We designed a bimanual haptic simulator to train anesthesiologists and optimize epidural analgesia skill acquisition. We present an assessment study conducted with 22 anesthesiologists of different competency levels from several Israeli hospitals. Our simulator emulates the forces applied to the epidural (Tuohy) needle, held by one hand, and those applied to the Loss of Resistance (LOR) syringe, held by the other. The resistance is calculated based on a model of the epidural region layers, parameterized by the weight of the patient. We measured the movements of both haptic devices and quantified outcome rates (success, failed epidurals, and dural punctures), insertion strategies, and the participants' answers to questionnaires about their perception of the simulation's realism. We demonstrated good construct validity by showing that the simulator can distinguish between real-life novices and experts. Face and content validity were examined by studying users' impressions regarding the simulator's realism and fulfillment of purpose. We found differences in strategies between anesthesiologists at different competency levels, and suggest trainee-based instruction in advanced training stages.
- Asia > Middle East > Israel > Southern District > Beer-Sheva (0.04)
- South America > Brazil (0.04)
- North America > United States (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (0.88)
- Research Report > Strength High (0.67)
- Health & Medicine > Surgery (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.93)
5 Free Tools For Detecting ChatGPT, GPT3, and GPT2 - KDnuggets
After the launch of ChatGPT, Pandora's box was opened. We are now observing a technological shift in the way we work. People are creating websites and apps, and even writing novels, using ChatGPT. With all the hype and the introduction of generative AI tools, we have also seen a rise in bad actors. If you follow the latest news, you must have heard that ChatGPT passed a Wharton MBA exam.
Virtual Reality Simulator for Fetoscopic Spina Bifida Repair Surgery
Korzeniowski, Przemysław, Płotka, Szymon, Brawura-Biskupski-Samaha, Robert, Sitek, Arkadiusz
Spina Bifida (SB) is a birth defect that develops during the early stage of pregnancy, in which the spine fails to close completely around the spinal cord. The growing interest in fetoscopic Spina Bifida repair, which is performed on fetuses still in the uterus, prompts the need for appropriate training. The learning curve for such procedures is steep and requires excellent procedural skills. Computer-based virtual reality (VR) simulation systems offer a safe, cost-effective, and configurable training environment free from ethical and patient-safety issues. However, to the best of our knowledge, there are currently no commercial or experimental VR training simulation systems available for fetoscopic SB-repair procedures. In this paper, we propose a novel VR simulator for core manual skills training for SB repair. An initial simulation realism validation study was carried out by obtaining subjective feedback (face and content validity) from 14 clinicians. The overall simulation realism was marked on average 4.07 on a 5-point Likert scale (1 - very unrealistic, 5 - very realistic). Its usefulness as a training tool for SB repair, and for learning fundamental laparoscopic skills, was marked 4.63 and 4.80, respectively. These results indicate that VR simulation of fetoscopic procedures may contribute to surgical training without putting fetuses and their mothers at risk. It could also facilitate wider adoption of fetoscopic procedures in place of much more invasive open fetal surgeries.
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- (3 more...)