rater
- North America > United States > Washington (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- North America > Dominican Republic (0.04)
- Health & Medicine (1.00)
- Education > Educational Setting (0.46)
- Leisure & Entertainment > Games (0.46)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- Information Technology > Communications > Mobile (0.71)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
- Information Technology > Human Computer Interaction (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Research Report > Experimental Study (0.98)
- Research Report > New Finding (0.70)
- Personal (0.68)
- Law (1.00)
- Information Technology > Security & Privacy (0.69)
- Asia > India (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States (0.14)
- South America > Brazil (0.04)
- North America > Mexico (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Rho-Perfect: Correlation Ceiling For Subjective Evaluation Datasets
Subjective ratings contain inherent noise that limits the achievable model-human correlation, but this reliability issue is rarely quantified. In this paper, we present ρ-Perfect, a practical estimate of the highest correlation a model can achieve on subjectively rated datasets. We define ρ-Perfect as the correlation between a perfect predictor and human ratings, and derive an estimate of its value under heteroscedastic noise, which is common in subjectively rated datasets. We show that ρ-Perfect squared estimates the test-retest correlation and use this relationship to validate the estimate. We demonstrate ρ-Perfect on a speech quality dataset and show how the measure can distinguish between model limitations and data quality issues.
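The relationship stated in the abstract, that ρ-Perfect squared estimates the test-retest correlation, can be illustrated with a small simulation. This is a sketch under assumptions that are not taken from the paper: the function name `simulate_rho_perfect`, the normal distribution of true scores, the uniform range of per-item noise levels, and the five raters per item are all hypothetical choices made for illustration. It does not reproduce the paper's estimator, which works from ratings alone, since here the true scores are known by construction.

```python
import numpy as np


def simulate_rho_perfect(n_items=20000, n_raters=5, seed=0):
    """Compare corr(truth, ratings)^2 with the test-retest correlation.

    All distributions and parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical "true" quality of each item (what a perfect predictor outputs).
    true_score = rng.normal(3.0, 1.0, n_items)

    # Heteroscedastic noise: every item gets its own rating-noise level.
    noise_sd = rng.uniform(0.3, 1.5, n_items)

    def mean_rating():
        # Average of n_raters independent noisy ratings per item.
        noise = rng.normal(0.0, 1.0, (n_raters, n_items)) * noise_sd
        return true_score + noise.mean(axis=0)

    test = mean_rating()    # first rating session
    retest = mean_rating()  # independent second session

    # Correlation of a perfect predictor with the observed ratings.
    rho_perfect = np.corrcoef(true_score, test)[0, 1]
    # Correlation between the two independent rating sessions.
    test_retest = np.corrcoef(test, retest)[0, 1]
    return rho_perfect, test_retest


if __name__ == "__main__":
    rho, retest_corr = simulate_rho_perfect()
    print(f"rho_perfect^2 = {rho**2:.3f}, test-retest = {retest_corr:.3f}")
```

In this setup the two printed values should agree closely, because both equal the ratio of true-score variance to total rating variance when the noise has zero mean and is independent across sessions.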
- Europe > Sweden (0.40)
- North America > United States > Iowa > Johnson County > Iowa City (0.14)
- North America > United States > Massachusetts > Middlesex County > Reading (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- Asia > Middle East > Israel (0.05)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
A Principle-based Framework for the Development and Evaluation of Large Language Models for Health and Wellness
Winslow, Brent; Shreibati, Jacqueline; Perez, Javier; Su, Hao-Wei; Young-Lin, Nichole; Hammerquist, Nova; McDuff, Daniel; Guss, Jason; Vafeiadou, Jenny; Cain, Nick; Lin, Alex; Schenck, Erik; Rajagopal, Shiva; Chung, Jia-Ru; Venkatakrishnan, Anusha; Lee, Amy Armento; Karimzadehgan, Maryam; Meng, Qingyou; Agarwal, Rythm; Natarajan, Aravind; Giest, Tracy
The incorporation of generative artificial intelligence into personal health applications presents a transformative opportunity for personalized, data-driven health and fitness guidance, yet also poses challenges related to user safety, model accuracy, and personal privacy. To address these challenges, a novel, principle-based framework was developed and validated for the systematic evaluation of large language models (LLMs) applied to personal health and wellness. First, the development of the Fitbit Insights explorer, an LLM-powered system designed to help users interpret their personal health data, is described. Subsequently, the safety, helpfulness, accuracy, relevance, and personalization (SHARP) principle-based framework is introduced as an end-to-end operational methodology that integrates comprehensive evaluation techniques, including human evaluation by generalists and clinical specialists, autorater assessments, and adversarial testing, into an iterative development lifecycle. Through the application of this framework to the Fitbit Insights explorer in a staged deployment involving over 13,000 consented users, challenges not apparent during initial testing were systematically identified. This process guided targeted improvements to the system and demonstrated the necessity of combining isolated technical evaluations with real-world user feedback. Finally, a comprehensive, actionable approach is established for the responsible development and deployment of LLM-powered health applications, providing a standardized methodology to foster innovation while ensuring emerging technologies are safe, effective, and trustworthy for users.
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)