Rho-Perfect: Correlation Ceiling For Subjective Evaluation Datasets
Subjective ratings contain inherent noise that limits model-human correlation, but this reliability issue is rarely quantified. In this paper, we present ρ-Perfect, a practical estimate of the highest correlation a model can achieve on subjectively rated datasets. We define ρ-Perfect as the correlation between a perfect predictor and human ratings, and derive an estimate of its value under heteroscedastic noise, a common occurrence in subjectively rated data. We show that ρ-Perfect squared estimates the test-retest correlation and use this relationship to validate the estimate. We demonstrate ρ-Perfect on a speech quality dataset and show how the measure distinguishes model limitations from data quality issues.
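The claim that ρ-Perfect squared estimates the test-retest correlation can be checked with a small simulation. The sketch below is illustrative only, not the paper's code: the latent scores, the uniform range of per-item noise standard deviations, and all variable names are assumptions. Two independent rating passes share the same latent score, so their correlation equals the squared correlation between the latent score (a "perfect predictor") and either pass.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_score = rng.normal(0.0, 1.0, n)            # latent "true" quality per item
noise_sd = rng.uniform(0.3, 1.5, n)             # heteroscedastic per-item noise level
rating_a = true_score + rng.normal(0.0, noise_sd)  # first rating pass
rating_b = true_score + rng.normal(0.0, noise_sd)  # independent retest pass

# Correlation of a perfect predictor (the latent score) with noisy ratings:
rho_perfect = np.corrcoef(true_score, rating_a)[0, 1]
# Test-retest correlation between the two independent passes:
test_retest = np.corrcoef(rating_a, rating_b)[0, 1]

# The two quantities should nearly coincide for large n.
print(round(rho_perfect**2, 3), round(test_retest, 3))
```

The identity follows because the independent noise terms cancel in the cross-covariance of the two passes, leaving the latent-score variance in the numerator of both expressions.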
Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
Oikarinen, Tuomas, Yan, Ge, Kulkarni, Akshay, Weng, Tsui-Wei
Interpreting individual neurons or directions in activation space is an important topic in mechanistic interpretability. Numerous automated interpretability methods have been proposed to generate such explanations, but it remains unclear how reliable these explanations are, and which methods produce the most accurate descriptions. While crowdsourced evaluations are commonly used, existing pipelines are noisy, costly, and typically assess only the highest-activating inputs, leading to unreliable results. In this paper, we introduce two techniques to enable cost-effective and accurate crowdsourced evaluation of automated interpretability methods beyond top-activating inputs. First, we propose Model-Guided Importance Sampling (MG-IS) to select the most informative inputs to show human raters. In our experiments, we show this reduces the number of inputs needed to reach the same evaluation accuracy by ~13x. Second, we address label noise in crowdsourced ratings through Bayesian Rating Aggregation (BRAgg), which allows us to reduce the number of ratings per input required to overcome noise by ~3x. Together, these techniques reduce the evaluation cost by ~40x, making large-scale evaluation feasible. Finally, we use our methods to conduct a large-scale crowdsourced study comparing recent automated interpretability methods for vision networks.
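The abstract does not specify BRAgg's model, so the sketch below shows only the general idea of Bayesian rating aggregation: pooling noisy binary votes into a posterior over the true label. The symmetric rater-accuracy model, the function name, and all parameters are assumptions for illustration, not the paper's method.

```python
import numpy as np

def bayes_aggregate(votes, rater_acc=0.75, prior=0.5):
    """Posterior P(label = 1) given binary votes from raters.

    Illustrative model: every rater reports the true label with the same
    known probability `rater_acc`, independently; `prior` is P(label = 1).
    """
    log_odds = np.log(prior / (1.0 - prior))
    # Each vote shifts the log-odds by the rater's log-likelihood ratio:
    # +lr for a positive vote, -lr for a negative one.
    lr = np.log(rater_acc / (1.0 - rater_acc))
    log_odds += lr * (2 * np.asarray(votes) - 1).sum()
    return 1.0 / (1.0 + np.exp(-log_odds))

# Two positive votes and one negative: the net evidence is a single
# positive vote, so the posterior equals a single rater's accuracy.
posterior = bayes_aggregate([1, 1, 0])
print(posterior)  # ≈ 0.75
```

Under such a model, far fewer ratings per input are needed to reach a target posterior confidence than under plain majority vote, which is the kind of saving the abstract reports.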
Stable diffusion models reveal a persisting human and AI gap in visual creativity
Rondini, Silvia, Alvarez-Martin, Claudia, Angermair-Barkai, Paula, Penacchio, Olivier, Paz, M., Pelowski, Matthew, Dediu, Dan, Rodriguez-Fornells, Antoni, Cerda-Company, Xim
While recent research suggests Large Language Models match human creative performance in divergent thinking tasks, visual creativity remains underexplored. This study compared image generation by human participants (Visual Artists and Non Artists) and by an image-generation AI model (two prompting conditions with varying human input: high for Human Inspired, low for Self Guided). Human raters (N=255) and GPT-4o evaluated the creativity of the resulting images. We found a clear creativity gradient, with Visual Artists being the most creative, followed by Non Artists, then Human Inspired generative AI, and finally Self Guided generative AI. Increased human guidance strongly improved GenAI's creative output, bringing its productions close to those of Non Artists. Notably, human and AI raters also showed vastly different creativity judgment patterns. These results suggest that, in contrast to language-centered tasks, GenAI models may face unique challenges in visual domains, where creativity depends on perceptual nuance and contextual sensitivity, distinctly human capacities that may not be readily transferable from language models.