aesthetics
The Good Robot podcast: what makes a drone "good"? with Beryl Pong
The Good Robot podcast: what makes a drone "good"? Hosted by Eleanor Drage and Kerry McInerney, The Good Robot is a podcast which explores the many complex intersections between gender, feminism and technology. What makes a drone "good"? In this episode, we talk to Beryl Pong, UKRI Future Leaders Fellow at the University of Cambridge, where she leads the Centre for Drones and Culture. Beryl reflects on what it means to think about drones as "good" or "ethical" technologies and how it can be assessed through its socio-political context.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
- North America > United States > Alaska (0.07)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.68)
- Information Technology > Artificial Intelligence > Games > Go (0.63)
- North America > United States > California > Santa Clara County > Palo Alto (0.15)
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Overview (1.00)
- Research Report > New Finding (0.46)
- Information Technology (0.92)
- Law > Intellectual Property & Technology Law (0.68)
- Health & Medicine > Therapeutic Area (0.46)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.49)
- North America > United States > California > Santa Clara County > Palo Alto (0.15)
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Overview (1.00)
- Research Report > New Finding (0.46)
- Information Technology (0.92)
- Law > Intellectual Property & Technology Law (0.68)
- Health & Medicine > Therapeutic Area (0.46)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.49)
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
Modern vision models are trained on very large noisy datasets. While these models acquire strong capabilities, they may not follow the user's intent to output the desired results in certain aspects, e.g., visual aesthetic, preferred style, and responsibility. In this paper, we target the realm of visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system. Advanced retrieval systems usually adopt a cascade of aesthetic models as re-rankers or filters, which are limited to low-level features like saturation and perform poorly when stylistic, cultural or knowledge contexts are involved. We find that utilizing the reasoning ability of large language models (LLMs) to rephrase the search query and extend the aesthetic expectations can make up for this shortcoming.
Beyond Aesthetics: Cultural Competence in Text-to-Image Models
Text-to-Image (T2I) models are being increasingly adopted in diverse global communities where they create visual representations of their unique cultures. Current T2I benchmarks primarily focus on faithfulness, aesthetics, and realism of generated images, overlooking the critical dimension of . In this work, we introduce a framework to evaluate cultural competence of T2I models along two crucial dimensions: cultural awareness and cultural diversity, and present a scalable approach using a combination of structured knowledge bases and large language models to build a large dataset of cultural artifacts to enable this evaluation. In particular, we apply this approach to build CUBE (CUltural BEnchmark for Text-to-Image models), a first-of-its-kind benchmark to evaluate cultural competence of T2I models. CUBE covers cultural artifacts associated with 8 countries across different geo-cultural regions and along 3 concepts: cuisine, landmarks, and art. CUBE consists of 1) CUBE-1K, a set of high-quality prompts that enable the evaluation of cultural awareness, and 2) CUBE-CSpace, a larger dataset of cultural artifacts that serves as grounding to evaluate cultural diversity. We also introduce cultural diversity as a novel T2I evaluation component, leveraging quality-weighted Vendi score. Our evaluations reveal significant gaps in the cultural awareness of existing models across countries and provide valuable insights into the cultural diversity of T2I outputs for underspecified prompts. Our methodology is extendable to other cultural regions and concepts and can facilitate the development of T2I models that better cater to the global population.
PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models
While reinforcement learning has advanced the alignment of text-to-image (T2I) models, state-of-the-art policy gradient methods are still hampered by training instability and high variance, hindering convergence speed and compromising image quality. Our analysis identifies a key cause of this instability: disproportionate credit assignment, in which the mathematical structure of the generative sampler produces volatile and non-proportional feedback across timesteps. To address this, we introduce Proportionate Credit Policy Optimization (PCPO), a framework that enforces proportional credit assignment through a stable objective reformulation and a principled reweighting of timesteps. This correction stabilizes the training process, leading to significantly accelerated convergence and superior image quality. The improvement in quality is a direct result of mitigating model collapse, a common failure mode in recursive training. PCPO substantially outperforms existing policy gradient baselines on all fronts, including the state-of-the-art DanceGRPO. Code is available at https://github.com/jaylee2000/pcpo/.
- North America > United States (0.04)
- North America > Dominican Republic (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards
In recent years, Image Quality Assessment (IQA) for AIgenerated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. W e introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. W e construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. W e apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.93)
From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
Chen, Yiming, Han, Junlin, Bai, Tianyi, Tong, Shengbang, Kokkinos, Filippos, Torr, Philip
While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image-identifying objects and describing scenes-they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model's alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transferable to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- North America > United States > New York (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Asia > China (0.04)
- North America > Canada > Newfoundland and Labrador > Labrador (0.04)
- Asia (0.04)