 descriptiveness




LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences

arXiv.org Artificial Intelligence

Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias) while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.


Geopolitical Parallax: Beyond Walter Lippmann Just After Large Language Models

arXiv.org Artificial Intelligence

Objectivity in journalism has long been contested, oscillating between ideals of neutral, fact-based reporting and the inevitability of subjective framing. With the advent of large language models (LLMs), these tensions are now mediated by algorithmic systems whose training data and design choices may themselves embed cultural or ideological biases. This study investigates geopolitical parallax (systematic divergence in news quality and subjectivity assessments) by comparing article-level embeddings from Chinese-origin (Qwen, BGE, Jina) and Western-origin (Snowflake, Granite) model families. We evaluate both on a human-annotated news quality benchmark spanning fifteen stylistic, informational, and affective dimensions, and on parallel corpora covering politically sensitive topics, including Palestine and reciprocal China-United States coverage. Using logistic regression probes and matched-topic evaluation, we quantify per-metric differences in predicted positive-class probabilities between model families. Our findings reveal consistent, non-random divergences aligned with model origin. In Palestine-related coverage, Western models assign higher subjectivity and positive emotion scores, while Chinese models emphasize novelty and descriptiveness. Cross-topic analysis shows asymmetries in structural quality metrics, with Chinese-on-US scoring notably lower in fluency, conciseness, technicality, and overall quality, contrasted by higher negative emotion scores. These patterns align with media bias theory and our distinction between semantic, emotional, and relational subjectivity, and extend LLM bias literature by showing that geopolitical framing effects persist in downstream quality assessment tasks. We conclude that LLM-based media evaluation pipelines require cultural calibration to avoid conflating content differences with model-induced bias.
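
The per-metric comparison mentioned in this abstract can be pictured with a small sketch. It assumes the probes have already produced positive-class probabilities for each metric on the same matched articles; the function name `per_metric_gap`, the metric names used as keys, and all numbers are made up for illustration and are not the paper's code or results.

```python
from statistics import mean


def per_metric_gap(probs_a, probs_b):
    """Return mean(probs_a) - mean(probs_b) for every metric present in both.

    probs_a / probs_b map a metric name to a list of predicted
    positive-class probabilities, one per article, for one model family.
    """
    gaps = {}
    for metric in sorted(probs_a.keys() & probs_b.keys()):
        gaps[metric] = mean(probs_a[metric]) - mean(probs_b[metric])
    return gaps


# Illustrative, invented probabilities (NOT results from the paper).
family_a = {"subjectivity": [0.70, 0.80, 0.60], "novelty": [0.30, 0.40, 0.20]}
family_b = {"subjectivity": [0.50, 0.60, 0.40], "novelty": [0.50, 0.60, 0.40]}

gaps = per_metric_gap(family_a, family_b)
for metric, gap in gaps.items():
    print(f"{metric}: {gap:+.2f}")
```

A positive gap means family A assigns a higher average probability to that metric than family B on the same articles. Whether such a gap is non-random would need a significance test, which this sketch does not include.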


Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions

arXiv.org Artificial Intelligence

Recent research increasingly focuses on training vision-language models (VLMs) with long, detailed image captions. However, small-scale VLMs often struggle to balance the richness of these captions with the risk of hallucinating content during fine-tuning. In this paper, we explore how well VLMs adapt to such captions. To quantify caption quality, we propose Decomposed NLI (DNLI), an evaluation framework that breaks down generated captions into individual propositions, assessing each in isolation. This fine-grained analysis reveals a critical balance between capturing descriptive details and preventing hallucinations. Our findings show that simply reducing caption complexity or employing standard data curation techniques does not effectively resolve this issue. To tackle this challenge, we introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data with the model's existing knowledge and visual understanding. KnowAda minimizes hallucinations while preserving high descriptiveness. We validate this approach across several small-scale VLMs (up to 7B parameters) and dense caption datasets, demonstrating that KnowAda effectively balances hallucination reduction and descriptiveness. Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations. We will release our code and models.
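
The idea of scoring a caption proposition by proposition can be sketched roughly as follows. This is only an illustration of the decomposition step described in the abstract, not the authors' DNLI implementation: the sentence-based splitter, the `toy_judge` lookup that stands in for an NLI model, and every name in the block are assumptions.

```python
import re


def split_into_propositions(caption):
    """Naive stand-in for decomposition: treat each sentence as one proposition."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def score_caption(caption, is_supported):
    """Judge each proposition in isolation with the supplied callable."""
    props = split_into_propositions(caption)
    verdicts = [(p, bool(is_supported(p))) for p in props]
    supported = sum(1 for _, ok in verdicts if ok)
    return {
        "verdicts": verdicts,
        "num_supported": supported,  # rough proxy for descriptiveness
        "num_unsupported": len(props) - supported,  # rough proxy for hallucination
        "supported_fraction": supported / len(props) if props else 0.0,
    }


# Hypothetical ground truth; a real system would query an NLI model instead.
reference_facts = {"a dog sits on a sofa.", "the sofa is red."}


def toy_judge(proposition):
    return proposition.lower() in reference_facts


result = score_caption(
    "A dog sits on a sofa. The sofa is red. A cat sleeps nearby.", toy_judge
)
print(result["num_supported"], result["num_unsupported"])
```

Counting supported and unsupported propositions separately makes the trade-off in the abstract visible: a longer caption can raise the first count while also raising the second.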


Interpreting Inflammation Prediction Model via Tag-based Cohort Explanation

arXiv.org Artificial Intelligence

One significant application is in nutrition science, where ML models can provide dietary recommendations, detect food quality and safety issues during production, and surveil public health and epidemiology. However, the complex and often opaque nature of these models presents challenges in understanding and trusting their predictions. To address these issues, explainability techniques have garnered considerable interest, aiming to make ML models more interpretable and transparent. Explainability can be approached from different perspectives, including local explanations that focus on individual predictions and global explanations that provide insights into the overall behavior of the model. However, there is a growing need for intermediate-level explanations that balance these two extremes, offering contextually relevant insights that are both comprehensive and specific (Sokol and Flach, 2020; Arrieta et al., 2020; Adadi and Berrada, 2018). Cohort explainability, also referred to as subgroup explainability, explains model predictions by analyzing groups of instances with shared characteristics and emerges as a promising solution to this challenge.


Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation

arXiv.org Artificial Intelligence

Image captioning aims to describe visual content in natural language. As 'a picture is worth a thousand words', there could be various correct descriptions for an image. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches with the label. For instance, when the model predicts a word expressing richer semantics than the label, it will be penalized and optimized to prefer more concise expressions, referred to as conciseness optimization. In contrast, predictions that are more concise than labels lead to richness optimization. Such conflicting optimization directions could eventually result in the model generating general descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus encouraging the model to generate longer captions with more details. Extensive experiments on two mainstream image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works.


From Probability to Consilience: How Explanatory Values Implement Bayesian Reasoning

arXiv.org Artificial Intelligence

Recent work in cognitive science has uncovered a diversity of explanatory values, or dimensions along which we judge explanations as better or worse. We propose a Bayesian account of how these values fit together to guide explanation. The resulting taxonomy provides a set of predictors for which explanations people prefer and shows how core values from psychology, statistics, and the philosophy of science emerge from a common mathematical framework. In addition to operationalizing the explanatory virtues associated with, for example, scientific argument-making, this framework also enables us to reinterpret the explanatory vices that drive conspiracy theories, delusions, and extremist ideologies. Intuitively, philosophically, and as seen in laboratory experiments, explanations are judged as better or worse on the basis of many different criteria. These explanatory values appear in early childhood [1, 2, 3, 4, 5] and their influence extends to some of the most sophisticated social knowledge formation processes we know [6]. We lack, however, an understanding of the origin of these values or an account of how they fit together to guide belief formation. The multiplicity of values also appears to conflict with Bayesian models of cognition, which speak solely in terms of degrees of beliefs and suggest we judge explanations as better or worse on the basis of a single quantity, the posterior likelihood (see Glossary). In this opinion, we show how to resolve these conflicts by arguing that previously-identified explanatory values capture different components of a full Bayesian calculation and, when considered together and weighed appropriately, implement Bayesian cognition. 
First, this framework shows how key explanatory values identified by laboratory experiments and philosophers of science (co-explanation, descriptiveness, precision, unification, power, and simplicity) emerge naturally from the mathematical structure of probabilistic inference, thereby reconciling them with Bayesian models of cognition [7, 8]. Second, it shows how these values combine to produce preferences for one explanation over another.


Improving latent variable descriptiveness with AutoGen

arXiv.org Machine Learning

Powerful generative models, particularly in Natural Language Modelling, are commonly trained by maximizing a variational lower bound on the data log likelihood. These models often suffer from poor use of their latent variable, with ad-hoc annealing factors used to encourage retention of information in the latent variable. We discuss an alternative and general approach to latent variable modelling, based on an objective that combines the data log likelihood as well as the likelihood of a perfect reconstruction through an autoencoder. Tying these together ensures by design that the latent variable captures information about the observations, whilst retaining the ability to generate well. Interestingly, though this approach is a priori unrelated to VAEs, the lower bound attained is identical to the standard VAE bound but with the addition of a simple pre-factor; thus, providing a formal interpretation of the commonly used, ad-hoc pre-factors in training VAEs.
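
For orientation, the relationship described in this abstract can be written out as below. The first line is the standard VAE lower bound. The second shows one way a pre-factor could enter; the symbol $c$ and its placement on the reconstruction term are assumptions made here for illustration, since the abstract does not state where the pre-factor appears.

```latex
% Standard VAE lower bound on the data log likelihood
\mathcal{L}_{\mathrm{VAE}}(x)
  = \mathbb{E}_{q(z \mid x)}\!\left[\log p(x \mid z)\right]
  - \mathrm{KL}\!\left(q(z \mid x) \,\|\, p(z)\right)
  \le \log p(x)

% Same bound with a pre-factor c (symbol and placement assumed)
\mathcal{L}_{c}(x)
  = c \, \mathbb{E}_{q(z \mid x)}\!\left[\log p(x \mid z)\right]
  - \mathrm{KL}\!\left(q(z \mid x) \,\|\, p(z)\right)
```

Under this assumed form, $c = 1$ recovers the standard bound, and $c > 1$ weights reconstruction more heavily relative to the KL term, which is the effect that ad-hoc annealing factors are typically used to achieve.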


SentiCap: Generating Image Descriptions with Sentiments

AAAI Conferences

The recent progress on image recognition and language modeling is making automatic description of image content a reality. However, stylized, non-factual aspects of the written description are missing from the current systems. One such style is descriptions with emotions, which is commonplace in everyday communication, and influences decision-making and interpersonal relationships. We design a system to describe an image with emotions, and present a model that automatically generates captions with positive or negative sentiments. We propose a novel switching recurrent neural network with word-level regularization, which is able to produce emotional image captions using only 2000+ training sentences containing sentiments. We evaluate the captions with different automatic and crowd-sourcing metrics. Our model compares favourably in common quality metrics for image captioning. In 84.6% of cases the generated positive captions were judged as being at least as descriptive as the factual captions. Of these positive captions 88% were confirmed by the crowd-sourced workers as having the appropriate sentiment.