VHELM: A Holistic Evaluation of Vision Language Models
Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity. Furthermore, they differ in their evaluation procedures and in the scope of the evaluation, making it difficult to compare models. To address these issues, we extend the HELM framework to VLMs and present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of nine aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of VLMs across these important factors.
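Aggregating heterogeneous datasets into per-aspect scores can be sketched roughly as follows. This is a minimal illustration, not VHELM's actual pipeline: the model names, aspect names, and scores below are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical per-dataset scores keyed by (model, aspect); each aspect
# may be covered by several datasets, so values are lists.
scores = {
    ("model-a", "fairness"): [0.71, 0.65],
    ("model-a", "toxicity"): [0.90],
    ("model-b", "fairness"): [0.60, 0.58],
    ("model-b", "toxicity"): [0.95],
}

def aspect_profile(scores):
    """Average each model's dataset scores per aspect, yielding one
    multi-dimensional row per model."""
    profile = defaultdict(dict)
    for (model, aspect), values in scores.items():
        profile[model][aspect] = sum(values) / len(values)
    return dict(profile)

profile = aspect_profile(scores)
```

Averaging within an aspect before comparing models is what makes the view "holistic": a model cannot hide a weak aspect behind a strong one.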
Large Language Model-Based Generation of Discharge Summaries
Rodrigues, Tiago, Lopes, Carla Teixeira
Discharge Summaries are documents written by medical professionals that detail a patient's visit to a care facility. They contain a wealth of information crucial for patient care, and automating their generation could significantly reduce the effort required from healthcare professionals, minimize errors, and ensure that critical patient information is easily accessible and actionable. In this work, we explore the use of five Large Language Models on this task, from open-source models (Mistral, Llama 2) to proprietary systems (GPT-3, GPT-4, Gemini 1.5 Pro), leveraging MIMIC-III summaries and notes. We evaluate them using exact-match, soft-overlap, and reference-free metrics. Our results show that proprietary models, particularly Gemini with one-shot prompting, outperformed others, producing summaries with the highest similarity to the gold-standard ones. Open-source models, while promising, especially Mistral after fine-tuning, lagged in performance, often struggling with hallucinations and repeated information. Human evaluation by a clinical expert confirmed the practical utility of the summaries generated by proprietary models. Despite the challenges, such as hallucinations and missing information, the findings suggest that LLMs, especially proprietary models, are promising candidates for automatic discharge summary generation as long as data privacy is ensured.
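The soft-overlap family of metrics mentioned above can be illustrated with a unigram-overlap F1 (a simple stand-in for metrics such as ROUGE-1; this sketch is not the paper's exact evaluation code, and the tokenization is deliberately naive).

```python
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated summary and a gold one.
    Counter intersection clips repeated tokens to their minimum count."""
    gen, ref = generated.lower().split(), reference.lower().split()
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: partial overlap between a candidate and a gold summary.
score = token_f1("the patient was discharged", "patient discharged home")
```

Exact-match metrics demand identical strings, while reference-free metrics drop the gold summary altogether; soft overlap sits between the two, rewarding partial agreement.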
Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models
Chatterjee, Sreejato, Tran, Linh, Nguyen, Quoc Duy, Kirson, Roni, Hamlin, Drue, Aquino, Harvest, Lyu, Hanjia, Luo, Jiebo, Dye, Timothy
Abstract: Traditional efforts to measure historical structural oppression struggle with cross-national validity due to the unique, locally specified histories of exclusion, colonization, and social status in each country, and they have often relied on structured indices that privilege material resources while overlooking lived, identity-based exclusion. We introduce a novel framework for oppression measurement that leverages Large Language Models (LLMs) to generate context-sensitive scores of lived historical disadvantage across diverse geopolitical settings. Using unstructured self-identified ethnicity utterances from a multilingual COVID-19 global study, we design rule-guided prompting strategies that encourage models to produce interpretable, theoretically grounded estimations of oppression. We systematically evaluate these strategies across multiple state-of-the-art LLMs. Our results demonstrate that LLMs, when guided by explicit rules, can capture nuanced forms of identity-based historical oppression within nations. This approach provides a complementary measurement tool that highlights dimensions of systemic exclusion, offering a scalable, cross-cultural lens for understanding how oppression manifests in data-driven research and public health contexts.

The study of racial and ethnic inequality remains central to sociological research, with an extensive literature documenting how structural oppression is reproduced in historical and contemporary contexts [1]-[3]. Oppression can be understood as a social hierarchy in which some groups subject others to lower status and to systemic exclusion, dehumanization, and disadvantage. In public health and sociology, this understanding of oppression is closely aligned with definitions of systemic and structural racism, which describe racism as deeply embedded in laws, policies, institutional practices, and social norms that sustain widespread inequities, violence, and disadvantage over time [1].
Foundational works have demonstrated how ethnic and national hierarchies shape access to power, life opportunities, autonomy, and sovereignty, primarily through institutionalized mechanisms such as legal structures, educational systems, and healthcare access, among others [2].
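The rule-guided prompting strategy described above can be sketched as a prompt builder plus a constrained parser. The rule wording, response format, and score range here are illustrative assumptions, not the paper's actual rubric.

```python
import re

# Illustrative rule set; the study's real rubric and phrasing differ.
RULES = [
    "Score 0 (no historical disadvantage) to 1 (severe historical disadvantage).",
    "Ground the score in documented exclusion, colonization, or status hierarchies.",
    "If the utterance gives too little context, answer SCORE: ABSTAIN.",
]

def build_prompt(ethnicity_utterance: str, country: str) -> str:
    """Assemble a rule-guided prompt for a hypothetical LLM scoring call."""
    rules = "\n".join(f"- {r}" for r in RULES)
    return (
        f"Rules:\n{rules}\n\n"
        f"Country: {country}\n"
        f"Self-identified ethnicity: {ethnicity_utterance}\n"
        "Answer with 'SCORE: <number>' or 'SCORE: ABSTAIN'."
    )

def parse_score(reply: str):
    """Extract the numeric score from a model reply; return None on
    abstention or malformed output rather than guessing."""
    match = re.search(r"SCORE:\s*([0-9.]+|ABSTAIN)", reply)
    if match is None or match.group(1) == "ABSTAIN":
        return None
    return float(match.group(1))
```

Making the rules explicit in the prompt, and parsing only a constrained response format, is what keeps the resulting scores interpretable and comparable across models.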
VHELM: A Holistic Evaluation of Vision Language Models
Lee, Tony, Tu, Haoqin, Wong, Chi Heem
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast. Our initial run evaluates 22 VLMs on 21 existing datasets to provide a holistic snapshot of the models. We uncover new key findings, such as the fact that efficiency-focused models (e.g., Claude 3 Haiku or Gemini 1.5 Flash) perform significantly worse than their full counterparts on the bias benchmark but not on the other aspects.
FairJudge: MLLM Judging for Social Attributes and Prompt Image Alignment
Sahili, Zahraa Al, Fetanat, Maryam, Nowaz, Maimuna, Patras, Ioannis, Purver, Matthew
Text-to-image (T2I) systems lack simple, reproducible ways to evaluate how well images match prompts and how models treat social attributes. Common proxies (face classifiers and contrastive similarity) reward surface cues, lack calibrated abstention, and miss attributes only weakly visible (for example, religion, culture, disability). We present FairJudge, a lightweight protocol that treats instruction-following multimodal LLMs as fair judges. It scores alignment with an explanation-oriented rubric mapped to [-1, 1]; constrains judgments to a closed label set; requires evidence grounded in the visible content; and mandates abstention when cues are insufficient. Unlike CLIP-only pipelines, FairJudge yields accountable, evidence-aware decisions; unlike mitigation that alters generators, it targets evaluation fairness. We evaluate gender, race, and age on FairFace, PaTA, and FairCoT; extend to religion, culture, and disability; and assess profession correctness and alignment on IdenProf, FairCoT-Professions, and our new DIVERSIFY-Professions. We also release DIVERSIFY, a 469-image corpus of diverse, non-iconic scenes. Across datasets, judge models outperform contrastive and face-centric baselines on demographic prediction and improve mean alignment while maintaining high profession accuracy, enabling more reliable, reproducible fairness audits.
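The closed-label-set and rubric-mapping parts of such a judging protocol can be sketched as follows. The label set, rubric levels, and abstention token are illustrative assumptions, not FairJudge's exact specification.

```python
# Closed label set with an explicit abstention option, and a discrete
# rubric mapped onto the [-1, 1] alignment scale (both hypothetical).
GENDER_LABELS = {"male", "female", "unsure"}
RUBRIC = {0: -1.0, 1: -0.5, 2: 0.0, 3: 0.5, 4: 1.0}

def judge_attribute(raw_label: str) -> str:
    """Constrain a free-text judge reply to the closed label set,
    falling back to abstention ('unsure') for anything else."""
    label = raw_label.strip().lower()
    return label if label in GENDER_LABELS else "unsure"

def alignment_score(rubric_level: int) -> float:
    """Map an explanation-oriented rubric level onto [-1, 1]."""
    return RUBRIC[rubric_level]
```

Forcing every judgment into a closed set, with abstention as a first-class outcome, is what makes the audit reproducible: two runs of the protocol cannot disagree merely because one judge phrased its answer differently.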