AITopics | Bhatia, Mehar

Collaborating Authors

Bhatia, Mehar

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Societal Alignment Frameworks Can Improve LLM Alignment

Stańczak, Karolina, Meade, Nicholas, Bhatia, Mehar, Zhou, Hattie, Böttinger, Konstantin, Barnes, Jeremy, Stanley, Jason, Montgomery, Jessica, Zemel, Richard, Papernot, Nicolas, Chapados, Nicolas, Therien, Denis, Lillicrap, Timothy P., Marasović, Ana, Delacroix, Sylvie, Hadfield, Gillian K., Reddy, Siva

arXiv.org Artificial IntelligenceFeb-27-2025

Recent progress in large language models (LLMs) has focused on producing responses that meet human expectations and align with shared values - a process coined alignment. However, aligning LLMs remains challenging due to the inherent disconnect between the complexity of human values and the narrow nature of the technological approaches designed to address them. Current alignment methods often lead to misspecified objectives, reflecting the broader issue of incomplete contracts, the impracticality of specifying a contract between a model developer, and the model that accounts for every scenario in LLM alignment. In this paper, we argue that improving LLM alignment requires incorporating insights from societal alignment frameworks, including social, economic, and contractual alignment, and discuss potential solutions drawn from these domains. Given the role of uncertainty within societal alignment frameworks, we then investigate how it manifests in LLM alignment. We end our discussion by offering an alternative view on LLM alignment, framing the underspecified nature of its objectives as an opportunity rather than perfect their specification. Beyond technical improvements in LLM alignment, we discuss the need for participatory alignment interface designs.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2503.00069

Country:

North America > United States (1.00)
Asia (0.67)
North America > Canada > Quebec > Montreal (0.14)
(2 more...)

Genre:

Research Report (0.71)
Overview (0.46)
Instructional Material (0.46)

Industry:

Law (1.00)
Health & Medicine (1.00)
Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs

Chiu, Yu Ying, Jiang, Liwei, Lin, Bill Yuchen, Park, Chan Young, Li, Shuyue Stella, Ravi, Sahithya, Bhatia, Mehar, Antoniak, Maria, Tsvetkov, Yulia, Shwartz, Vered, Choi, Yejin

arXiv.org Artificial IntelligenceOct-3-2024

To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs' cultural knowledge, covering 45 global regions including the underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions - each verified by five independent annotators - span 17 diverse topics ranging from food preferences to greeting etiquettes. We evaluate models on two setups: CulturalBench-Easy and CulturalBench-Hard which share the same questions but asked differently. We find that LLMs are sensitive to such difference in setups (e.g., GPT-4o with 27.3% difference). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs with the best performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to converge to a single answer. Our results also indicate that OpenAI GPT-4o substantially outperform other proprietary and open source models in questions related to all but one region (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.

benchmark, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2410.02677

Country:

South America (1.00)
North America (1.00)
Europe (1.00)
(2 more...)

Genre: Research Report > New Finding (0.87)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models

Bhatia, Mehar, Ravi, Sahithya, Chinchure, Aditya, Hwang, Eunjeong, Shwartz, Vered

arXiv.org Artificial IntelligenceJun-28-2024

Despite recent advancements in vision-language models, their performance remains suboptimal on images from non-western cultures due to underrepresentation in training datasets. Various benchmarks have been proposed to test models' cultural inclusivity, but they have limited coverage of cultures and do not adequately assess cultural diversity across universal as well as culture-specific local concepts. To address these limitations, we introduce the GlobalRG benchmark, comprising two challenging tasks: retrieval across universals and cultural visual grounding. The former task entails retrieving culturally diverse images for universal concepts from 50 countries, while the latter aims at grounding culture-specific concepts within images from 15 countries. Our evaluation across a wide range of models reveals that the performance varies significantly across cultures -- underscoring the necessity for enhancing multicultural understanding in vision-language models.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2407.00263

Country:

North America (1.00)
Europe (1.00)
Africa > Middle East (0.76)
Asia > Middle East > Qatar (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge

Chiu, Yu Ying, Jiang, Liwei, Antoniak, Maria, Park, Chan Young, Li, Shuyue Stella, Bhatia, Mehar, Ravi, Sahithya, Tsvetkov, Yulia, Shwartz, Vered, Choi, Yejin

arXiv.org Artificial IntelligenceApr-9-2024

Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. However, LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations primarily rely on expensive and restricted human annotations or potentially outdated internet resources. Thus, they struggle to capture the intricacy, dynamics, and diversity of cultural norms. LLM-generated benchmarks are promising, yet risk propagating the same biases they are meant to measure. To synergize the creativity and expert cultural knowledge of human annotators and the scalability and standardizability of LLM-based automation, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build truly challenging evaluation dataset for assessing the multicultural knowledge of LLMs, while improving annotators' capabilities and experiences. Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions, that modern LLMs fail at, in a gamified manner. Importantly, the increased level of AI assistance (e.g., LLM-generated revision hints) empowers users to create more difficult questions with enhanced perceived creativity of themselves, shedding light on the promises of involving heavier AI assistance in modern evaluation dataset creation procedures. Through a series of 1-hour workshop sessions, we gather CULTURALBENCH-V0.1, a compact yet high-quality evaluation dataset with users' red-teaming attempts, that different families of modern LLMs perform with accuracy ranging from 37.7% to 72.2%, revealing a notable gap in LLMs' multicultural proficiency.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2404.06664

Country:

Asia (0.29)
Europe (0.28)
North America > United States (0.28)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Industry: Education > Educational Setting > K-12 Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

GD-COMET: A Geo-Diverse Commonsense Inference Model

Bhatia, Mehar, Shwartz, Vered

arXiv.org Artificial IntelligenceOct-23-2023

With the increasing integration of AI into everyday life, it's becoming crucial to design AI systems that serve users from diverse backgrounds by making them culturally aware. In this paper, we present GD-COMET, a geo-diverse version of the COMET commonsense inference model. GD-COMET goes beyond Western commonsense knowledge and is capable of generating inferences pertaining to a broad range of cultures. We demonstrate the effectiveness of GD-COMET through a comprehensive human evaluation across 5 diverse cultures, as well as extrinsic evaluation on a geo-diverse task. The evaluation shows that GD-COMET captures and generates culturally nuanced commonsense knowledge, demonstrating its potential to benefit NLP applications across the board and contribute to making NLP more inclusive.

comet, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2310.15383

Country:

North America > United States (0.46)
North America > Canada (0.28)
Asia > Middle East > UAE (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.56)

Add feedback

Calling Out Bluff: Attacking the Robustness of Automatic Scoring Systems with Simple Adversarial Testing

Kumar, Yaman, Bhatia, Mehar, Kabra, Anubha, Li, Jessy Junyi, Jin, Di, Shah, Rajiv Ratn

arXiv.org Artificial IntelligenceJul-13-2020

A significant progress has been made in deep-learning based Automatic Essay Scoring (AES) systems in the past two decades. The performance commonly measured by the standard performance metrics like Quadratic Weighted Kappa (QWK), and accuracy points to the same. However, testing on common-sense adversarial examples of these AES systems reveal their lack of natural language understanding capability. Inspired by common student behaviour during examinations, we propose a task agnostic adversarial evaluation scheme for AES systems to test their natural language understanding capabilities and overall robustness.

deep learning, educational technology, perturbation, (23 more...)

arXiv.org Artificial Intelligence

2007.06796

Country: North America > United States (0.93)

Genre: Research Report > New Finding (0.46)

Industry:

Education > Assessment & Standards > Student Performance (0.90)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.47)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback