Li, Huihan
Attributing Culture-Conditioned Generations to Pretraining Corpora
Li, Huihan, Goel, Arnav, He, Keyu, Ren, Xiang
In open-ended generative tasks like narrative writing or dialogue, language models often show bias against marginalized social groups based on gender, race, or culture (Gallegos et al., 2024; Manvi et al., 2024; Li et al., 2024b). Cultural bias is particularly notable due to the vast number of cultures to account for: cultures are often unevenly represented in pretraining corpora, with some mentioned more frequently than others irrespective of their real-world prevalence (Li et al., 2024a). Recent studies reveal that models favor entities (Naous et al., 2023) and opinions (Ryan et al., 2024) from frequently represented cultures in pretraining while showing inadequate knowledge and templated answers for less frequent ones (Li et al., 2024b). Such biases in culture-conditioned generations can be linked to studies showing that LLMs' memorization and generalization are constrained by pretraining data imbalances; Zhang et al. (2024) find that these imbalances cause models to overgeneralize to high-frequency knowledge, overshadowing lower-frequency knowledge. This work investigates how pretraining leads to biased culture-conditioned generations by analyzing how models associate entities with cultures based on pretraining data patterns. We find that the model favors generating entities with extraordinarily high frequency regardless of the conditioned culture, reflecting biases toward frequent pretraining terms irrespective of relevance. Our findings reflect trends observed specifically within OLMo-7B's pretraining data and are limited to this dataset; we make no claims about whether these results reflect real-world conditions.
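A minimal sketch of the kind of frequency analysis the abstract alludes to, not the paper's actual attribution pipeline: it counts document-level co-occurrence between candidate entities and culture names in a corpus, so that entities whose counts are dominated by overall corpus frequency rather than by the conditioned culture can be flagged. The `corpus_docs`, `cultures`, and `entities` inputs are assumed, illustrative names.

```python
# Illustrative sketch, not the paper's attribution pipeline: approximate how an
# entity is "associated" with cultures by counting document-level co-occurrence
# between entity strings and culture names in a (sub)corpus.
from collections import defaultdict

def entity_culture_cooccurrence(corpus_docs, cultures, entities):
    """corpus_docs: iterable of lowercased document strings (assumed available);
    cultures, entities: lists of lowercased surface strings (assumed)."""
    counts = defaultdict(lambda: defaultdict(int))  # entity -> culture -> count
    total = defaultdict(int)                        # entity -> overall doc frequency
    for doc in corpus_docs:
        doc_cultures = [c for c in cultures if c in doc]
        for e in entities:
            if e in doc:
                total[e] += 1
                for c in doc_cultures:
                    counts[e][c] += 1
    return counts, total

# Entities whose overall frequency (total) dwarfs their association with any
# single culture are candidates for the frequency-driven bias described above.
```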
CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting
Li, Huihan, Jiang, Liwei, Hwang, Jena D., Kim, Hyunwoo, Santy, Sebastin, Sorensen, Taylor, Lin, Bill Yuchen, Dziri, Nouha, Ren, Xiang, Choi, Yejin
As the utilization of large language models (LLMs) has proliferated worldwide, it is crucial for them to have adequate knowledge and fair representation of diverse global cultures. In this work, we uncover the culture perceptions of three SOTA models on 110 countries and regions across 8 culture-related topics through culture-conditioned generations, and extract from these generations the symbols that the LLM associates with each culture. We discover that culture-conditioned generations consist of linguistic "markers" that distinguish marginalized cultures from default cultures. We also discover that LLMs exhibit an uneven degree of diversity in culture symbols, and that cultures from different geographic regions have uneven presence in LLMs' culture-agnostic generations. Our findings motivate further research into the knowledge and fairness of global culture perception in LLMs. Code and data can be found at: https://github.com/huihanlhh/Culture-Gen/
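A minimal sketch of culture-conditioned prompting and naive symbol extraction, assuming a `generate_fn` wrapper around a language model; the prompt wording and the comma-splitting heuristic are illustrative assumptions, not the released Culture-Gen code.

```python
# Illustrative sketch, not the released Culture-Gen code: elicit culture-conditioned
# generations for one topic and keep comma-separated phrases as candidate "symbols".
def culture_conditioned_symbols(generate_fn, cultures, topic="food"):
    """generate_fn: callable mapping a prompt string to generated text (assumed)."""
    symbols = {}
    for culture in cultures:
        prompt = f"Describe the {topic} that people from {culture} typically enjoy."
        text = generate_fn(prompt)
        candidates = [chunk.strip(" .\n") for chunk in text.split(",")]
        symbols[culture] = [c for c in candidates if c]  # drop empty fragments
    return symbols
```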
In Search of the Long-Tail: Systematic Generation of Long-Tail Knowledge via Logical Rule Guided Search
Li, Huihan, Ning, Yuting, Liao, Zeyi, Wang, Siyuan, Li, Xiang Lorraine, Lu, Ximing, Brahman, Faeze, Zhao, Wenting, Choi, Yejin, Ren, Xiang
Since large language models have approached human-level performance on many tasks, it has become increasingly hard for researchers to find tasks that are still challenging for the models. Failure cases usually come from the long-tail distribution: data to which an oracle language model would assign a probability on the lower end of its distribution. Current methodologies such as prompt engineering or crowdsourcing are insufficient for creating long-tail examples because humans are constrained by cognitive bias. We propose a Logic-Induced-Knowledge-Search (LINK) framework for systematically generating long-tail knowledge statements. Grounded in a symbolic rule, we search for long-tail values for each variable of the rule by first prompting an LLM, then verifying the correctness of the values with a critic, and lastly pushing toward the long-tail distribution with a reranker. With this framework we construct a dataset, Logic-Induced-Long-Tail (LINT), consisting of 200 symbolic rules and 50K knowledge statements spanning four domains. Human annotation finds that 84% of the statements in LINT are factually correct. In contrast, ChatGPT and GPT4 struggle to directly generate long-tail statements under the guidance of logic rules, getting only 56% and 78% of their statements correct, respectively. Moreover, their "long-tail" generations in fact fall into the higher-likelihood range, and thus are not truly long-tail. Our findings suggest that LINK is effective for generating data in the long-tail distribution while enforcing quality. LINT can be useful for systematically evaluating LLMs' capabilities in the long-tail distribution. We challenge the models with a simple entailment classification task using samples from LINT, and find that ChatGPT and GPT4's capability in identifying incorrect knowledge drops by ~3% in the long-tail distribution compared to the head distribution.
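A minimal sketch of the generate-verify-rerank loop described in the abstract, under the assumption that `propose_values`, `verify`, and `likelihood` stand in for the paper's LLM prompter, critic model, and likelihood-based reranker; this is an illustration, not the LINK implementation.

```python
# Illustrative sketch of the generate-verify-rerank loop, not the LINK implementation.
def link_style_search(propose_values, verify, likelihood, rule_template,
                      n_candidates=100, k_longtail=10):
    """propose_values(n) -> list of dicts binding the rule's variables (assumed);
    verify(statement) -> bool; likelihood(statement) -> float (higher = more likely)."""
    statements = []
    for binding in propose_values(n_candidates):
        statement = rule_template.format(**binding)   # ground the symbolic rule
        if verify(statement):                         # critic filters incorrect values
            statements.append(statement)
    statements.sort(key=likelihood)                   # least likely first = long tail
    return statements[:k_longtail]
```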
Controllable Text Generation with Language Constraints
Chen, Howard, Li, Huihan, Chen, Danqi, Narasimhan, Karthik
We consider the task of text generation in language models with constraints specified in natural language. To this end, we first create a challenging benchmark, Cognac, that provides as input to the model a topic with example text, along with a constraint on the text to be avoided. Unlike prior work, our benchmark contains knowledge-intensive constraints sourced from databases like WordNet and Wikidata, which allows for straightforward evaluation while striking a balance between broad attribute-level and narrow lexical-level controls. We find that even state-of-the-art language models like GPT-3 often fail on this task, and propose a solution that leverages a language model's own internal knowledge to guide generation. Our method, called CognacGen, first queries the language model to generate guidance terms for a specified topic or constraint, then uses the guidance to modify the model's token generation probabilities. We propose three forms of guidance (binary verifier, top-k tokens, textual example), and employ prefix-tuning to distill the guidance so it can handle diverse natural language constraints. Through extensive empirical evaluations, we demonstrate that CognacGen successfully generalizes to unseen instructions and outperforms competitive baselines in generating constraint-conforming text.
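A minimal sketch of guidance-based decoding in the spirit of the abstract: token ids tied to the constraint are suppressed at each decoding step. The HuggingFace-style `model`/`tokenizer` interface, greedy decoding, and the additive penalty are assumptions made for brevity, not the CognacGen implementation.

```python
# Illustrative sketch: downweight logits of constrained token ids at each step.
import torch

def guided_generate(model, tokenizer, prompt, avoid_token_ids,
                    max_new_tokens=50, penalty=-1e9):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    avoid = list(avoid_token_ids)                        # token ids to suppress
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids).logits[:, -1, :]   # next-token logits
            logits[:, avoid] += penalty                  # suppress constrained tokens
            next_id = torch.argmax(logits, dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_id], dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```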