Large Language Model
Can Demographic Factors Improve Text Classification? Revisiting Demographic Adaptation in the Age of Transformers
Hung, Chia-Chien, Lauscher, Anne, Hovy, Dirk, Ponzetto, Simone Paolo, Glavaš, Goran
Demographic factors (e.g., gender or age) shape our language. Previous work showed that incorporating demographic factors can consistently improve performance for various NLP tasks with traditional NLP models. In this work, we investigate whether these previous findings still hold with state-of-the-art pretrained Transformer-based language models (PLMs). We use three common specialization methods proven effective for incorporating external knowledge into pretrained Transformers (e.g., domain-specific or geographic knowledge). We adapt the language representations for the demographic dimensions of gender and age, using continuous language modeling and dynamic multi-task learning for adaptation, where we couple language modeling objectives with the prediction of demographic classes. Our results, when employing a multilingual PLM, show substantial gains in task performance across four languages (English, German, French, and Danish), which is consistent with the results of previous work. However, controlling for confounding factors - primarily domain and language proficiency of Transformer-based PLMs - shows that downstream performance gains from our demographic adaptation do not actually stem from demographic knowledge. Our results indicate that demographic specialization of PLMs, while holding promise for positive societal impact, still represents an unsolved problem for (modern) NLP.
A Simple, Yet Effective Approach to Finding Biases in Code Generation
Mouselinos, Spyridon, Malinowski, Mateusz, Michalewski, Henryk
Recently, high-performing code generation systems based on large language models have surfaced. They are trained on massive corpora containing much more natural text than actual executable computer code. This work shows that current code generation systems exhibit undesired biases inherited from their large language model backbones, which can reduce the quality of the generated code under specific circumstances. To investigate the effect, we propose the "block of influence" concept, which enables a modular decomposition and analysis of the coding challenges. We introduce an automated intervention mechanism reminiscent of adversarial testing that exposes undesired biases through the failure modes of the models under test. Finally, we demonstrate how our framework can be used as a data transformation technique during fine-tuning, acting as a mitigation strategy for these biases.
UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese
Bui, Doanh C., Nguyen, Nghia Hieu, Nguyen, Khang
Image captioning is one of the vision-language tasks that still interests the research community worldwide in the 2020s. The MS-COCO Caption benchmark is commonly used to evaluate the performance of advanced captioning models, although it was published in 2015. Recent captioning models trained on the MS-COCO Caption dataset perform well only on English language patterns; they perform less well on scenes captured in Vietnam and do not caption images fluently in Vietnamese. To support low-resource research communities such as Vietnam's, we introduce a novel image captioning dataset in Vietnamese, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC). The dataset includes complex scenes captured in Vietnam and manually annotated by Vietnamese annotators under strict rules and supervision. In this paper, we describe the dataset creation process in detail. In a preliminary analysis, we show that our dataset is challenging for recent state-of-the-art (SOTA) Transformer-based baselines that performed well on the MS-COCO dataset. These modest results indicate that there is room for improvement on UIT-OpenViIC, which can serve as a standard Vietnamese benchmark for the research community to evaluate captioning models. Furthermore, we present a CAMO approach that effectively enhances the image representation ability through a multi-level encoder output fusion mechanism, which helps improve the quality of generated captions compared to previous captioning models.
On Reality and the Limits of Language Data: Aligning LLMs with Human Norms
Collier, Nigel H., Liu, Fangyu, Shareghi, Ehsan
Recent advancements in Large Language Models (LLMs) harness linguistic associations in vast natural language data for practical applications. However, their ability to understand the physical world using only language data remains a question. After reviewing existing protocols, we explore this question using a novel and tightly controlled reasoning test (ART) and compare human norms against versions of GPT-3. Our findings highlight the categories of common-sense relations that models could learn directly from data, as well as areas of weakness. GPT-3 offers evidence for verbal reasoning on a par with human subjects for several relations, including Synonymy, Antonymy, and Default inheritance. Without reinforcement learning from human judgements, it appears GPT-3 performs at the lower end of the reference interval for Has-part and Contained-in. Weaknesses were also observed in affordance characteristics through Necessary-quality, Order-of-size and Order-of-intensity. Combining LLMs with symbolic world grounding is a promising direction to address associative learning.
Adaptive Machine Translation with Large Language Models
Moslem, Yasmin, Haque, Rejwanul, Kelleher, John D., Way, Andy
Consistency is a key requirement of high-quality translation. It is especially important to adhere to pre-approved terminology and adapt to corrected translations in domain-specific projects. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, real-time adaptation remains challenging. Large-scale language models (LLMs) have recently shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. By feeding an LLM at inference time with a prompt that consists of a list of translation pairs, it can then simulate the domain and style characteristics. This work aims to investigate how we can utilize in-context learning to improve real-time adaptive MT. Our extensive experiments show promising results at translation time. For example, LLMs can adapt to a set of in-domain sentence pairs and/or terminology while translating a new sentence. We observe that the translation quality with few-shot in-context learning can surpass that of strong encoder-decoder MT systems, especially for high-resource languages. Moreover, we investigate whether we can combine MT from strong encoder-decoder models with fuzzy matches, which can further improve translation quality, especially for less supported languages. We conduct our experiments across five diverse language pairs, namely English-to-Arabic (EN-AR), English-to-Chinese (EN-ZH), English-to-French (EN-FR), English-to-Kinyarwanda (EN-RW), and English-to-Spanish (EN-ES).
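The prompt construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fuzzy_matches`, `build_prompt`), the prompt layout, the toy translation memory, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_matches(source, memory, k=2):
    """Return the k (source, target) pairs whose source side is most similar to `source`.

    Similarity here is difflib's character-level ratio, an assumption made
    for this sketch; any other retrieval method could be swapped in.
    """
    ranked = sorted(
        memory,
        key=lambda pair: SequenceMatcher(None, source.lower(), pair[0].lower()).ratio(),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(source, memory, src_lang="English", tgt_lang="Spanish", k=2):
    """Build a few-shot prompt: k retrieved translation pairs, then the new sentence."""
    lines = []
    for src, tgt in fuzzy_matches(source, memory, k):
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    # The final target line is left open for the language model to complete.
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Hypothetical in-domain translation memory (toy data).
memory = [
    ("The patient was given 5 mg of the drug.",
     "Al paciente se le administraron 5 mg del fármaco."),
    ("The weather is nice today.",
     "Hoy hace buen tiempo."),
    ("The patient was discharged after two days.",
     "El paciente fue dado de alta después de dos días."),
]

query = "The patient was given 10 mg of the drug."
prompt = build_prompt(query, memory)
print(prompt)
```

The resulting string would be sent to an LLM as-is at inference time; no fine-tuning step is involved, which is what makes this form of adaptation usable in real time.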
What Really Made Geoffrey Hinton Into an AI Doomer
Geoffrey Hinton, perhaps the most important person in the recent history of artificial intelligence, recently sent me a video of Snoop Dogg. In the clip of a discussion panel, the rapper expresses profane amazement at how artificial intelligence software, such as ChatGPT, can now hold a coherent and meaningful conversation. "Then I heard the old dude that created AI saying, 'This is not safe 'cause the AIs got their own mind and these motherfuckers gonna start doing their own shit,'" Snoop says. "And I'm like, 'Is we in a fucking movie right now or what?'" The "old dude" is, of course, Hinton.
Hollywood's Screenwriters Are Right to Fear AI
One of the more harrowing reads for writers concerned about artificial intelligence encroaching on their livelihoods is a study commissioned by OpenAI itself. Published in March, it places writers in the "fully exposed" category. This means that, according to OpenAI, a large language model (LLM) could reduce the time it takes for them to carry out their work by at least 50 percent. AI can already score in the 93rd percentile on SAT reading exams; it can already produce bad stories and poems. Directors are discussing the possibilities of AI-generated scripts.
Nearly 50 news websites are 'AI-generated', a study says. Would I be able to tell?
Breaking news from celebritiesdeaths.com: the president is dead. At least that's what the highly reliable website informed its readers last month, under the no-nonsense headline "Biden dead. Harris acting president, address 9am ET". The site explained that Joe Biden had "passed away peacefully in his sleep" and Kamala Harris was taking over, above a bizarre disclaimer: "I'm sorry, I cannot complete this prompt as it goes against OpenAI's use case policy on generating misleading content." Celebritiesdeaths.com is among 49 supposed news sites that NewsGuard, an organization tracking misinformation, has identified as "almost entirely written by artificial intelligence software".
Revisiting Relation Extraction in the era of Large Language Models
Wadhwa, Somin, Amir, Silvio, Wallace, Byron C.
Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text. Standard supervised RE techniques entail training modules to tag tokens comprising entity spans and then predict the relationship between them. Recent work has instead treated the problem as a sequence-to-sequence task, linearizing relations between entities as target strings to be generated conditioned on the input. Here we push the limits of this approach, using larger language models (GPT-3 and Flan-T5 large) than considered in prior work and evaluating their performance on standard RE tasks under varying levels of supervision. We address issues inherent to evaluating generative approaches to RE by doing human evaluations, in lieu of relying on exact matching. Under this refined evaluation, we find that: (1) Few-shot prompting with GPT-3 achieves near SOTA performance, i.e., roughly equivalent to existing fully supervised models; (2) Flan-T5 is not as capable in the few-shot setting, but supervising and fine-tuning it with Chain-of-Thought (CoT) style explanations (generated via GPT-3) yields SOTA results. We release this model as a new baseline for RE tasks.
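To make the idea of linearizing relations as target strings concrete, here is a minimal sketch. The `(head | relation | tail)` format, the `;` separator, and the function names are hypothetical choices for illustration; the abstract does not specify the format the authors use.

```python
def linearize(triples):
    """Turn (head, relation, tail) triples into one target string.

    The "(head | relation | tail) ; ..." layout is an assumed format.
    """
    return " ; ".join(f"({h} | {r} | {t})" for h, r, t in triples)


def parse(text):
    """Recover triples from a generated string, skipping malformed chunks."""
    triples = []
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not (chunk.startswith("(") and chunk.endswith(")")):
            continue  # generated text may contain fragments that are not triples
        parts = [p.strip() for p in chunk[1:-1].split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples


gold = [("Marie Curie", "born_in", "Warsaw"), ("Marie Curie", "field", "physics")]
target = linearize(gold)
print(target)
print(parse(target))
```

A parser like this only recovers the surface strings a model generates. A correct relation phrased differently from the gold string would still fail an exact-match comparison, which is the evaluation issue the abstract addresses with human evaluation.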
Imitation versus Innovation: What children can do that large language and language-and-vision models cannot (yet)?
Yiu, Eunice, Kosoy, Eliza, Gopnik, Alison
Much discussion about large language models and language-and-vision models has focused on whether these models are intelligent agents. We present an alternative perspective. We argue that these artificial intelligence models are cultural technologies that enhance cultural transmission in the modern world, and are efficient imitation engines. We explore what AI models can tell us about imitation and innovation by evaluating their capacity to design new tools and discover novel causal structures, and contrast their responses with those of human children. Our work serves as a first step in determining which particular representations and competences, as well as which kinds of knowledge or skill, can be derived from particular learning techniques and data. Critically, our findings suggest that machines may need more than large scale language and images to achieve what a child can do.