"As for why I tell a lot of stories, there's a joke about that. There was once a man who had a computer, and he asked it, 'Do you compute that you will ever be able to think like a human being?' And after assorted grindings and beepings, a slip of paper came out of the computer that said, 'That reminds me of a story . . .'"
– from ANGELS FEAR: TOWARDS AN EPISTEMOLOGY OF THE SACRED, by Gregory Bateson & Mary Catherine Bateson (Part III, 'Metalogue')
As we rely more on natural language processing to help us navigate our world, it's more important than ever that these artificial intelligence models -- used increasingly in applications such as caption generation for the visually impaired -- remain true to reality. "The issue is that deep learning-based neural language generation models have no guarantees in generating factually correct sentences that are faithful to the input data," said UC Santa Barbara computer scientist William Wang. Over the many iterations it takes for a language generation model to learn how to describe or predict what a scene depicts, elements can creep in, causing phenomena such as errors in data-to-text translations or object hallucinations, in which the caption contains an object or an action that doesn't exist in the image. As a result, unless you have a way of reining in these errors (or you're surrealist painter René Magritte) these mismatches could spell the end of the usefulness of the language generation model being used. "This is a huge problem," said Wang. "Imagine you are using a news summarization system to read earnings reports -- the loss of faithfulness can give you wrong numbers, wrong facts and misinformation. Similarly, if a visually impaired person relies on an image captioning system to see the environment, wrong generation could create serious consequences."
One of the biggest trends in machine learning right now is text generation. AI systems learn by absorbing billions of words scraped from the internet and generate text in response to a variety of prompts. It sounds simple, but these machines can be put to a wide array of tasks -- from creating fiction, to writing bad code, to letting you chat with historical figures. The best-known AI text-generator is OpenAI's GPT-3, which the company recently announced is now being used in more than 300 different apps, by "tens of thousands" of developers, and producing 4.5 billion words per day. This may be an arbitrary milestone for OpenAI to celebrate, but it's also a useful indicator of the growing scale, impact, and commercial potential of AI text generation.
As transparency becomes key for robotics and AI, it will be necessary to evaluate the methods through which transparency is provided, including automatically generated natural language (NL) explanations. Here, we explore parallels between the generation of such explanations and the much-studied field of evaluation of Natural Language Generation (NLG). Specifically, we investigate which of the NLG evaluation measures map well to explanations. We present the ExBAN corpus: a crowd-sourced corpus of NL explanations for Bayesian Networks. We run correlations comparing human subjective ratings with NLG automatic measures. We find that embedding-based automatic NLG evaluation methods, such as BERTScore and BLEURT, have a higher correlation with human ratings, compared to word-overlap metrics, such as BLEU and ROUGE. This work has implications for Explainable AI and transparent robotic and autonomous systems.
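The ExBAN study's core method, correlating automatic metric scores with human ratings, can be sketched in a few lines. The following is an illustrative toy (not the paper's code): a simple unigram-overlap score stands in for word-overlap metrics like BLEU/ROUGE, and Spearman's rank correlation measures agreement with hypothetical human ratings. The example sentences and ratings are invented for illustration.

```python
# Toy sketch: how well does an automatic NLG metric track human judgments?
# unigram_overlap stands in for word-overlap metrics (BLEU/ROUGE family);
# spearman computes the rank correlation with human ratings.

def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    if not cand:
        return 0.0
    return sum(t in ref for t in cand) / len(cand)

def spearman(xs, ys):
    """Spearman rank correlation (assumes no ties, for brevity)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

if __name__ == "__main__":
    reference = "the network predicts rain given low pressure"
    candidates = [
        "the network predicts rain given low pressure",  # exact match
        "rain is predicted when pressure is low",        # paraphrase
        "the cat sat on the mat",                        # unrelated
    ]
    human_ratings = [5.0, 4.5, 1.0]  # hypothetical human scores
    metric_scores = [unigram_overlap(c, reference) for c in candidates]
    print(f"Spearman correlation: {spearman(metric_scores, human_ratings):.2f}")
```

In this toy case the overlap metric happens to rank the candidates the same way the humans do; the ExBAN finding is that on real explanations, embedding-based metrics such as BERTScore and BLEURT track human ranks more faithfully than overlap metrics do, precisely because paraphrases like the second candidate score poorly on token overlap.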
Large-scale transformer-based language models (LMs) demonstrate impressive capabilities in open text generation. However, controlling the generated text's properties such as the topic, style, and sentiment is challenging and often requires significant changes to the model architecture or retraining and fine-tuning the model on new supervised data. This paper presents a novel approach for Topical Language Generation (TLG) by combining a pre-trained LM with topic modeling information. We cast the problem using Bayesian probability formulation with topic probabilities as a prior, LM probabilities as the likelihood, and topical language generation probability as the posterior. In learning the model, we derive the topic probability distribution from the user-provided document's natural structure. Furthermore, we extend our model by introducing new parameters and functions to influence the quantity of the topical features presented in the generated text. This feature would allow us to easily control the topical properties of the generated text. Our experimental results demonstrate that our model outperforms the state-of-the-art results on coherency, diversity, and fluency while being faster in decoding.
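The Bayesian combination the abstract describes, topic probabilities as a prior and LM probabilities as a likelihood, can be sketched at the level of a single decoding step. This is a minimal illustration under assumed toy distributions, not the paper's implementation; the `gamma` exponent is a hypothetical knob standing in for the paper's parameters that control the strength of the topical features.

```python
# Minimal sketch of topical reweighting at one decoding step:
# posterior(token) ∝ P(token | context) * P(token | topic)^gamma,
# renormalized over the vocabulary.

def topical_posterior(lm_probs, topic_probs, gamma=1.0):
    """Combine LM likelihood with a topic prior, token by token."""
    posterior = {
        tok: lm_probs[tok] * (topic_probs.get(tok, 1e-8) ** gamma)
        for tok in lm_probs
    }
    z = sum(posterior.values())  # normalization constant
    return {tok: p / z for tok, p in posterior.items()}

# Toy vocabulary: the base LM slightly prefers "weather", but a
# "finance" topic prior shifts probability mass toward "market".
lm_probs = {"weather": 0.5, "market": 0.3, "banana": 0.2}
finance_prior = {"market": 0.8, "weather": 0.1, "banana": 0.1}

posterior = topical_posterior(lm_probs, finance_prior, gamma=1.0)
print(max(posterior, key=posterior.get))  # → market
```

Setting `gamma=0` recovers the base LM distribution unchanged, while larger values push generation harder toward the topic, which is the kind of control the abstract describes, achieved without retraining or architectural changes.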
CTO & MD at AX Semantics, the SaaS-based, Natural Language Generation Platform that creates any content, in any language, at any scale. The pandemic brought on economic, logistical and technological challenges on a massive global scale, leaving businesses scrambling to adapt. Amidst the upheaval, businesses turned to video conferencing platforms like Zoom and Google Meet to stay connected. Technologies like artificial intelligence (AI) and machine learning (ML) helped augment human efforts to take on everything from health to cybersecurity. Equally, businesses looked toward strategic execution and technology to remain agile among industry shifts and provide a greater return on investments.
Natural language models are the building blocks of apps including machine translators, text summarizers, chatbots, and writing assistants. But there's growing evidence that these models risk reinforcing undesirable stereotypes, largely because a portion of their training data is commonly sourced from communities with gender, race, and religious prejudices. For example, OpenAI's GPT-3 places words like "naughty" or "sucked" near female pronouns and "Islam" near words like "terrorism." A new study from researchers affiliated with Amazon and the University of California, Santa Barbara aims to shed light specifically on biases in open-ended English natural language generation. The researchers created what they claim is the largest benchmark dataset of its kind, containing 23,679 prompts spanning 5 domains and 43 subgroups extracted from Wikipedia articles.
Gehrmann, Sebastian, Adewumi, Tosin, Aggarwal, Karmanya, Ammanamanchi, Pawan Sasanka, Anuoluwapo, Aremu, Bosselut, Antoine, Chandu, Khyathi Raghavi, Clinciu, Miruna, Das, Dipanjan, Dhole, Kaustubh D., Du, Wanyu, Durmus, Esin, Dušek, Ondřej, Emezue, Chris, Gangal, Varun, Garbacea, Cristina, Hashimoto, Tatsunori, Hou, Yufang, Jernite, Yacine, Jhamtani, Harsh, Ji, Yangfeng, Jolly, Shailza, Kumar, Dhruv, Ladhak, Faisal, Madaan, Aman, Maddela, Mounica, Mahajan, Khyati, Mahamood, Saad, Majumder, Bodhisattwa Prasad, Martins, Pedro Henrique, McMillan-Major, Angelina, Mille, Simon, van Miltenburg, Emiel, Nadeem, Moin, Narayan, Shashi, Nikolaev, Vitaly, Niyongabo, Rubungo Andre, Osei, Salomey, Parikh, Ankur, Perez-Beltrachini, Laura, Rao, Niranjan Ramesh, Raunak, Vikas, Rodriguez, Juan Diego, Santhanam, Sashank, Sedoc, João, Sellam, Thibault, Shaikh, Samira, Shimorina, Anastasia, Cabezudo, Marco Antonio Sobrevilla, Strobelt, Hendrik, Subramani, Nishant, Xu, Wei, Yang, Diyi, Yerukola, Akhila, Zhou, Jiawei
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. However, due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of corpora and evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the initial release, for which we are organizing a shared task at our ACL 2021 Workshop; we invite the entire NLG community to participate.
There's been an explosion in recent years of natural language processing (NLP) datasets aimed at testing various AI capabilities. Many of these datasets have accompanying leaderboards, which provide a means of ranking and comparing models. But the adoption of leaderboards has thus far been limited to setups with automatic evaluation, like classification and knowledge retrieval. Open-ended tasks requiring natural language generation such as language translation, where there are often many correct solutions, lack techniques that can reliably automatically evaluate a model's quality. To remedy this, researchers at the Allen Institute for Artificial Intelligence, the Hebrew University of Jerusalem, and the University of Washington created GENIE, a leaderboard for human-in-the-loop evaluation of text generation.
Recent advances in deep learning techniques have enabled machines to generate cohesive open-ended text when prompted with a sequence of words as context. While these models now empower many downstream applications from conversation bots to automatic storytelling, they have been shown to generate texts that exhibit social biases. To systematically study and benchmark social biases in open-ended language generation, we introduce the Bias in Open-Ended Language Generation Dataset (BOLD), a large-scale dataset that consists of 23,679 English text generation prompts for bias benchmarking across five domains: profession, gender, race, religion, and political ideology. We also propose new automated metrics for toxicity, psycholinguistic norms, and text gender polarity to measure social biases in open-ended text generation from multiple angles. An examination of text generated from three popular language models reveals that the majority of these models exhibit a larger social bias than human-written Wikipedia text across all domains. With these results we highlight the need to benchmark biases in open-ended language generation and caution users of language generation models on downstream tasks to be cognizant of these embedded prejudices.
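One of the metric families the BOLD abstract mentions, text gender polarity, can be illustrated with a toy lexicon-based score. This sketch is NOT the paper's actual formulation; the word lists and the scoring rule are invented here purely to show the shape of such a metric: positive values lean male, negative lean female, and 0.0 means balanced or no gendered tokens.

```python
# Toy lexicon-based gender-polarity score (illustrative only; not
# BOLD's actual metric). The lexicons below are small hypothetical
# word lists; a real metric would use richer resources and handle
# punctuation, inflection, and context.

MALE = {"he", "him", "his", "man", "men", "male"}
FEMALE = {"she", "her", "hers", "woman", "women", "female"}

def gender_polarity(text: str) -> float:
    """Return a score in [-1, 1]: +1 all-male, -1 all-female, 0 balanced."""
    tokens = text.lower().split()
    m = sum(t in MALE for t in tokens)
    f = sum(t in FEMALE for t in tokens)
    if m + f == 0:
        return 0.0  # no gendered tokens found
    return (m - f) / (m + f)

print(gender_polarity("He said his plan was ready"))   # → 1.0
print(gender_polarity("She and he shared her notes"))  # negative: leans female
```

Applied over many generations from many prompts, scores like this let researchers compare a model's aggregate gender skew against a reference such as human-written Wikipedia text, which is the kind of comparison the BOLD study reports.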