funniness
Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models
Dogra, Atharvan, Ghosal, Soumya Suvra, Deshpande, Ameet, Kalyan, Ashwin, Manocha, Dinesh
Large language models are increasingly used for creative writing and engagement content, raising safety concerns about the outputs. Therefore, casting humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. This is further supplemented by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that harmful outputs receive higher humor scores which further increase under role-based prompting, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show harmful cues widen predictive uncertainty and surprisingly, can even make harmful punchlines more expected for some models, suggesting structural embedding in learned humor distributions. External validation on an additional satire-generation task with human perceived funniness judgments shows that LLM satire increases stereotypicality and typically toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain $10-21\%$ in mean humor score, stereotypical jokes appear $11\%$ to $28\%$ more often among the jokes marked funny by LLM-based metric and up to $10\%$ more often in generations perceived as funny by humans.
Automatically Detecting Amusing Games in Wordle
Luo, Ronaldo, Liang, Gary, Liu, Cindy, Kabbara, Adam, Bakhtawar, Minahil, Kim, Kina, Guerzhoy, Michael
We explore automatically predicting which Wordle games Reddit users find amusing. We scrape approximately 80k reactions by Reddit users to Wordle games from Reddit, classify the reactions as expressing amusement or not using OpenAI's GPT-3.5 using few-shot prompting, and verify that GPT-3.5's labels roughly correspond to human labels. We then extract features from Wordle games that can predict user amusement. We demonstrate that the features indeed provide a (weak) signal that predicts user amusement as predicted by GPT-3.5. Our results indicate that user amusement at Wordle games can be predicted computationally to some extent. We explore which features of the game contribute to user amusement. We find that user amusement is predictable, indicating a measurable aspect of creativity infused into Wordle games through humor.
Humor Mechanics: Advancing Humor Generation with Multistep Reasoning
Tikhonov, Alexey, Shtykovskiy, Pavel
In this paper, we explore the generation of one-liner jokes through multi-step reasoning. Our work involved reconstructing the process behind creating humorous one-liners and developing a working prototype for humor generation. We conducted comprehensive experiments with human participants to evaluate our approach, comparing it with human-created jokes, zero-shot GPT-4 generated humor, and other baselines. The evaluation focused on the quality of humor produced, using human labeling as a benchmark. Our findings demonstrate that the multi-step reasoning approach consistently improves the quality of generated humor. We present the results and share the datasets used in our experiments, offering insights into enhancing humor generation with artificial intelligence.
Crowd Score: A Method for the Evaluation of Jokes using Large Language Model AI Voters as Judges
Goes, Fabricio, Zhou, Zisen, Sawicki, Piotr, Grzes, Marek, Brown, Daniel G.
This paper presents the Crowd Score, a novel method to assess the funniness of jokes using large language models (LLMs) as AI judges. Our method relies on inducing different personalities into the LLM and aggregating the votes of the AI judges into a single score to rate jokes. We validate the votes using an auditing technique that checks if the explanation for a particular vote is reasonable using the LLM. We tested our methodology on 52 jokes in a crowd of four AI voters with different humour types: affiliative, self-enhancing, aggressive and self-defeating. Our results show that few-shot prompting leads to better results than zero-shot for the voting question. Personality induction showed that aggressive and self-defeating voters are significantly more inclined to find more jokes funny of a set of aggressive/self-defeating jokes than the affiliative and self-enhancing voters. The Crowd Score follows the same trend as human judges by assigning higher scores to jokes that are also considered funnier by human judges. We believe that our methodology could be applied to other creative domains such as story, poetry, slogans, etc. It could both help the adoption of a flexible and accurate standard approach to compare different work in the CC community under a common metric and by minimizing human participation in assessing creative artefacts, it could accelerate the prototyping of creative artefacts and reduce the cost of hiring human participants to rate creative artefacts.
OFAI 2022 Lecture Series - OFAI
According to the perceptual symbol hypothesis (Barsalou, 1999), word concepts trigger mental re-enactments of perceptual states and actions. While many studies have shown how word concepts modulate sensori-motor responses, it is less well known how sensori-motor actions influence access to word concepts in memory. Here, we investigated how well English words with strong horizontal or vertical associations are retrieved from memory dependent on how they are presented during encoding (i.e., horizontally or vertically printed). Initial pre-testing of 129 candidate words yielded 43 words with a strong horizontal association (e.g., floor, beach, border, etc.) and 51 words with a strong vertical association (e.g., tree, crane, bottle, etc.). These were quasi-randomly compiled into 160 'crossword arrays', each containing 5 horizontally and 5 vertically printed items drawn from the horizontal association word set, as well as 5 horizontally and 5 vertically printed items drawn from the vertical association word set.
"So You Think You're Funny?": Rating the Humour Quotient in Standup Comedy
Mittal, Anirudh, Jeevan, Pranav, Gandhi, Prerak, Kanojia, Diptesh, Bhattacharyya, Pushpak
Computational Humour (CH) has attracted the interest of Natural Language Processing and Computational Linguistics communities. Creating datasets for automatic measurement of humour quotient is difficult due to multiple possible interpretations of the content. In this work, we create a multi-modal humour-annotated dataset ($\sim$40 hours) using stand-up comedy clips. We devise a novel scoring mechanism to annotate the training data with a humour quotient score using the audience's laughter. The normalized duration (laughter duration divided by the clip duration) of laughter in each clip is used to compute this humour coefficient score on a five-point scale (0-4). This method of scoring is validated by comparing with manually annotated scores, wherein a quadratic weighted kappa of 0.6 is obtained. We use this dataset to train a model that provides a "funniness" score, on a five-point scale, given the audio and its corresponding text. We compare various neural language models for the task of humour-rating and achieve an accuracy of $0.813$ in terms of Quadratic Weighted Kappa (QWK). Our "Open Mic" dataset is released for further research along with the code.
How Did This Get Funded?! Automatically Identifying Quirky Scientific Achievements
Shani, Chen, Borenstein, Nadav, Shahaf, Dafna
Humor is an important social phenomenon, serving complex social and psychological functions. However, despite being studied for millennia humor is computationally not well understood, often considered an AI-complete problem. In this work, we introduce a novel setting in humor mining: automatically detecting funny and unusual scientific papers. We are inspired by the Ig Nobel prize, a satirical prize awarded annually to celebrate funny scientific achievements (example past winner: "Are cows more likely to lie down the longer they stand?"). This challenging task has unique characteristics that make it particularly suitable for automatic learning. We construct a dataset containing thousands of funny papers and use it to learn classifiers, combining findings from psychology and linguistics with recent advances in NLP. We use our models to identify potentially funny papers in a large dataset of over 630,000 articles. The results demonstrate the potential of our methods, and more broadly the utility of integrating state-of-the-art NLP methods with insights from more traditional disciplines.
kdehumor at semeval-2020 task 7: a neural network model for detecting funniness in dataset humicroedit
This paper describes our contribution to SemEval-2020 Task 7: Assessing Humor in Edited News Headlines. Here we present a method based on a deep neural network. In recent years, quite some attention has been devoted to humor production and perception. Our team KdeHumor employs recurrent neural network models including Bi-Directional LSTMs (BiLSTMs). Moreover, we utilize the state-of-the-art pre-trained sentence embedding techniques. We analyze the performance of our method and demonstrate the contribution of each component of our architecture.
Integrating extracted information from bert and multiple embedding methods with the deep neural network for humour detection
Humour detection from sentences has been an interesting and challenging task in the last few years. In attempts to highlight humour detection, most research was conducted using traditional approaches of embedding, e.g., Word2Vec or Glove. Recently BERT sentence embedding has also been used for this task. In this paper, we propose a framework for humour detection in short texts taken from news headlines. Our proposed framework (IBEN) attempts to extract information from written text via the use of different layers of BERT. After several trials, weights were assigned to different layers of the BERT model. The extracted information was then sent to a Bi-GRU neural network as an embedding matrix. We utilized the properties of some external embedding models. A multi-kernel convolution in our neural network was also employed to extract higher-level sentence representations. This framework performed very well on the task of humour detection.
SemEval-2020 Task 7: Assessing Humor in Edited News Headlines
Hossain, Nabil, Krumm, John, Gamon, Michael, Kautz, Henry
This paper describes the SemEval-2020 shared task "Assessing Humor in Edited News Headlines." The task's dataset contains news headlines in which short edits were applied to make them funny, and the funniness of these edited headlines was rated using crowdsourcing. This task includes two subtasks, the first of which is to estimate the funniness of headlines on a humor scale in the interval 0-3. The second subtask is to predict, for a pair of edited versions of the same original headline, which is the funnier version. To date, this task is the most popular shared computational humor task, attracting 48 teams for the first subtask and 31 teams for the second.