DebiasPI: Inference-time Debiasing by Prompt Iteration of a Text-to-Image Generative Model
Bonna, Sarah, Huang, Yu-Cheng, Novozhilova, Ekaterina, Paik, Sejin, Shan, Zhengyang, Feng, Michelle Yilin, Gao, Ge, Tayal, Yonish, Kulkarni, Rushil, Yu, Jialin, Divekar, Nupur, Ghadiyaram, Deepti, Wijaya, Derry, Betke, Margrit
Ethical intervention prompting has emerged as a tool to counter demographic biases of text-to-image generative AI models. Existing solutions either require retraining the model or struggle to generate images that reflect desired distributions of gender and race. We propose an inference-time process called DebiasPI, for Debiasing-by-Prompt-Iteration, that provides prompt intervention by enabling the user to control the distributions of individuals' demographic attributes in image generation. DebiasPI keeps track of which attributes have been generated, either by probing the internal state of the model or by using external attribute classifiers. Its control loop guides the text-to-image model to select attributes that are not yet sufficiently represented. With DebiasPI, we were able to create images with equal representations of race and gender that visualize challenging concepts from news headlines. We also experimented with the attributes age, body type, profession, and skin tone, and measured how attributes change when our intervention prompt targets the distribution of an unrelated attribute type. We found, for example, that if the text-to-image model is asked to balance racial representation, gender representation improves but skin tone becomes less diverse. Attempts to cover a wide range of skin colors with various intervention prompts showed that the model struggles to generate the palest skin tones. We conducted various ablation studies, in which we removed DebiasPI's attribute control; these reveal the model's propensity to generate young, male characters.
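To make the control loop concrete, here is a minimal Python sketch of inference-time debiasing by prompt iteration. It assumes hypothetical `generate_image` and `classify_attributes` callables standing in for the text-to-image model and the external attribute classifier; the actual DebiasPI prompting and internal-state probing are more involved than this.

```python
from collections import Counter

# Illustrative attribute values; DebiasPI also handles gender, age,
# body type, profession, and skin tone.
ATTRIBUTE_VALUES = ["Asian", "Black", "Hispanic", "White"]

def debias_by_prompt_iteration(base_prompt, generate_image,
                               classify_attributes, n_images=20):
    """Steer generation toward a uniform attribute distribution by
    rewriting the prompt each iteration and verifying the result."""
    per_value_cap = n_images // len(ATTRIBUTE_VALUES)
    counts = Counter()
    images = []
    while len(images) < n_images:
        # Request the currently most underrepresented attribute value.
        target = min(ATTRIBUTE_VALUES, key=lambda v: counts[v])
        prompt = f"{base_prompt}, depicting a {target} person"
        image = generate_image(prompt)
        # Check what was actually generated; the model may not comply.
        observed = classify_attributes(image)
        if counts[observed] < per_value_cap:
            counts[observed] += 1
            images.append(image)
        # Otherwise discard the image and iterate with a new prompt.
    return images, counts
```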
ExeChecker: Where Did I Go Wrong?
Gu, Yiwen, Patel, Mahir, Betke, Margrit
In this paper, we present ExeChecker, a contrastive learning-based framework for the interpretation of rehabilitation exercises. Our work builds upon state-of-the-art advances in human pose estimation, graph-attention neural networks, and transformer interpretability. The downstream task is to assist rehabilitation by providing informative feedback to users while they perform prescribed exercises. We utilize a contrastive learning strategy during training: given a tuple of correctly and incorrectly executed exercises, our model is able to identify and highlight the joints that are involved in an incorrect movement and thus require the user's attention. We collected an in-house dataset, ExeCheck, with paired recordings of correct and incorrect executions of exercises. In our experiments, we tested our method on this dataset as well as on the UI-PRMD dataset and found that ExeChecker outperformed the baseline method, which uses pairwise sequence alignment, in identifying joints of physical relevance in rehabilitation exercises.
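The following PyTorch sketch illustrates one way such a contrastive objective and joint highlighting could look, using a triplet-style margin loss over exercise embeddings. The loss and saliency computation here are illustrative assumptions, not ExeChecker's actual graph-attention architecture or interpretability method.

```python
import torch
import torch.nn.functional as F

def triplet_exercise_loss(anchor, positive, negative, margin=1.0):
    """Contrastive (triplet-style) objective on exercise embeddings:
    pull a correct execution (anchor) toward another correct execution
    (positive) and push it away from an incorrect one (negative)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def highlight_joints(correct_joint_emb, incorrect_joint_emb):
    """Score joints by how far their embeddings drift between a correct
    and an incorrect execution; high scores flag joints that require
    the user's attention. Inputs: (num_joints, dim) tensors."""
    drift = torch.norm(correct_joint_emb - incorrect_joint_emb, dim=-1)
    return F.softmax(drift, dim=0)  # normalized saliency over joints
```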
Enhancing Emotion Prediction in News Headlines: Insights from ChatGPT and Seq2Seq Models for Free-Text Generation
Gao, Ge, Kim, Jongin, Paik, Sejin, Novozhilova, Ekaterina, Liu, Yi, Bonna, Sarah T., Betke, Margrit, Wijaya, Derry Tanti
Predicting emotions elicited by news headlines can be challenging, as the task is largely influenced by the varying interpretations and backgrounds of readers. Previous works have explored classifying discrete emotions directly from news headlines. We take a different approach to this problem by utilizing people's free-text explanations of how they feel after reading a news headline. Using the BU-NEmo+ dataset (Gao et al., 2022), we found that, for emotion classification, the free-text explanations have a strong correlation with the dominant emotion elicited by the headlines. The free-text explanations also contain more sentimental context than the news headlines alone and can serve as a better input to emotion classification models. Therefore, in this work we explored generating emotion explanations from headlines by training a sequence-to-sequence transformer model and by using a pretrained large language model, ChatGPT (GPT-4). We then used the generated emotion explanations for emotion classification. In addition, we experimented with training the pretrained T5 model on the intermediate task of explanation generation before fine-tuning it for emotion classification. Under McNemar's significance test, methods that incorporate GPT-generated free-text emotion explanations demonstrated significant improvement (p-value < 0.05) in emotion classification from headlines, compared to methods that use only headlines. This underscores the value of intermediate free-text explanations for emotion prediction tasks with headlines.
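A minimal sketch of the generate-then-classify pipeline, using Hugging Face pipelines. The checkpoint names are stand-ins (the paper fine-tunes its own seq2seq model on BU-NEmo+ or prompts GPT-4), and `predict_emotion` is a hypothetical helper, not the paper's code.

```python
from transformers import pipeline

# Stage 1: generate a free-text emotion explanation from a headline.
# "t5-base" is a placeholder for the paper's fine-tuned seq2seq model.
explainer = pipeline("text2text-generation", model="t5-base")

# Stage 2: classify the dominant emotion from the explanation. An
# off-the-shelf emotion classifier stands in for the paper's model.
classifier = pipeline("text-classification",
                      model="j-hartmann/emotion-english-distilroberta-base")

def predict_emotion(headline: str) -> str:
    prompt = f"explain the emotion this headline elicits: {headline}"
    explanation = explainer(prompt, max_new_tokens=64)[0]["generated_text"]
    # The classifier sees the explanation's richer sentimental context,
    # rather than the bare headline.
    return classifier(explanation)[0]["label"]
```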
BUOCA: Budget-Optimized Crowd Worker Allocation
Sameki, Mehrnoosh, Lai, Sha, Mays, Kate K., Guo, Lei, Ishwar, Prakash, Betke, Margrit
Due to concerns about human error in crowdsourcing, it is standard practice to collect labels for the same data point from multiple internet workers. We show here that the resulting budget can be used more effectively with a flexible worker assignment strategy that asks fewer workers to analyze easy-to-label data and more workers to analyze data that requires extra scrutiny. Our main contribution is to show how the number of workers allocated to a task can be computed optimally based on task features alone, without using worker profiles. Our target tasks are delineating cells in microscopy images and analyzing the sentiment toward the 2016 U.S. presidential candidates in tweets. We first propose an algorithm that computes a budget-optimized crowd worker allocation (BUOCA). We next train a machine learning system (BUOCA-ML) that predicts the optimal number of crowd workers needed to maximize the accuracy of the labeling. We show that the computed allocation can yield large savings in the crowdsourcing budget (up to 49 percentage points) while maintaining labeling accuracy. Finally, we envisage a human-machine system for performing budget-optimized data analysis at a scale beyond the feasibility of crowdsourcing.
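One simple way to compute such an allocation, sketched below under the assumption of a per-task marginal-gain estimate derived from task features: greedily give the next worker to the task with the largest predicted accuracy gain until the budget is exhausted. For concave (diminishing-returns) gain curves this greedy scheme is optimal; BUOCA itself may differ in its details.

```python
import heapq

def budget_optimized_allocation(gain, n_tasks, budget, max_workers=9):
    """Greedy worker allocation under a total budget.

    gain(task, k) -> predicted accuracy improvement on `task` from
    adding a k-th worker (estimated from task features alone, without
    worker profiles). Every task starts with one worker.
    """
    alloc = [1] * n_tasks
    spent = n_tasks
    # Max-heap of marginal gains (negated for heapq's min-heap).
    heap = [(-gain(t, 2), t) for t in range(n_tasks)]
    heapq.heapify(heap)
    while spent < budget and heap:
        neg_g, t = heapq.heappop(heap)
        if -neg_g <= 0:
            break  # no task benefits from another worker
        alloc[t] += 1
        spent += 1
        if alloc[t] < max_workers:
            heapq.heappush(heap, (-gain(t, alloc[t] + 1), t))
    return alloc
```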
Predicting Quality of Crowdsourced Image Segmentations from Crowd Behavior
Sameki, Mehrnoosh, Gurari, Danna, Betke, Margrit (Boston University)
Quality control (QC) is an integral part of many crowdsourcing systems. However, popular QC methods, such as aggregating multiple annotations, filtering workers, or verifying the quality of crowd work, introduce additional costs and delays. We propose a paradigm complementary to these QC methods based on predicting the quality of submitted crowd work. In particular, we propose to predict the quality of a given crowd drawing directly from a crowd worker's drawing time, number of user clicks, and average time per user click. We focus on the task of drawing the boundary of a single object in an image. To train and test our prediction models, we collected a total of 2,025 crowd-drawn segmentations of 405 familiar everyday images and unfamiliar biomedical images from 90 unique crowd workers. We first evaluated five prediction models learned using different combinations of the three worker behavior cues for all images. Experiments revealed that average time per user click was the most effective cue for predicting segmentation quality. We next inspected the predictive power of models learned independently from crowd annotations collected for familiar and for unfamiliar data. Prediction models were significantly more effective at estimating segmentation quality from crowd worker behavior for familiar image content than for unfamiliar image content.
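A minimal scikit-learn sketch of quality prediction from the three behavior cues. The feature values below are made-up examples and the linear model is an illustrative assumption, not the paper's exact data or prediction models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Each row: [drawing_time_sec, num_clicks, time_per_click_sec] for one
# crowd-drawn segmentation; y is its quality score (e.g., overlap with
# an expert-drawn boundary). All values here are fabricated examples.
X = np.array([
    [42.0, 18, 2.33],
    [15.5,  6, 2.58],
    [88.2, 40, 2.21],
    [30.1, 12, 2.51],
])
y = np.array([0.91, 0.55, 0.94, 0.78])

model = LinearRegression().fit(X, y)
# The paper compares models trained on different cue subsets; e.g., a
# model using only the time-per-click cue (the most effective one):
score = cross_val_score(LinearRegression(), X[:, [2]], y, cv=2).mean()
print(model.coef_, score)
```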