Learning Personalized Story Evaluation

Danqing Wang, Kevin Yang, Hanlin Zhu, Xiaomeng Yang, Andrew Cohen, Lei Li, Yuandong Tian

arXiv.org Artificial Intelligence 

While large language models (LLMs) have shown impressive results for more objective tasks such as QA and retrieval, it remains nontrivial to evaluate their performance on open-ended text generation for reasons including (1) data contamination; (2) multi-dimensional evaluation criteria; and (3) subjectiveness stemming from reviewers' personal preferences. To address these issues, we propose to model personalization in uncontaminated open-ended generation assessment. We create two new datasets, Per-MPST and Per-DOC, for personalized story evaluation by re-purposing existing datasets with proper anonymization and new personalized labels. We further develop a personalized story evaluation model, PerSE, to infer reviewer preferences and provide a personalized evaluation. PerSE predicts either a detailed review or a fine-grained comparison on several aspects (such as interestingness and surprise) for that reviewer on a new text input. PerSE outperforms GPT-4 by 15.8% on Kendall correlation of story ratings, and by 13.7% on pairwise preference prediction accuracy. Both datasets and code will be released.

LLMs' abilities in open-ended text generation remain insufficiently evaluated. Meanwhile, some recent metrics propose to directly use strong LLMs as evaluators (Fu et al., 2023; Liu et al., 2023). Besides, the contamination problem may affect evaluation performance, similar to other tasks (Chang et al., 2023). Human evaluation is also widely used for open-ended text generation; however, it can be time-consuming and expensive, especially for larger-scale evaluation. The personalization issue in text generation has recently attracted increasing attention (Flek, 2020; Dudy et al., 2021), but personalization in evaluation is still under-explored.

In this paper, we explore personalized evaluation for long-form story generation, where the assessment is heavily influenced by reviewers' personal preferences. For example, Figure 1 illustrates two reviewers' opinions when comparing two plots derived from the same premise: Reviewer 1 prefers Plot A for its uplifting ending, while Reviewer 2 favors Plot B because of its plot complexity and empathetic ending. To model such diverse preferences in story evaluation, the major difficulty lies in two aspects: personalized story evaluation datasets, i.e., uncontaminated story datasets with personal information, and reviewer preference modeling, i.e., effective methods to capture reviewer preferences and evaluate stories from a particular individual's perspective. Few story evaluation datasets have personal labels due to the difficulty of collecting personal information. Besides, most existing story datasets have been exposed to LLMs.
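The personalized setting described above, where an evaluator is conditioned on a handful of a reviewer's earlier reviews before judging a new story on aspects such as interestingness and surprise, can be sketched as a simple prompt-construction step. The snippet below is a minimal illustration of that idea under assumptions of ours, not the authors' released code; the function names, prompt wording, aspect list, and the `generate` callable are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PastReview:
    """One of a reviewer's earlier (story, review) pairs, used as preference context."""
    story: str
    review: str


def build_personalized_prompt(past_reviews: List[PastReview], new_story: str,
                              aspects: List[str]) -> str:
    """Assemble a prompt that conditions the evaluator on a reviewer's past reviews
    before asking for an aspect-wise judgment of a new story (illustrative only)."""
    context = "\n\n".join(
        f"Story:\n{r.story}\n\nThis reviewer's review:\n{r.review}"
        for r in past_reviews
    )
    aspect_list = ", ".join(aspects)
    return (
        "Below are stories reviewed by the same reviewer, followed by their reviews.\n\n"
        f"{context}\n\n"
        "Based on this reviewer's preferences, write the review they would give the "
        f"new story, and rate it on these aspects: {aspect_list}.\n\n"
        f"New story:\n{new_story}\n"
    )


def personalized_evaluate(generate: Callable[[str], str],
                          past_reviews: List[PastReview],
                          new_story: str) -> str:
    """Run any text-generation backend (hypothetical `generate` callable) on the prompt."""
    prompt = build_personalized_prompt(
        past_reviews, new_story,
        aspects=["interestingness", "surprise"],  # aspects named in the summary above
    )
    return generate(prompt)
```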

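The reported comparisons against GPT-4 use Kendall correlation between predicted and reviewer-given story ratings, and accuracy on pairwise preference prediction. The sketch below shows how these two metrics can be computed with `scipy.stats.kendalltau` and plain Python; the variable names and toy numbers are illustrative and are not taken from the paper's evaluation scripts.

```python
from scipy.stats import kendalltau


def kendall_correlation(predicted: list, gold: list) -> float:
    """Kendall's tau between model-predicted ratings and a reviewer's actual ratings."""
    tau, _p_value = kendalltau(predicted, gold)
    return tau


def pairwise_accuracy(predicted_prefs: list, gold_prefs: list) -> float:
    """Fraction of story pairs where the predicted preference ('A' or 'B') matches the reviewer's."""
    correct = sum(p == g for p, g in zip(predicted_prefs, gold_prefs))
    return correct / len(gold_prefs)


# Toy example with made-up numbers, only to show the metric calls.
print(kendall_correlation([4, 2, 5, 3], [5, 1, 4, 3]))
print(pairwise_accuracy(["A", "B", "A"], ["A", "A", "A"]))
```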