CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization
Joshi, Brihi, Venkatapathy, Sriram, Bansal, Mohit, Peng, Nanyun, Chang, Haw-Shiuan
arXiv.org Artificial Intelligence
Evaluating creative text such as human-written stories using language models has always been a challenging task, owing to the subjectivity of multi-annotator ratings. To mimic the thinking process of humans, chain of thought (CoT) generates free-text explanations that help guide a model's predictions, and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely used self-consistency reasoning methods cause suboptimal results due to an objective mismatch between generating 'fluent-looking' explanations and actually leading to a good rating prediction for an aspect of a story. To overcome this challenge, we propose $\textbf{C}$hain-$\textbf{o}$f-$\textbf{Ke}$ywords (CoKe), which generates a sequence of keywords $\textit{before}$ generating a free-text rationale; these keywords guide the rating prediction of our evaluation language model. Then, we generate a diverse set of such keywords and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe based on our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4, with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
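The sample-then-aggregate step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `sample_generation` is a hypothetical stand-in that returns random keywords and a random 1-5 rating in place of a real evaluation model, and the abstract does not say how scores are aggregated, so the mean used here is an assumption. A majority vote, one common way to realize self-consistency, is shown for comparison.

```python
import random
from collections import Counter
from statistics import mean


def sample_generation(story, aspect, rng):
    """Hypothetical stand-in for one sampled model generation.

    A real system would prompt an evaluation model to emit keywords,
    then a rationale, then a rating. Here both outputs are random.
    """
    vocabulary = ["pacing", "vivid", "cliche", "twist", "flat"]
    keywords = rng.sample(vocabulary, k=2)
    rating = rng.randint(1, 5)  # assumed 1-5 rating scale
    return keywords, rating


def aggregate_mean(ratings):
    """Average the sampled ratings (assumed aggregation rule)."""
    return mean(ratings)


def aggregate_majority(ratings):
    """Return the most frequent rating; ties go to the first one seen."""
    return Counter(ratings).most_common(1)[0][0]


def score_story(story, aspect, n_samples=8, seed=0):
    """Sample several keyword/rating generations and aggregate the ratings."""
    rng = random.Random(seed)
    generations = [sample_generation(story, aspect, rng) for _ in range(n_samples)]
    ratings = [rating for _, rating in generations]
    return aggregate_mean(ratings), aggregate_majority(ratings)
```

Calling `score_story("Once upon a time...", "ending")` returns a pair of aggregated ratings, one per aggregation rule. With a real model, the diversity of the sampled keyword sequences would come from the decoding strategy rather than from a seeded random generator.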
Mar-21-2025
- Country:
  - Oceania > Australia
  - North America
    - United States > California (0.14)
    - Dominican Republic (0.04)
    - Canada > Ontario
      - Toronto (0.04)
  - Europe
    - United Kingdom (0.04)
    - Monaco (0.04)
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
  - Asia
    - Singapore (0.04)
    - Middle East
      - Jordan (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.04)
- Genre:
  - Research Report > New Finding (0.34)
- Technology: