Long-form evaluation of model editing
Rosati, Domenic; Gonzales, Robie; Chen, Jinkun; Yu, Xuemin; Erkan, Melis; Kayani, Yahya; Chavatapalli, Satya Deepika; Rudzicz, Frank; Sajjad, Hassan
arXiv.org Artificial Intelligence
Evaluations of model editing currently use only the "next few token" completions after a prompt. As a result, the impact of these methods on longer natural language generation is largely unknown. We introduce long-form evaluation of model editing (LEME), a novel evaluation protocol that measures the efficacy and impact of model editing in long-form generative settings. Our protocol consists of a machine-rated survey and a classifier which correlates well with human ratings. Importantly, we find that our protocol has very little relationship with previous short-form metrics (despite being designed to extend efficacy, generalization, locality, and portability into a long-form setting), indicating that our method introduces a novel set of dimensions for understanding model editing methods. Using this protocol, we benchmark a number of model editing techniques and present several findings, including that, while some methods (ROME and MEMIT) perform well in making consistent edits within a limited scope, they suffer much more from factual drift than other methods. Finally, we present a qualitative analysis that illustrates common failure modes in long-form generative settings, including internal consistency, lexical cohesion, and locality issues.
Feb-14-2024
- Country:
  - Africa > Middle East
    - Somalia (0.04)
  - Asia
    - Bangladesh > Dhaka Division
      - Dhaka District > Dhaka (0.04)
    - East Asia (0.04)
    - India > Karnataka
      - Bengaluru (0.04)
    - Japan > Honshū
      - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
    - Middle East
      - Israel > Tel Aviv District
        - Tel Aviv (0.04)
      - Lebanon > Beirut Governorate
        - Beirut (0.04)
    - Philippines (0.04)
    - Russia (0.04)
  - Atlantic Ocean > Mediterranean Sea (0.04)
  - Europe
    - Finland (0.04)
    - France > Île-de-France
    - Germany (0.04)
    - Italy
    - Middle East > Malta
      - Port Region > Southern Harbour District > Valletta (0.04)
    - Poland
      - Lesser Poland Province > Kraków (0.04)
      - Masovia Province > Warsaw (0.04)
    - Russia (0.04)
    - United Kingdom > England
      - Oxfordshire > Oxford (0.04)
  - North America
    - Canada
      - Nova Scotia > Halifax Regional Municipality
        - Halifax (0.04)
      - Ontario > Toronto (0.04)
    - United States
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - Mississippi > Adams County
        - Natchez (0.04)
      - New York (0.04)
  - Oceania > Australia
    - New South Wales > Sydney (0.04)
- Genre:
  - Personal (0.93)
  - Questionnaire & Opinion Survey (1.00)
  - Research Report
    - Experimental Study (1.00)
    - New Finding (0.68)
- Industry:
  - Education (0.92)
  - Leisure & Entertainment > Games
    - Computer Games (0.67)
- Technology: