On the State of the Art in Authorship Attribution and Authorship Verification
Tyo, Jacob, Dhingra, Bhuwan, Lipton, Zachary C.
arXiv.org Artificial Intelligence
Despite decades of research on authorship attribution (AA) and authorship verification (AV), inconsistent dataset splits and filtering, together with mismatched evaluation methods, make it difficult to assess the state of the art. In this paper, we present a survey of the fields, resolve points of confusion, introduce Valla, which standardizes and benchmarks AA/AV datasets and metrics, conduct a large-scale empirical evaluation, and provide apples-to-apples comparisons between existing methods. We evaluate eight promising methods on fifteen datasets (including distribution-shifted challenge sets) and introduce a new large-scale dataset based on texts archived by Project Gutenberg. Surprisingly, we find that a traditional n-gram-based model performs best on 5 of the 7 AA tasks, achieving an average macro-accuracy of $76.50\%$ (compared to $66.71\%$ for a BERT-based model). However, on the two AA datasets with the greatest number of words per author, as well as on the AV datasets, BERT-based models perform best. While AV methods are easily applied to AA, they are seldom included as baselines in AA papers. We show that, with hard-negative mining, AV methods are competitive alternatives to AA methods. Valla and all experiment code can be found here: https://github.com/JacobTyo/Valla
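The abstract reports results as macro-accuracy without defining it. The sketch below assumes the common reading, in which accuracy is computed separately for each author and then averaged, so that prolific authors do not dominate the score. The function name `macro_accuracy` and the toy labels are hypothetical and are not taken from Valla.

```python
from collections import defaultdict


def macro_accuracy(true_authors, predicted_authors):
    """Mean of per-author accuracies (assumed definition of macro-accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_authors, predicted_authors):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    # Each author contributes one accuracy value, regardless of text count.
    per_author = [correct[author] / total[author] for author in total]
    return sum(per_author) / len(per_author)


# Hypothetical toy labels: four texts by one author, two by another.
y_true = ["austen", "austen", "austen", "austen", "dickens", "dickens"]
y_pred = ["austen", "austen", "austen", "austen", "austen", "dickens"]

print(macro_accuracy(y_true, y_pred))  # 0.75
```

On these toy labels, plain accuracy would be 5/6, while the per-author average is (1.0 + 0.5) / 2 = 0.75, which shows how the macro average penalizes errors on the less frequent author.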
Oct-5-2022
- Country:
  - North America
    - Dominican Republic (0.04)
    - United States
      - Pennsylvania > Allegheny County
        - Pittsburgh (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
  - Europe
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
    - Slovenia > Drava
      - Municipality of Benedikt > Benedikt (0.04)
    - Romania > București - Ilfov Development Region
      - Municipality of Bucharest > Bucharest (0.05)
    - Netherlands > South Holland
      - Dordrecht (0.04)
    - Greece > Central Macedonia
      - Thessaloniki (0.04)
    - France > Occitanie
      - Haute-Garonne > Toulouse (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
  - Asia
    - Middle East > Palestine (0.04)
    - India > Bihar
      - Patna (0.04)
- Genre:
  - Overview (1.00)
  - Research Report > New Finding (0.93)
- Industry:
  - Information Technology (0.93)
- Technology: