Cross-Genre Authorship Attribution via LLM-Based Retrieve-and-Rerank
Agarwal, Shantanu; Barry, Joel; Fincke, Steven; Miller, Scott
arXiv.org Artificial Intelligence
Authorship attribution (AA) is the task of identifying the most likely author of a query document from a predefined set of candidate authors. We introduce a two-stage retrieve-and-rerank framework that finetunes LLMs for cross-genre AA. Unlike in information retrieval (IR), where retrieve-and-rerank is the de facto strategy, cross-genre AA systems must avoid relying on topical cues and instead learn to identify author-specific linguistic patterns that are independent of the text's subject matter (genre/domain/topic). Consequently, for the reranker, we demonstrate that training strategies commonly used in IR are fundamentally misaligned with cross-genre AA, leading to suboptimal behavior. To address this, we introduce a targeted data curation strategy that enables the reranker to effectively learn author-discriminative signals. Using our LLM-based retrieve-and-rerank pipeline, we achieve substantial gains of 22.3 and 34.4 absolute Success@8 points over the previous state-of-the-art on HIATUS's challenging HRS1 and HRS2 cross-genre AA benchmarks.
Oct-21-2025
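The two-stage pipeline and the Success@k metric named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper finetunes LLMs for both stages, whereas the two scoring functions below (`word_overlap` and `comma_rate_similarity`) are hypothetical toy stand-ins chosen only to make the example runnable. The definition of Success@k used here (a query counts as a success when its true author appears among the top k ranked candidates) is also an assumption, since the abstract does not define the metric.

```python
from typing import Callable, Dict, List, Sequence

# A scorer maps (query text, candidate text) to a similarity score.
Scorer = Callable[[str, str], float]


def retrieve_and_rerank(
    query: str,
    candidates: Dict[str, str],
    retriever_score: Scorer,
    reranker_score: Scorer,
    shortlist_size: int,
) -> List[str]:
    """Return candidate author ids ordered by the reranker, best first."""
    # Stage 1: score every candidate with the cheap retriever, keep a shortlist.
    shortlist = sorted(
        candidates,
        key=lambda author: retriever_score(query, candidates[author]),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: apply the more expensive reranker to the shortlist only.
    return sorted(
        shortlist,
        key=lambda author: reranker_score(query, candidates[author]),
        reverse=True,
    )


def success_at_k(
    rankings: Sequence[Sequence[str]], true_authors: Sequence[str], k: int
) -> float:
    """Fraction of queries whose true author is in the top k (assumed definition)."""
    if not true_authors:
        raise ValueError("need at least one query")
    hits = sum(
        1 for ranked, truth in zip(rankings, true_authors) if truth in ranked[:k]
    )
    return hits / len(true_authors)


# Toy stand-in for the retriever: Jaccard overlap of whitespace tokens.
def word_overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


# Toy stand-in for the reranker: closeness of commas-per-word, a crude
# topic-independent stylistic cue (higher is more similar).
def comma_rate_similarity(a: str, b: str) -> float:
    def rate(text: str) -> float:
        return text.count(",") / max(len(text.split()), 1)

    return -abs(rate(a) - rate(b))


query = "well, frankly, the match was dull, slow, and long"
candidates = {
    "author_a": "the match was long and the crowd was loud",
    "author_b": "honestly, the recipe is simple, quick, and cheap",
    "author_c": "stocks fell sharply today",
}
ranking = retrieve_and_rerank(
    query, candidates, word_overlap, comma_rate_similarity, shortlist_size=2
)
print(ranking)
print(success_at_k([ranking], ["author_b"], k=1))
```

In this toy setup the retriever favors the topically similar candidate, while the reranker promotes the candidate whose punctuation habits match the query. That mirrors, in miniature, the abstract's point that the reranker should rely on author-specific patterns and not on topical cues.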