Authorship Verification based on the Likelihood Ratio of Grammar Models
Nini, Andrea, Halvani, Oren, Graner, Lukas, Gherardi, Valerio, Ishihara, Shunichi
arXiv.org Artificial Intelligence
Authorship Verification (AV) is the process of analyzing a set of documents to determine whether they were written by a specific author. This problem often arises in forensic scenarios, e.g., in cases where the documents in question constitute evidence of a crime. Existing state-of-the-art AV methods rely on computational solutions that lack a plausible scientific explanation for how they work and that are often difficult for analysts to interpret. To address this, we propose a method based on a quantity we call $\lambda_G$ (LambdaG): the ratio between the likelihood of a document given a model of the Grammar for the candidate author and the likelihood of the same document given a model of the Grammar for a reference population. These Grammar Models are estimated using $n$-gram language models that are trained solely on grammatical features. Despite not needing large amounts of training data, LambdaG outperforms other established AV methods with higher computational complexity, including a fine-tuned Siamese Transformer network. Our empirical evaluation, based on four baseline methods applied to twelve datasets, shows that LambdaG leads to better results in terms of both accuracy and AUC in eleven cases, and in all twelve cases when only topic-agnostic methods are considered. The algorithm is also highly robust to substantial variations in the genre of the reference population in many cross-genre comparisons. In addition to these properties, we demonstrate how LambdaG is easier to interpret than the current state of the art. We argue that the advantage of LambdaG over other methods is due to the fact that it is compatible with Cognitive Linguistic theories of language processing.
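The likelihood-ratio idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes bigram models with add-one smoothing, whereas the abstract does not say which $n$-gram order, smoothing, or feature extraction is used. The function names (`train_bigram`, `log_likelihood`, `log_likelihood_ratio`) are hypothetical, and the input is assumed to be sentences already reduced to grammatical tokens (for example, function words plus part-of-speech placeholders).

```python
import math
from collections import Counter

START, END = "<s>", "</s>"  # sentence boundary markers (assumed convention)


def train_bigram(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = [START] + list(tokens) + [END]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def log_likelihood(sentences, model, vocab_size):
    """Log-likelihood of the sentences under an add-one-smoothed bigram model."""
    bigrams, contexts, _ = model
    total = 0.0
    for tokens in sentences:
        padded = [START] + list(tokens) + [END]
        for prev, cur in zip(padded, padded[1:]):
            # Add-one smoothing is an assumption made for this sketch only.
            prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
            total += math.log(prob)
    return total


def log_likelihood_ratio(doc, author_sents, reference_sents):
    """Log of the ratio: likelihood under the author model over the reference model."""
    author = train_bigram(author_sents)
    reference = train_bigram(reference_sents)
    # Use one shared vocabulary so both models assign mass over the same events.
    doc_tokens = {tok for sent in doc for tok in sent}
    vocab_size = len(author[2] | reference[2] | doc_tokens | {START, END})
    return (log_likelihood(doc, author, vocab_size)
            - log_likelihood(doc, reference, vocab_size))
```

Under these assumptions, a positive value means the questioned document is more probable under the candidate author's model than under the reference population's model, and a negative value means the opposite. Working in log space avoids numerical underflow on longer documents.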
Mar-13-2024
- Country:
- Oceania > Australia
- South Australia > Adelaide (0.04)
- Australian Capital Territory > Canberra (0.04)
- North America
- United States
- Pennsylvania > Allegheny County
- Pittsburgh (0.04)
- New York > New York County
- New York City (0.04)
- California
- Santa Clara County > Stanford (0.04)
- San Diego County > San Diego (0.04)
- Canada > Alberta
- Europe
- Bulgaria (0.04)
- Greece > Central Macedonia
- Thessaloniki (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- Spain > Valencian Community
- Valencia Province > Valencia (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.14)
- South Yorkshire > Sheffield (0.04)
- Oxfordshire > Oxford (0.04)
- Kent > Canterbury (0.04)
- Italy
- Tuscany > Florence (0.04)
- Piedmont > Turin Province
- Turin (0.04)
- Emilia-Romagna > Metropolitan City of Bologna
- Bologna (0.04)
- Slovenia > Drava
- Municipality of Benedikt > Benedikt (0.04)
- Germany > Hesse
- Darmstadt Region > Darmstadt (0.04)
- Sweden > Vaestra Goetaland
- Gothenburg (0.04)
- France > Occitanie
- Haute-Garonne > Toulouse (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Asia > China
- Hong Kong (0.04)
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Law (1.00)
- Information Technology > Security & Privacy (0.92)
- Law Enforcement & Public Safety (0.67)
- Media (0.67)
- Technology:
- Information Technology
- Communications > Social Media (0.93)
- Artificial Intelligence
- Cognitive Science (0.93)
- Representation & Reasoning > Uncertainty (0.67)
- Natural Language
- Text Processing (1.00)
- Grammars & Parsing (0.67)
- Machine Learning
- Statistical Learning (1.00)
- Neural Networks > Deep Learning (1.00)
- Learning Graphical Models > Directed Networks
- Bayesian Learning (0.67)