BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution

Moosa, Abdullah Muhammad, Sultana, Nusrat, Moosa, Mahdi Muhammad, Hossain, Md. Miraiz

arXiv.org Artificial Intelligence 

This research presents a comprehensive investigation into Bangla authorship attribution, introducing a new balanced benchmark corpus, BARD10 (Bangla Authorship Recognition Dataset of 10 authors), and systematically analyzing the impact of stop-word removal across classical and deep learning models to uncover the stylistic significance of Bangla stop-words. BARD10 is a curated corpus of Bangla blog and opinion prose from ten contemporary authors. Four representative classifiers are methodically assessed: SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and an MLP (Multilayer Perceptron), using uniform preprocessing on both BARD10 and the benchmark corpus BAAD16 (Bangla Authorship Attribution Dataset of 16 authors). On both datasets, the classical TF-IDF + SVM baseline performed best, attaining a macro-F1 score of 0.997 on BAAD16 and 0.921 on BARD10, while Bangla BERT lagged by as much as five points. The study reveals that BARD10 authors are highly sensitive to stop-word pruning, while BAAD16 authors remain comparatively robust, highlighting a genre-dependent reliance on stop-word signatures. Error analysis showed that high-frequency components carry authorial signatures that transformer models diminish. Three insights emerge: Bangla stop-words serve as essential stylistic indicators; finely calibrated classical ML models prove effective under short-text constraints; and BARD10 bridges formal literature and contemporary web discourse, offering a reproducible benchmark for future long-context or domain-adapted transformers.
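The winning baseline described above can be sketched in a few lines with scikit-learn. This is a hedged illustration, not the authors' released code: the four toy documents and the two "authors" below are hypothetical stand-ins for BARD10/BAAD16, which must be obtained separately, and no Bangla-specific preprocessing is shown.

```python
# Sketch of a TF-IDF + linear-SVM authorship pipeline scored with macro-F1,
# mirroring the baseline the abstract reports (assumptions: toy data, two
# authors instead of the paper's ten or sixteen).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical placeholder documents; real experiments use long Bangla prose.
texts = [
    "alpha beta gamma alpha",
    "alpha beta delta beta",
    "omega sigma tau omega",
    "omega sigma rho sigma",
]
authors = ["author_a", "author_a", "author_b", "author_b"]

# TF-IDF features feeding a linear SVM. The paper's stop-word ablation would
# plug in here, e.g. TfidfVectorizer(stop_words=<Bangla stop-word list>).
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, authors)

# Macro-F1 averages per-class F1 scores equally, the metric quoted above.
macro_f1 = f1_score(authors, model.predict(texts), average="macro")
```

Evaluating on the training texts here only checks that the pipeline runs; the reported 0.997/0.921 figures come from held-out evaluation on the actual corpora.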
