BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution
Moosa, Abdullah Muhammad, Sultana, Nusrat, Moosa, Mahdi Muhammad, Hossain, Md. Miraiz
arXiv.org Artificial Intelligence
This research presents a comprehensive investigation into Bangla authorship attribution. It introduces a new balanced benchmark corpus, BARD10 (Bangla Authorship Recognition Dataset of 10 authors), and systematically analyzes the impact of stop-word removal across classical and deep learning models to uncover the stylistic significance of Bangla stop-words. BARD10 is a curated corpus of Bangla blog and opinion prose from ten contemporary authors. Four representative classifiers are assessed: SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and an MLP (Multilayer Perceptron), with uniform preprocessing applied to both BARD10 and the existing benchmark corpus BAAD16 (Bangla Authorship Attribution Dataset of 16 authors). On both datasets, the classical TF-IDF + SVM baseline performed best, attaining a macro-F1 score of 0.997 on BAAD16 and 0.921 on BARD10, while Bangla BERT lagged by as much as five points. The study reveals that BARD10 authors are highly sensitive to stop-word pruning, while BAAD16 authors remain comparatively robust, highlighting a genre-dependent reliance on stop-word signatures. Error analysis revealed that high-frequency components carry authorial signatures that are diminished by transformer models. Three insights are identified: Bangla stop-words serve as essential stylistic indicators; finely calibrated ML models prove effective within short-text limitations; and BARD10 connects formal literature with contemporary web dialogue, offering a reproducible benchmark for future long-context or domain-adapted transformers.
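The two operations the abstract relies on, stop-word pruning and macro-F1 scoring, can be illustrated with a minimal pure-Python sketch. It is not the authors' pipeline: the stop-word set below is a tiny hypothetical sample, whitespace tokenization is an assumption, and the function names `prune_stop_words` and `macro_f1` are invented for illustration.

```python
# Hypothetical, tiny stop-word sample for illustration only;
# the stop-word list used in the paper is not given in this abstract.
STOP_WORDS = {"এবং", "কিন্তু", "যে", "ও", "না"}


def prune_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens found in the stop-word set.

    Whitespace tokenization is an assumption made for this sketch.
    """
    return " ".join(tok for tok in text.split() if tok not in stop_words)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because macro-F1 weights every author equally, a classifier that loses one author's signature after pruning is penalized as much on a balanced corpus as on an imbalanced one. Comparing the score on the original texts against the score on `prune_stop_words` output is one simple way to express the sensitivity the abstract describes.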
Nov-12-2025
- Country:
- Asia > Bangladesh
- Dhaka Division > Dhaka District > Dhaka (0.04)
- Europe > Switzerland (0.04)
- Genre:
- Research Report (1.00)
- Technology: