BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution

Moosa, Abdullah Muhammad, Sultana, Nusrat, Moosa, Mahdi Muhammad, Hossain, Md. Miraiz

arXiv.org Artificial Intelligence 

This research presents a comprehensive investigation into Bangla authorship attribution, introducing a new balanced benchmark corpus, BARD10 (Bangla Authorship Recognition Dataset of 10 authors), and systematically analyzing the impact of stop-word removal across classical and deep learning models to uncover the stylistic significance of Bangla stop-words. BARD10 is a curated corpus of Bangla blog and opinion prose from ten contemporary authors. Four representative classifiers are methodically assessed: SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and an MLP (Multilayer Perceptron), using uniform preprocessing on both BARD10 and the benchmark corpus BAAD16 (Bangla Authorship Attribution Dataset of 16 authors). On both datasets, the classical TF-IDF + SVM baseline performed best, attaining a macro-F1 score of 0.997 on BAAD16 and 0.921 on BARD10, while Bangla BERT lagged by as much as five points. The study reveals that BARD10 authors are highly sensitive to stop-word pruning, while BAAD16 authors remain comparatively robust, highlighting a genre-dependent reliance on stop-word signatures. Error analysis showed that high-frequency components carry authorial signatures that transformer models diminish. Three insights emerge: Bangla stop-words serve as essential stylistic indicators; finely calibrated classical ML models prove effective under short-text constraints; and BARD10 bridges formal literature and contemporary web discourse, offering a reproducible benchmark for future long-context or domain-adapted transformers.
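The winning baseline described above can be sketched in a few lines with scikit-learn. This is a hedged illustration, not the authors' released code: the four toy documents and the two "authors" below are hypothetical stand-ins for BARD10/BAAD16, which must be obtained separately, and no Bangla-specific preprocessing is shown.

```python
# Sketch of a TF-IDF + linear-SVM authorship pipeline scored with macro-F1,
# mirroring the baseline the abstract reports (assumptions: toy data, two
# authors instead of the paper's ten or sixteen).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical placeholder documents; real experiments use long Bangla prose.
texts = [
    "alpha beta gamma alpha",
    "alpha beta delta beta",
    "omega sigma tau omega",
    "omega sigma rho sigma",
]
authors = ["author_a", "author_a", "author_b", "author_b"]

# TF-IDF features feeding a linear SVM. The paper's stop-word ablation would
# plug in here, e.g. TfidfVectorizer(stop_words=<Bangla stop-word list>).
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, authors)

# Macro-F1 averages per-class F1 scores equally, the metric quoted above.
macro_f1 = f1_score(authors, model.predict(texts), average="macro")
```

Evaluating on the training texts here only checks that the pipeline runs; the reported 0.997/0.921 figures come from held-out evaluation on the actual corpora.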
