Automatic Authorship Attribution of Noisy Documents
Sayoud, Halim (University of Sciences and Technology Houari Boumediene (USTHB)) | Khennouf, Salah (University of Sciences and Technology Houari Boumediene (USTHB)) | Benzerroug, Hocine (Independent Researcher) | Hamadache, Zohra (University of Sciences and Technology Houari Boumediene (USTHB)) | Hadjadj, Hassina (University of Sciences and Technology Houari Boumediene (USTHB)) | Ouamour, Siham (University of Sciences and Technology Houari Boumediene (USTHB))
In this survey, we investigate the robustness of several features and classifiers in automatic authorship attribution. Our corpus consists of 25 documents written in English by 5 different American philosophers. The documents are digitally converted into grey-scale images, and several levels of noise are added to corrupt those images. The noise is of the “Salt & Pepper” type and is added at random positions across each image at the following noise levels: 0%, 1%, 2%, 3%, 4%, 5%, 6% and 7%. Each image then goes through an OCR (Optical Character Recognition) program that extracts the text from the image. The resulting text document is then used in the authorship attribution experiments. Several features and classifiers are employed and evaluated with regard to classification performance. The results show that the most robust feature in authorship attribution is the character tetragram, which provides a score of 100% even at a noise level of 7%.
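The two ends of the pipeline described above, corrupting a grey-scale image with “Salt & Pepper” noise and extracting character tetragrams from the recovered text, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the noise level is the fraction of pixels set to pure black or pure white, and that tetragrams are overlapping four-character windows counted without normalization. The function names `add_salt_and_pepper` and `char_tetragrams` are hypothetical, and the OCR step is omitted.

```python
import random
from collections import Counter


def add_salt_and_pepper(image, noise_level, seed=None):
    """Return a noisy copy of a grey-scale image.

    `image` is a list of rows of integers in 0..255.
    `noise_level` is assumed to be the fraction of pixels to corrupt
    (for example 0.07 for the 7% level); the abstract does not define it.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    noisy = [row[:] for row in image]  # copy so the input is left untouched

    # Pick distinct pixel positions uniformly at random.
    n_corrupt = round(noise_level * height * width)
    for pos in rng.sample(range(height * width), n_corrupt):
        r, c = divmod(pos, width)
        # "Pepper" is black (0), "salt" is white (255).
        noisy[r][c] = rng.choice((0, 255))
    return noisy


def char_tetragrams(text):
    """Count overlapping 4-character sequences in `text`."""
    return Counter(text[i:i + 4] for i in range(len(text) - 3))


if __name__ == "__main__":
    page = [[128] * 100 for _ in range(100)]  # uniform grey test image
    noisy_page = add_salt_and_pepper(page, 0.07, seed=0)
    changed = sum(
        a != b
        for row_a, row_b in zip(page, noisy_page)
        for a, b in zip(row_a, row_b)
    )
    print("corrupted pixels:", changed)
    print(char_tetragrams("the theory").most_common(3))
```

In a full experiment the tetragram counts of each OCR output would be turned into a feature vector and passed to a classifier; the choice of classifier and any normalization are left open here because the abstract does not specify them.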
May-16-2017