Automatic Authorship Attribution of Noisy Documents

Sayoud, Halim (University of Sciences and Technology Houari Boumediene (USTHB)) | Khennouf, Salah (University of Sciences and Technology Houari Boumediene (USTHB)) | Benzerroug, Hocine (Independent Researcher) | Hamadache, Zohra (University of Sciences and Technology Houari Boumediene (USTHB)) | Hadjadj, Hassina (University of Sciences and Technology Houari Boumediene (USTHB)) | Ouamour, Siham (University of Sciences and Technology Houari Boumediene (USTHB))

May-16-2017–AAAI Conferences

In this survey, we investigate the robustness of several features and classifiers in automatic authorship attribution. Our corpus consists of 25 documents written in English by 5 American philosophers. The documents are converted digitally into grey-scale images, and several levels of noise are added to corrupt those images. The noise is of the “Salt & Pepper” type, added at random positions across each image at the following levels: 0%, 1%, 2%, 3%, 4%, 5%, 6% and 7%. Each image then goes through an OCR (Optical Character Recognition) program to extract its text, and the resulting text document is used in the authorship attribution experiments. Several features and classifiers are employed and evaluated with regard to classification performance. The results show that the most robust feature for authorship attribution is the character tetragram (a sequence of four consecutive characters), which provides a score of 100% even at a noise level of 7%.
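The two ends of the pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `add_salt_and_pepper` and `char_ngrams` are hypothetical. The sketch assumes that a noise level is the fraction of pixels corrupted and that salt (white) and pepper (black) are equally likely; the abstract does not say either. The OCR step is omitted because the abstract does not name the OCR program used.

```python
import random
from collections import Counter

# Noise levels listed in the abstract, expressed as fractions of pixels.
NOISE_LEVELS = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07]


def add_salt_and_pepper(image, level, seed=0):
    """Return a noisy copy of a grey-scale image.

    `image` is a list of rows of integers in 0..255. A fraction `level`
    of the pixels is overwritten with 0 (pepper) or 255 (salt).
    Assumption: both values are equally likely.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    noisy = [row[:] for row in image]  # leave the input untouched
    n_corrupt = round(level * height * width)
    # Sample distinct positions so that exactly n_corrupt pixels are hit.
    for pos in rng.sample(range(height * width), n_corrupt):
        r, c = divmod(pos, width)
        noisy[r][c] = rng.choice((0, 255))
    return noisy


def char_ngrams(text, n=4):
    """Count overlapping character n-grams; n=4 gives tetragrams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    page = [[128] * 100 for _ in range(100)]  # uniform grey test image
    for level in NOISE_LEVELS:
        noisy = add_salt_and_pepper(page, level)
        changed = sum(p != 128 for row in noisy for p in row)
        print(f"{level:.0%} noise: {changed} pixels corrupted")
    print(char_ngrams("abcabca"))
```

In a full experiment, each noisy image would be passed through an OCR program, and the tetragram counts of the recovered text would serve as the feature vector given to a classifier.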

automatic authorship attribution, machine learning, optical character recognition, (2 more...)


Genre:
- Overview (0.53)

Technology:
- Information Technology > Artificial Intelligence
  - Vision > Optical Character Recognition (0.53)
  - Machine Learning (0.53)
