Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification

Zhang, Xiangxie, Beinke, Ben, Kindhi, Berlian Al, Wiering, Marco

Nov-1-2020, arXiv.org Artificial Intelligence

The classification of DNA sequences is a key research area in bioinformatics, as it enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art algorithms, namely Convolutional Neural Networks, Deep Neural Networks, and N-gram Probabilistic Models, are used for the task of DNA classification. Furthermore, we introduce a novel feature extraction method based on the Levenshtein distance and randomly generated DNA sub-sequences to compute information-rich features from the DNA sequences. We also use an existing feature extraction method based on 3-grams to represent amino acids, and we combine both feature extraction methods with a multitude of machine learning algorithms. Four different datasets, each concerning a viral disease (Covid-19, AIDS, Influenza, and Hepatitis C), are used to evaluate the different approaches. The results of the experiments show that all methods obtain high accuracies on the different DNA datasets. Furthermore, the domain-specific 3-gram feature extraction method generally leads to the best results in the experiments, while the newly proposed technique outperforms all other methods on the smallest dataset, the Covid-19 dataset.
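As a rough illustration of the two feature extraction ideas named in the abstract, the following Python sketch computes (a) the Levenshtein distance from a DNA sequence to a set of randomly generated sub-sequences and (b) counts of nucleotide 3-grams. It is not the authors' implementation. The abstract does not state how many random sub-sequences are used, how long they are, whether distances are normalized, or whether the 3-grams overlap, so the function names, the default values, and the overlapping-window choice below are all assumptions made for illustration.

```python
import random
from itertools import product

ALPHABET = "ACGT"


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def random_subsequences(n: int, length: int, seed: int = 0) -> list[str]:
    """Generate n random DNA strings (hypothetical helper; n and length are assumed)."""
    rng = random.Random(seed)
    return ["".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(n)]


def levenshtein_features(seq: str, refs: list[str]) -> list[int]:
    """One feature per reference: the edit distance from seq to that reference."""
    return [levenshtein(seq, r) for r in refs]


def trigram_counts(seq: str) -> list[int]:
    """Counts of all 64 nucleotide 3-grams, using overlapping windows (an assumption)."""
    keys = ["".join(p) for p in product(ALPHABET, repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skip windows containing symbols outside A, C, G, T
            counts[tri] += 1
    return [counts[k] for k in keys]


if __name__ == "__main__":
    refs = random_subsequences(n=5, length=8)
    sample = "ACGTACGTTGCA"
    print(levenshtein_features(sample, refs))
    print(sum(trigram_counts(sample)))
```

Either feature vector could then be passed to any standard classifier. Comparing a full genome against each reference with this pure-Python distance would be slow, so a practical version would likely use an optimized library or shorter sequence windows.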

Country:
- Oceania > Australia (0.04)
- Africa > South Africa (0.04)
- South America
  - Peru (0.04)
  - Colombia (0.04)
  - Brazil (0.04)
- North America > United States
  - New York (0.04)
- Europe
  - Sweden (0.04)
  - Spain (0.04)
  - Italy (0.04)
  - Greece (0.04)
  - France (0.04)
  - Netherlands > North Holland
    - Amsterdam (0.04)
- Asia
  - India (0.04)
  - Sri Lanka (0.04)
  - Taiwan (0.04)
  - Indonesia (0.04)
  - Japan (0.04)
  - Vietnam (0.04)
  - Nepal (0.04)
  - Pakistan (0.04)
  - China
    - Hubei Province > Wuhan (0.04)
    - Hong Kong (0.04)
  - Middle East
    - Iran (0.04)
    - Republic of Türkiye (0.04)
    - Israel (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Health & Medicine > Therapeutic Area
  - Infections and Infectious Diseases (1.00)
  - Immunology (1.00)

Technology:
- Information Technology
  - Data Science > Data Mining
    - Feature Extraction (1.00)
  - Artificial Intelligence > Machine Learning
    - Neural Networks > Deep Learning (1.00)
