A Deep Learning CNN Model for Genome Sequence Classification
| Content Provider | Scilit |
|---|---|
| Author | Gunasekaran, Hemalatha; Ramalakshmi, K.; Ramanathan, Shalini; Venkatesan, R. |
| Copyright Year | 2021 |
| Description | Book Name: Intelligent Computing Applications for COVID-19 |
| Abstract | The COVID-19 pandemic, declared by the World Health Organization in March 2020, is a global challenge. It has drawn research interest in various fields, such as drug design, image analysis of COVID-19 CT scans (Hemalatha et al., 2020), social distance monitoring, pandemic spread control, and virology. Our focus is on virology, the study of the DNA or RNA structure of the virus. In this chapter, we use machine-learning predictive analysis to classify an unknown genomic sequence into a known sequence with similar properties, traits, or characteristics. Early prediction and classification help in better treatment and reduced complications. COVID-19 is a single-stranded RNA virus with a huge diversity of genome sequences; analyzing the genome structure and classifying it is a great challenge. Other respiratory viruses, such as severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS), have a similar RNA structure, so early classification is required. The complete reference genomes of viruses such as SARS, MERS, and COVID-19 were downloaded from the National Center for Biotechnology Information (NCBI) database. The dataset contains around 8,100 DNA sequence instances across three classes (COVID, MERS, and SARS). The dataset is split into training and test sets in an 80/20 ratio. In our work, we compare different methods of DNA sequence classification in terms of their accuracy, precision, and recall. First, state-of-the-art classification models such as Bayes net, decision table, locally weighted learning, random forest, and random tree are used to classify the genomic sequence. Second, the raw genomic sequence is treated as a string of characters, and the popular natural language processing technique called k-mer counting (Greiet and Jasper 2019) is used to classify the sequence. The DNA sequence is converted into overlapping words of size 6. With the k-mer words, we create a bag-of-words model so that natural language processing techniques can be applied. This, in turn, is converted into a uniform-length vector and classified using the multinomial naive Bayes classifier, with an accuracy of around 96.9%. Finally, the classification is improved with a deep learning neural network model. The raw DNA sequence is represented as a one-hot encoded vector. The vectors are combined into a 2D array and used to train a convolutional neural network (CNN). The CNN architecture consists of successive convolution and pooling layers. The first convolutional layer has 64 filters with a kernel size of 6 and the second has 32 filters with a kernel size of 3, followed by flatten and dense layers. With this model, we are able to improve the classification accuracy to around 97.9%. |
| Related Links | https://api.taylorfrancis.com/content/chapters/edit/download?identifierName=doi&identifierValue=10.1201/9781003141105-9&type=chapterpdf |
| Ending Page | 185 |
| Page Count | 17 |
| Starting Page | 169 |
| DOI | 10.1201/9781003141105-9 |
| Language | English |
| Publisher | Informa UK Limited |
| Publisher Date | 2021-08-13 |
| Access Restriction | Open |
| Subject Keyword | Book Name: Intelligent Computing Applications for Covid-19 Deep Learning Sequence Classification Neural Diversity Structure Treatment |
| Content Type | Text |
| Resource Type | Chapter |
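
The two sequence representations named in the abstract, overlapping k-mer words of size 6 collected into a bag-of-words vector and one-hot encoding of the raw sequence, can be sketched roughly as follows. This is a minimal illustration in plain Python under stated assumptions, not the chapter's own code: the function names `kmer_words`, `bag_of_words`, and `one_hot` are hypothetical, the alphabet is assumed to be `ACGT`, and mapping unknown bases to an all-zero row is an assumption made here.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed nucleotide alphabet; the abstract does not specify one


def kmer_words(sequence, k=6):
    """Split a sequence into overlapping words of length k (k=6 per the abstract)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bag_of_words(sequences, k=6):
    """Build a shared k-mer vocabulary and one uniform-length count vector per sequence."""
    counts = [Counter(kmer_words(s, k)) for s in sequences]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


def one_hot(sequence):
    """Encode each base as a 4-element indicator row, giving a 2D array.

    Bases outside ALPHABET (for example N) become all-zero rows; this is an
    assumption of this sketch, not something stated in the abstract.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in sequence.upper()]
```

The count vectors from `bag_of_words` are the kind of input a multinomial naive Bayes classifier expects, and the 2D arrays from `one_hot` are the kind of input a convolutional layer would consume; the classifiers themselves are left out of this sketch.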