NDLI: ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest

Content Provider	Springer Nature : BioMed Central
Author	Luo, Junwei; Feng, Yading; Wu, Xuyang; Li, Ruimin; Shi, Jiawei; Chang, Wenjing; Wang, Junfeng
Abstract	Background: Cancer subtype classification is helpful for personalized cancer treatment. Although some approaches have been developed for classifying cancer subtypes based on high-dimensional gene expression data, it is difficult to obtain satisfactory classification results. Meanwhile, some cancers have been well studied and classified into subtypes that are adopted by most researchers. Hence, this prior knowledge is significant for further identifying new meaningful subtypes.
Results: In this paper, we present ForestSubtype, a combined parallel random forest and autoencoder approach for cancer subtype identification based on high-dimensional gene expression data. First, ForestSubtype adopts the parallel RF and the prior knowledge of cancer subtypes to train a module and extract significant candidate features. Second, ForestSubtype uses a random forest as the base module and ten parallel random forests to compute each feature weight and rank the features separately. The intersection of the features with the larger weights output by the ten parallel random forests is then taken as the set of candidate features. Third, ForestSubtype uses an autoencoder to condense the selected features into two-dimensional data. Fourth, ForestSubtype utilizes k-means++ to obtain new cancer subtype identification results. The breast cancer gene expression data obtained from The Cancer Genome Atlas are used for training and validation, and an independent breast cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Additionally, we use two other cancer datasets to validate the generalizability of ForestSubtype. ForestSubtype outperforms the other two methods in terms of the distribution of clusters and the internal and external metric results. The open-source code is available at https://github.com/lffyd/ForestSubtype.
Conclusions: Our work shows that combining high-dimensional gene expression data with parallel random forests and an autoencoder, guided by a priori knowledge, can identify new subtypes more effectively than existing methods of cancer subtype classification.
Related Links	https://bmcbioinformatics.biomedcentral.com/counter/pdf/10.1186/s12859-023-05412-y.pdf
Ending Page	19
Page Count	19
Starting Page	1
File Format	HTM / HTML
ISSN	1471-2105
DOI	10.1186/s12859-023-05412-y
Journal	BMC Bioinformatics
Issue Number	1
Volume Number	24
Language	English
Publisher	BioMed Central
Publisher Date	2023-07-19
Access Restriction	Open
Subject Keyword	Bioinformatics; Microarrays; Computational Biology; Computer Appl. in Life Sciences; Algorithms; Cancer subtyping; Random forest; Gene expression data; Machine learning; Auto Encoder; Computational Biology/Bioinformatics
Content Type	Text
Resource Type	Article
Subject	Molecular Biology; Biochemistry; Computer Science Applications; Applied Mathematics; Structural Biology
Journal Impact Factor	2.9/2023
5-Year Journal Impact Factor	3.6/2023
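
Two of the steps named in the abstract, taking the intersection of the highest-weighted features across several forests and clustering with k-means++, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes the per-forest feature weights have already been computed, omits the random forest training and the autoencoder entirely, and uses hypothetical names throughout (`top_k_intersection`, `kmeans_pp_init`, `kmeans`, and the cut-off `k`).

```python
import random


def top_k_intersection(weights_per_forest, k):
    """Return the feature indices ranked in the top k by every forest.

    weights_per_forest: one list of feature weights per forest (assumed given).
    k: hypothetical cut-off for "features with the larger weights".
    """
    selected = None
    for weights in weights_per_forest:
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        top = set(ranked[:k])
        selected = top if selected is None else selected & top
    return sorted(selected or [])


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, n_clusters, rng):
    """k-means++ seeding: draw each new centre with probability proportional
    to its squared distance from the nearest centre chosen so far."""
    centres = [rng.choice(points)]
    while len(centres) < n_clusters:
        d2 = [min(sq_dist(p, c) for c in centres) for p in points]
        if sum(d2) == 0:
            # All points coincide with existing centres; fall back to uniform.
            centres.append(rng.choice(points))
        else:
            centres.append(rng.choices(points, weights=d2, k=1)[0])
    return centres


def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd iterations started from k-means++ seeds."""
    rng = random.Random(seed)
    centres = kmeans_pp_init(points, n_clusters, rng)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centre for every point.
        labels = [
            min(range(n_clusters), key=lambda j: sq_dist(p, centres[j]))
            for p in points
        ]
        # Update step: move each centre to the mean of its members.
        for j in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres


# Toy weights for three forests over five features (illustrative values only).
weights = [
    [0.50, 0.30, 0.10, 0.05, 0.05],
    [0.40, 0.15, 0.35, 0.05, 0.05],
    [0.45, 0.20, 0.25, 0.05, 0.05],
]
shared = top_k_intersection(weights, 2)

# Toy two-dimensional points standing in for the condensed representation.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = kmeans(points, 2)
```

In practice the weights would come from trained forests (for example, impurity-based importances), and the two-dimensional points from the encoder output; both are replaced by toy values here.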

Similar Documents

Deep learning approach for cancer subtype classification using high-dimensional gene expression data

BCDForest: a boosting cascade deep forest model towards the classification of cancer subtypes based on gene expression data

Multi-dimensional data integration algorithm based on random walk with restart

A laminar augmented cascading flexible neural forest model for classification of cancer subtypes based on gene expression data

A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

Gene selection and classification of microarray data using random forest

SVExpress: identifying gene features altered recurrently in expression with nearby structural variant breakpoints

A network-assisted co-clustering algorithm to discover cancer subtypes based on gene expression

Improved high-dimensional prediction with Random Forests by the use of co-data

ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest