NDLI: Machine learning with the TCGA-HNSC dataset: improving usability by addressing inconsistency, sparsity, and high-dimensionality

Content Provider	Springer Nature : BioMed Central
Author	Rendleman, Michael C.; Buatti, John M.; Braun, Terry A.; Smith, Brian J.; Nwakama, Chibuzo; Beichel, Reinhard R.; Brown, Bart; Casavant, Thomas L.
Abstract	Background: In the era of precision oncology and publicly available datasets, the amount of information available for each patient case has dramatically increased. From clinical variables and PET-CT radiomics measures to DNA-variant and RNA expression profiles, such a wide variety of data presents a multitude of challenges. Large clinical datasets are subject to sparsely and/or inconsistently populated fields. Corresponding sequencing profiles can suffer from the problem of high dimensionality, where making useful inferences can be difficult without correspondingly large numbers of instances. In this paper we report a novel deployment of machine learning techniques to handle data sparsity and high dimensionality, while evaluating potential biomarkers in the form of unsupervised transformations of RNA data. We apply preprocessing, MICE imputation, and sparse principal component analysis (SPCA) to improve the usability of more than 500 patient cases from the TCGA-HNSC dataset for enhancing future oncological decision support for Head and Neck Squamous Cell Carcinoma (HNSCC). Results: Imputation was shown to improve the prognostic ability of sparse clinical treatment variables. SPCA transformation of RNA expression variables reduced runtime for RNA-based models, though changes to classifier performance were not significant. Gene ontology enrichment analyses of gene sets associated with individual sparse principal components (SPCs) are also reported, showing that both high- and low-importance SPCs were associated with cell death pathways, though the high-importance gene sets were found to be associated with a wider variety of cancer-related biological processes. Conclusions: MICE imputation allowed us to impute missing values for clinically informative features, improving their overall importance for predicting two-year recurrence-free survival by incorporating variance from other clinical variables. Dimensionality reduction of RNA expression profiles via SPCA reduced both computation cost and model training/evaluation time without affecting classifier performance, allowing researchers to obtain experimental results much more quickly. SPCA simultaneously provided a convenient avenue for consideration of biological context via gene ontology enrichment analysis.
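Illustrative sketch (not part of the catalogued record): the abstract names MICE imputation for sparse clinical fields and SPCA for RNA expression but does not give the implementation. The following is a minimal sketch under stated assumptions: scikit-learn's `IterativeImputer` stands in for MICE (a single chained-equations draw rather than multiple imputations), `SparsePCA` stands in for SPCA, and all data, array sizes, and parameter values are synthetic placeholders rather than values from the paper.

```python
import numpy as np
# IterativeImputer is still flagged experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
n_patients = 60  # hypothetical cohort size, far smaller than TCGA-HNSC

# Synthetic stand-in for clinical variables, with one correlated pair
# so that imputation has some signal to borrow from.
clinical = rng.normal(size=(n_patients, 5))
clinical[:, 1] += 0.8 * clinical[:, 0]

# Knock out roughly 20% of entries to mimic sparsely populated fields.
missing_mask = rng.random(clinical.shape) < 0.2
clinical_missing = clinical.copy()
clinical_missing[missing_mask] = np.nan

# Chained-equations imputation: each variable is regressed on the others
# in turn; sample_posterior=True draws imputed values instead of using means.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
clinical_imputed = imputer.fit_transform(clinical_missing)

# Synthetic stand-in for a (centered) RNA expression matrix.
rna = rng.normal(size=(n_patients, 40))
rna = rna - rna.mean(axis=0)

# Sparse PCA: each component loads on a subset of genes (alpha controls sparsity).
spca = SparsePCA(n_components=4, alpha=1.0, random_state=0)
rna_spcs = spca.fit_transform(rna)

# Genes with non-zero loadings per component; gene sets of this kind are what
# an enrichment analysis would take as input.
genes_per_spc = [np.flatnonzero(component) for component in spca.components_]
```

The reduced matrix `rna_spcs` has one column per sparse component, so a downstream classifier trains on 4 features here instead of 40, which is the kind of runtime saving the abstract describes.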
Related Links	https://bmcbioinformatics.biomedcentral.com/counter/pdf/10.1186/s12859-019-2929-8.pdf
Ending Page	9
Page Count	9
Starting Page	1
File Format	HTM / HTML
ISSN	14712105
DOI	10.1186/s12859-019-2929-8
Journal	BMC Bioinformatics
Issue Number	1
Volume Number	20
Language	English
Publisher	BioMed Central
Publisher Date	2019-06-17
Access Restriction	Open
Subject Keyword	Bioinformatics; Microarrays; Computational Biology; Computer Appl. in Life Sciences; Algorithms; Machine learning; hnscc; tcga; Dimensionality reduction; Gene ontology enrichment analysis; Decision support; Unsupervised transformation; Computational Biology/Bioinformatics
Content Type	Text
Resource Type	Article
Subject	Molecular Biology; Biochemistry; Computer Science Applications; Applied Mathematics; Structural Biology
Journal Impact Factor	2.9/2023
5-Year Journal Impact Factor	3.6/2023

Similar Documents

Using transfer learning and dimensionality reduction techniques to improve generalisability of machine-learning predictions of mosquito ages from mid-infrared spectra

DGH-GO: dissecting the genetic heterogeneity of complex diseases using gene ontology

Gene-gene interaction filtering with ensemble of filters

False positive reduction in protein-protein interaction predictions using gene ontology annotations

CLEAN: CLustering Enrichment ANalysis

NIPS workshop on New Problems and Methods in Computational Biology

Gene set enrichment meta-learning analysis: next- generation sequencing versus microarrays

Nonlinear dimensionality reduction methods for synthetic biology biobricks’ visualization

BiNChE: A web tool and library for chemical enrichment analysis based on the ChEBI ontology
