NDLI: A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies

Content Provider	Springer Nature : BioMed Central
Author	Van, Richard; Alvarez, Daniel; Mize, Travis; Gannavarapu, Sravani; Chintham Reddy, Lohitha; Nasoz, Fatma; Han, Mira V.
Abstract	Background: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated by different labs. The data are often variable in quality, procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin. Results: We aimed to investigate the impact of data preprocessing steps (normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer. Conclusion: Using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance, measured by weighted F1-score, in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
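The abstract reports classifier performance as a weighted F1-score without defining it. The sketch below assumes the common support-weighted definition (per-class F1 averaged with weights proportional to each class's number of true samples); the abstract does not confirm that this exact variant was used. The function name `weighted_f1` and the tissue labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed definition)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")

    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predicted samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    score = 0.0
    for label, n_true in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n_true
        # F1 is defined as 0 when the class has no true positives.
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical tissue-of-origin labels, not data from the study.
y_true = ["lung", "lung", "breast", "breast", "breast", "colon"]
y_pred = ["lung", "lung", "breast", "breast", "colon", "colon"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Because each class is weighted by its support, frequent tumor types dominate the score, which matters when a training set and an independent test set differ in class balance.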
Related Links	https://bmcbioinformatics.biomedcentral.com/counter/pdf/10.1186/s12859-024-05801-x.pdf
Ending Page	22
Page Count	22
Starting Page	1
File Format	HTM / HTML
ISSN	1471-2105
DOI	10.1186/s12859-024-05801-x
Journal	BMC Bioinformatics
Issue Number	1
Volume Number	25
Language	English
Publisher	BioMed Central
Publisher Date	2024-05-08
Access Restriction	Open
Subject Keyword	Bioinformatics; Microarrays; Computational Biology; Computer Appl. in Life Sciences; Algorithms; RNA-Seq; Classification; Cancer; Batch effect correction; Normalization; Data scaling; Computational Biology/Bioinformatics
Content Type	Text
Resource Type	Article
Subject	Molecular Biology; Biochemistry; Computer Science Applications; Applied Mathematics; Structural Biology
Journal Impact Factor	2.9/2023
5-Year Journal Impact Factor	3.6/2023

Similar Documents

Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data

FastqPuri: high-performance preprocessing of RNA-seq data

Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality

NDRindex: a method for the quality assessment of single-cell RNA-Seq preprocessing data

ARPIR: automatic RNA-Seq pipelines with interactive report

Analysis of single-cell RNA sequencing data based on autoencoders

Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data

Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data

A statistical normalization method and differential expression analysis for RNA-seq data between different species