NDLI: Harmonization of supervised machine learning practices for efficient source attribution of Listeria monocytogenes based on genomic data

Content Provider	Springer Nature : BioMed Central
Author	Castelli, Pierluigi; De Ruvo, Andrea; Bucciacchio, Andrea; D’Alterio, Nicola; Cammà, Cesare; Di Pasquale, Adriano; Radomski, Nicolas
Abstract	Background: Genomic data-based machine learning tools are promising for real-time surveillance activities that perform source attribution of foodborne bacteria such as Listeria monocytogenes. Given the heterogeneity of machine learning practices, our aim was to identify those that influence the source prediction performance of the usual holdout method combined with the repeated k-fold cross-validation method. Methods: A large collection of 1,100 L. monocytogenes genomes with known sources was built according to several genomic metrics to ensure the authenticity and completeness of genomic profiles. Based on these genomic profiles (i.e. 7-locus alleles, core alleles, accessory genes, core SNPs and pan kmers), we developed a versatile workflow that assesses the prediction performance of different combinations of training dataset splitting (i.e. 50, 60, 70, 80 and 90%), data preprocessing (i.e. with or without near-zero variance removal), and learning models (i.e. BLR, ERT, RF, SGB, SVM and XGB). The performance metrics included accuracy, Cohen’s kappa, F1-score, area under the receiver operating characteristic curve, the precision-recall curve or the precision-recall-gain curve, and execution time. Results: The average testing accuracies from accessory genes and pan kmers were significantly higher than those from core alleles or SNPs. While the accuracies from 70 and 80% training dataset splitting were not significantly different from each other, those from 80% were significantly higher than those from the other tested proportions. Near-zero variance removal prevented results from being produced for 7-locus alleles, did not significantly affect accuracy for core alleles, accessory genes and pan kmers, and significantly decreased accuracy for core SNPs. The SVM and XGB models did not differ significantly in accuracy from each other and reached significantly higher accuracies than BLR, SGB, ERT and RF, in that order. However, the SVM model required more computing power than the XGB model, especially for large numbers of descriptors such as core SNPs and pan kmers. Conclusions: In addition to recommendations about machine learning practices for L. monocytogenes source attribution based on genomic data, the present study provides a freely available workflow that can address other balanced or unbalanced multiclass phenotypes from binary and categorical genomic profiles of other microorganisms without source code modifications.
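The splitting and preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' workflow: the function names, the 10-fold/3-repeat setting, the fixed seed and the frequency-ratio cutoff of 19 are assumptions chosen for illustration. Only the 1,100 genomes and the 80% training proportion come from the abstract.

```python
import random
from collections import Counter


def holdout_split(n_samples, train_fraction, rng):
    """Shuffle sample indices and split them into training and testing sets."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]


def repeated_kfold(train_indices, k, repeats, rng):
    """Yield (fit, validation) index lists: k folds, reshuffled on each repeat."""
    for _ in range(repeats):
        shuffled = list(train_indices)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            validation = folds[i]
            fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield fit, validation


def near_zero_variance(column, freq_ratio_cut=19.0):
    """Flag a descriptor whose most common value dominates the second most common.

    Simplified, hypothetical criterion; the cutoff is an assumption.
    """
    counts = Counter(column).most_common(2)
    if len(counts) < 2:
        return True  # constant descriptor: zero variance
    return counts[0][1] / counts[1][1] > freq_ratio_cut


rng = random.Random(0)  # fixed seed, used only to make the sketch reproducible
train, test = holdout_split(1100, 0.8, rng)  # holdout: 80% training, 20% testing
splits = list(repeated_kfold(train, k=10, repeats=3, rng=rng))  # assumed k and repeats
```

In this arrangement the testing set is never touched during cross-validation. Models would be tuned on the repeated folds of the training set and then scored once on the held-out genomes.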
Related Links	https://bmcgenomics.biomedcentral.com/counter/pdf/10.1186/s12864-023-09667-w.pdf
Ending Page	19
Page Count	19
Starting Page	1
File Format	HTM / HTML
ISSN	14712164
DOI	10.1186/s12864-023-09667-w
Journal	BMC Genomics
Issue Number	1
Volume Number	24
Language	English
Publisher	BioMed Central
Publisher Date	2023-09-22
Access Restriction	Open
Subject Keyword	Life Sciences; Microarrays; Proteomics; Animal Genetics and Genomics; Microbial Genetics and Genomics; Plant Genetics and Genomics; Listeria monocytogenes; Source attribution; Machine learning; Genomic data
Content Type	Text
Resource Type	Article
Subject	Biotechnology; Genetics
Journal Impact Factor	3.5/2023
5-Year Journal Impact Factor	4.1/2023

Similar Documents

Genomic dissection of the most prevalent Listeria monocytogenes clone, sequence type ST87, in China

Development of ListeriaBase and comparative analysis of Listeria monocytogenes

Virulence characterization and comparative genomics of Listeria monocytogenes sequence type 155 strains

Genes significantly associated with lineage II food isolates of Listeria monocytogenes

Evaluating the accuracy of Listeria monocytogenes assemblies from quasimetagenomic samples using long and short reads

Genomic prediction using machine learning: a comparison of the performance of regularized regression, ensemble, instance-based and deep learning methods on synthetic and empirical data

Dynamics of mobile genetic elements of Listeria monocytogenes persisting in ready-to-eat seafood processing plants in France

An advanced bioinformatics approach for analyzing RNA-seq data reveals sigma H-dependent regulation of competence genes in Listeria monocytogenes

Machine learning classification of archaea and bacteria identifies novel predictive genomic features

Harmonization of supervised machine learning practices for efficient source attribution of Listeria monocytogenes based on genomic data