NDLI: MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics

Content Provider	Springer Nature : BioMed Central
Author	Bredesen, Bjørn André; Rehmsmeier, Marc
Abstract	Background: Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We have recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied on the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA, which could be useful for other researchers, and we tested SVM-MOCCA only with IUPAC motifs and PREs.
Results: We here present an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics: Motif Occurrence Combinatorics Classification Algorithms (MOCCA). MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA, and a new method, RF-MOCCA, a Random Forest derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first study that applies the hierarchical MOCCA approach to the prediction of BEs. Both models significantly improve generalization to PREs and BEs beyond that of previous methods (including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests), with RF-MOCCA yielding the best results.
Conclusion: MOCCA is a flexible and powerful suite of tools for the motif-based modelling of CRE sequences in terms of motif composition. MOCCA can be applied to any new CRE modelling problem for which motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs. For ease of use, MOCCA implements generation of negative training data, and additionally a mode that requires only that the user specify positives, motifs and a genome. MOCCA is licensed under the MIT license and is available on GitHub at https://github.com/bjornbredesen/MOCCA.
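As a rough illustration of the "motif occurrence frequency" features that the abstract lists among the baseline methods, the following minimal Python sketch counts IUPAC motif occurrences on both strands and normalises them per kilobase. It is not taken from MOCCA and does not use its interface: the function names, the example motifs and the per-kilobase normalisation are assumptions made for this example only.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Complement table for IUPAC codes (for example R = A/G pairs with Y = C/T).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Return the reverse complement of an upper-case IUPAC motif."""
    return motif.translate(COMPLEMENT)[::-1]


def iupac_to_regex(motif):
    """Compile an IUPAC motif into a regex that also finds overlapping hits."""
    body = "".join(IUPAC[code] for code in motif)
    # A zero-width lookahead lets matches start at every position.
    return re.compile("(?=" + body + ")")


def count_occurrences(sequence, motif):
    """Count motif occurrences on both strands of a DNA sequence.

    A motif that equals its own reverse complement is scanned only once.
    """
    seq = sequence.upper()
    forward = motif.upper()
    patterns = {forward, reverse_complement(forward)}
    return sum(
        sum(1 for _ in iupac_to_regex(pattern).finditer(seq))
        for pattern in patterns
    )


def motif_frequency_features(sequence, motifs):
    """Return one feature per motif: occurrences per kilobase (assumed scaling)."""
    if not sequence:
        return [0.0] * len(motifs)
    length_kb = len(sequence) / 1000.0
    return [count_occurrences(sequence, motif) / length_kb for motif in motifs]


if __name__ == "__main__":
    # Illustrative motifs only; they are not the motif set used in the article.
    example_motifs = ["GAGAG", "RY"]
    print(motif_frequency_features("GAGAGAGTTT", example_motifs))
```

A feature vector of this kind could be passed to any standard classifier. The hierarchical approach described in the abstract goes further by modelling the local sequence composition around each individual motif occurrence, which this sketch does not attempt.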
Related Links	https://bmcbioinformatics.biomedcentral.com/counter/pdf/10.1186/s12859-021-04143-2.pdf
Ending Page	11
Page Count	11
Starting Page	1
File Format	HTM / HTML
ISSN	1471-2105
DOI	10.1186/s12859-021-04143-2
Journal	BMC Bioinformatics
Issue Number	1
Volume Number	22
Language	English
Publisher	BioMed Central
Publisher Date	2021-05-07
Access Restriction	Open
Subject Keyword	Bioinformatics; Microarrays; Computational Biology; Computer Appl. in Life Sciences; Algorithms; Cis-regulatory element; Motif; Machine learning; Support vector machine; Random forest; Computational Biology/Bioinformatics
Content Type	Text
Resource Type	Article
Subject	Molecular Biology; Biochemistry; Computer Science Applications; Applied Mathematics; Structural Biology
Journal Impact Factor	2.9/2023
5-Year Journal Impact Factor	3.6/2023

Similar Documents

Predicting RNA-Protein Interactions Using Only Sequence Information

Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs

ViralmiR: a support-vector-machine-based method for predicting viral microRNA precursors

Finding evolutionarily conserved cis-regulatory modules with a universal set of motifs

LDsplit: screening for cis-regulatory motifs stimulating meiotic recombination hotspots by analysis of DNA sequence polymorphisms

Semi-supervised protein subcellular localization

BioWord: A sequence manipulation suite for Microsoft Word

Microarray-based cancer prediction using single genes

A structural SVM approach for reference parsing

MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics