NDLI: PFClust: a novel parameter free clustering algorithm

Content Provider	Springer Nature : BioMed Central
Author	Mavridis, Lazaros; Nath, Neetika; Mitchell, John BO
Abstract	Background: We present the algorithm PFClust (Parameter Free Clustering), which automatically clusters data and identifies a suitable number of clusters to group them into, without requiring the user to specify any parameters. The algorithm partitions a dataset into a number of clusters that share some common attributes, such as their minimum expectation value and variance of intra-cluster similarity. A set of n objects can be clustered into any number of clusters from one to n, and many clustering methodologies are available for doing so: hierarchical and partitional, agglomerative and divisive. Nonetheless, automatically determining the number of clusters present in a dataset remains a significant challenge for clustering algorithms. Identifying a putative optimum number of clusters involves computing and evaluating a range of clusterings with different numbers of clusters. However, there is no agreed or unique definition of optimum in this context. We therefore test PFClust on datasets for which an external gold standard of 'correct' cluster definitions exists, noting that this division into clusters may be suboptimal according to other reasonable criteria. PFClust is heuristic in the sense that it cannot be described in terms of optimising any single simply-expressed metric over the space of possible clusterings.
Results: We validate PFClust first on a number of synthetic datasets consisting of 2D vectors, showing that its clustering performance is at least equal to that of six other leading methodologies, even though five of the other methods are told in advance how many clusters to use. We also demonstrate the ability of PFClust to classify the three-dimensional structures of protein domains, using a set of folds taken from the structural bioinformatics database CATH.
Conclusions: We show that PFClust clusters the test datasets a little better, on average, than any of the other algorithms, and does so without the need to specify any external parameters. Results on the synthetic datasets demonstrate that PFClust generates meaningful clusters, and the algorithm also shows excellent agreement with the correct assignments for a dataset extracted from the part-manually curated CATH classification of protein domain structures.
Related Links	https://bmcbioinformatics.biomedcentral.com/counter/pdf/10.1186/1471-2105-14-213.pdf
Ending Page	21
Page Count	21
Starting Page	1
File Format	HTM / HTML
ISSN	1471-2105
DOI	10.1186/1471-2105-14-213
Journal	BMC Bioinformatics
Issue Number	1
Volume Number	14
Language	English
Publisher	BioMed Central
Publisher Date	2013-07-03
Access Restriction	Open
Subject Keyword	Bioinformatics; Microarrays; Computational Biology; Computer Appl. in Life Sciences; Algorithms; Cluster Algorithm; Synthetic Dataset; Average Similarity; Rand Index; Original Cluster; Computational Biology/Bioinformatics
Content Type	Text
Resource Type	Article
Subject	Molecular Biology; Biochemistry; Computer Science Applications; Applied Mathematics; Structural Biology
Journal Impact Factor	2.9/2023
5-Year Journal Impact Factor	3.6/2023
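
The abstract describes validating clusterings against an external gold standard, and the keywords list the Rand Index. As an illustration only, the following minimal sketch computes the plain (unadjusted) Rand index between two label assignments. It is not taken from the paper: the function name `rand_index` and the example labels are hypothetical, and whether the authors used exactly this form of the index is an assumption.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the fraction of object pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two objects
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two objects")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical example: six objects, a gold standard with two clusters
# and a predicted clustering with three clusters.
gold = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 2, 2]
print(rand_index(gold, predicted))
```

The index depends only on which objects are grouped together, so relabelling the clusters leaves the score unchanged. A value of 1.0 means the two clusterings are identical as partitions.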

Similar Documents

A systematic comparison of genome-scale clustering algorithms

GenClust: A genetic algorithm for clustering gene expression data

Finding reproducible cluster partitions for the k-means algorithm

Effect of data normalization on fuzzy clustering of DNA microarray data

Clustering metagenomic sequences with interpolated Markov models

ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use

Ranked Adjusted Rand: integrating distance and partition information in a measure of clustering agreement

A novel hierarchical clustering algorithm for gene sequences

Semi-supervised adaptive-height snipping of the hierarchical clustering tree

PFClust: a novel parameter free clustering algorithm