A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC

by Jason Bennett, Mikhail Pomaznoy, Akul Singhania, Bjoern Peters

Recent technological advances have made the gathering of comprehensive gene expression datasets a commodity. This has shifted the limiting step of transcriptomic studies from the accumulation of data to their analysis and interpretation. The main problem in analyzing transcriptomics data is that the number of independent samples is typically much lower than the number of genes measured (14,000). To address this, it would be desirable to reduce the dimensionality of the gathered data without losing information. Clustering genes into discrete modules is one of the most commonly used tools to accomplish this task. While there are multiple clustering approaches, there is a lack of informative metrics for evaluating the biological quality of the resulting clusters. Here we present a metric that incorporates known ground truth gene sets to quantify the biological quality of gene clusters derived from standard clustering techniques. The GECO (Ground truth Evaluation of Clustering Outcomes) metric demonstrates that quantitative and repeatable scoring of gene clusters is not only possible but also computationally lightweight and robust. Unlike current methods, it allows direct comparison between gene clusters generated by different clustering techniques. It also reveals that current cluster analysis techniques often underestimate the number of clusters that should be formed from a dataset, which leads to fewer clusters of lower quality. As a test case, we ...
Source: PLoS Computational Biology - Category: Biology - Source Type: research