Gene Selection and Sample Classification with Applications to TCGA Data

NIH Director's Seminar. Earlier, we developed a computational algorithm, GAKNN, for assessing the importance of genes for sample classification based on expression data. GAKNN combines a genetic algorithm (GA) with the k-nearest neighbor (KNN) method to identify many predictive sets of genes, each of which can jointly distinguish two classes of samples in a training set. The relative importance of a gene for sample classification can then be assessed by the proportion of predictive sets that contain that gene. We have now extended the algorithm to handle multiple classes in the data. Furthermore, instead of giving a deterministic classification for the “test set” samples, the modified GAKNN now provides the frequencies/probabilities with which each “test set” sample is classified into each of the classes. We are currently testing our method on gene expression and DNA methylation data from The Cancer Genome Atlas (TCGA). In this talk, I will show some of the preliminary results we obtained from analyzing the expression data of 602 normal human tissue samples and the expression data of 336 skin cutaneous melanoma (SKCM) samples. For the 602 normal tissue samples, which span 11 types/classes, we were able to correctly classify 88% of them. For the SKCM data, we carried out two related analyses. In the first analysis, we fixed the clinical classification of the SKCM tumors (metastatic or primary). In the second analysis, we give each sample a small ...
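The two summaries described in the abstract, gene importance as the proportion of predictive sets containing a gene and per-class classification frequencies for a test sample, can be illustrated with a minimal sketch. This is not the GAKNN implementation: the genetic algorithm search is omitted, the gene sets are hand-picked stand-ins for the sets a GA would return (one is deliberately uninformative to show how votes split), and all function names, the toy expression values, and the choice of Euclidean distance with k = 3 are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, genes, k=3):
    """Majority vote of the k nearest training samples, using only `genes`."""
    dists = []
    for row, label in zip(train_X, train_y):
        d = math.sqrt(sum((row[g] - x[g]) ** 2 for g in genes))
        dists.append((d, label))
    dists.sort(key=lambda t: t[0])  # stable sort: ties keep training order
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def gene_importance(gene_sets, n_genes):
    """Proportion of gene sets that contain each gene index."""
    counts = Counter(g for s in gene_sets for g in set(s))
    return [counts[g] / len(gene_sets) for g in range(n_genes)]


def class_frequencies(train_X, train_y, x, gene_sets, k=3):
    """Fraction of gene sets whose KNN vote assigns x to each class."""
    votes = Counter(knn_predict(train_X, train_y, x, s, k) for s in gene_sets)
    return {c: votes[c] / len(gene_sets) for c in sorted(set(train_y))}


# Toy training data (hypothetical): 6 samples x 4 genes, three classes.
train_X = [
    [1.0, 0.0, 5.0, 0.2],
    [1.1, 0.1, 4.8, 0.9],
    [5.0, 0.2, 1.0, 0.1],
    [5.2, 0.0, 1.1, 0.8],
    [3.0, 5.0, 3.0, 0.5],
    [3.1, 5.1, 2.9, 0.4],
]
train_y = ["A", "A", "B", "B", "C", "C"]

# Hand-picked gene sets standing in for GA output (assumption).
# The last set uses only gene 3, which carries no class signal here.
gene_sets = [(0, 1), (0, 2), (1, 2), (3,)]

test_sample = [1.05, 0.05, 4.9, 0.5]

importance = gene_importance(gene_sets, n_genes=4)
freqs = class_frequencies(train_X, train_y, test_sample, gene_sets)

print(importance)
print(freqs)
```

In this toy case three of the four gene sets vote for class A and the uninformative set votes for class C, so the test sample receives a graded class profile rather than a single hard label.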
Source: Videocast - All Events - Category: Journals (General) Tags: Upcoming Events Source Type: video