Distinguishing the disease-associated SNPs based on composition frequency analysis

Abstract: A single-nucleotide polymorphism (SNP) is a basic form of variation in the genome. When SNPs occur at microRNA binding sites, they can influence binding efficiency, cause fluctuations in mRNA levels in vivo, and thus give rise to abnormalities at the posttranscriptional level. SNPs are therefore strongly correlated with disease. Although an enormous number of SNPs have been experimentally identified, only a tiny proportion of them are truly disease-associated SNPs (dSNPs) that relate to microRNA modification and are then involved in the disease-causing process. It is therefore important to distinguish dSNPs from ordinary SNPs. The analysis here shows that composition differs between sequence segments centered on a dSNP and those centered on an ordinary SNP. Inspired by the composition, transition, and distribution features, which are meaningful and effective in characterizing proteins' sequence information, we improved these features and applied them to represent the frequency and physicochemical properties of a gene sequence. A binary encoding scheme was also used to further label the four nucleotides (A, T, C, and G). First, clustering analysis was performed to obtain reasonable negative samples. Then, optimization tests were carried out on different ratios of positive to negative samples and on different feature subsets selected by the F-score evaluation method. The optimal model, constructed with random forest, achieves an accuracy of more than 90% on the testing data set. Moreover, the promising results of the external validation also demonstrate ...
Source: Interdisciplinary Sciences, Computational Life Sciences - Category: Bioinformatics Source Type: research
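
To make the feature ideas in the abstract concrete, here is a minimal Python sketch of three of them: nucleotide composition of a sequence segment, a binary encoding of the four nucleotides, and an F-score for ranking a single feature. It is an illustration under stated assumptions, not the authors' implementation. The 2-bit mapping in BINARY_CODE, the function names, and the specific F-score formula (a commonly used between-class over within-class variance ratio) are assumptions; the abstract does not specify any of them.

```python
from collections import Counter
from statistics import mean, variance

# Hypothetical 2-bit code per nucleotide; the abstract does not give the mapping.
BINARY_CODE = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def composition(segment):
    """Return the fraction of A, C, G and T (in that order) in the segment."""
    if not segment:
        raise ValueError("segment must not be empty")
    counts = Counter(segment.upper())
    return [counts[base] / len(segment) for base in "ACGT"]


def binary_encode(segment):
    """Concatenate the 2-bit code of every position into one flat list."""
    return [bit for base in segment.upper() for bit in BINARY_CODE[base]]


def f_score(pos_values, neg_values):
    """Score one feature by how well it separates positives from negatives.

    Assumed definition: squared deviations of the two class means from the
    overall mean, divided by the sum of the two sample variances. Each class
    needs at least two values so that a sample variance exists.
    """
    overall = mean(list(pos_values) + list(neg_values))
    numerator = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    denominator = variance(pos_values) + variance(neg_values)
    if denominator == 0:
        # No within-class spread: perfectly separating if the means differ.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


if __name__ == "__main__":
    segment = "AACG"  # toy segment, not real data
    print(composition(segment))
    print(binary_encode("AT"))
    print(f_score([1.0, 3.0], [5.0, 7.0]))
```

In a pipeline like the one the abstract describes, features with higher scores would be kept first when building candidate feature subsets, and the resulting vectors would then be passed to a classifier such as a random forest.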