Principal Components of Genetic Sequences: Correlations and Significance

V. M. Efimov, K. V. Efimov, V. Yu. Kovaleva, Yu. G. Matushkin

Research output: Contribution to journal › Article › Peer-reviewed


Any numerical series can be decomposed into principal components using singular spectrum analysis. We have recently proposed a new analysis method, PCA-Seq, which allows numerical principal components to be calculated for a sequence of elements of any type. In particular, the sequence may be composed of nucleotide base pairs or amino acid residues. Two questions inevitably arise: how to interpret the obtained principal components and how to assess their reliability. To interpret the principal components of a symbolic sequence, it is reasonable to evaluate their correlations with numerical characteristics of the sequence elements. When assessing the significance of correlations between sequences, one should bear in mind that standard significance criteria assume independence of observations, which, as a rule, does not hold for real sequences. The article discusses the use of the anchor bootstrap technique for this purpose, a technique also developed earlier by the authors. This approach assumes that the objects can be represented by points of a metric space and that, taken together, they make up some fixed structure in it, in particular a sequence. The objects are assigned the same random integer weights as in the classical bootstrap. This is sufficient to obtain the bootstrap distribution of the correlation coefficients and to assess their significance. The coding sequence of the SLC9A1 gene (synonyms APNH, NHE1, PPP1R143) was taken as an example of using the anchor bootstrap technique in genetic sequence analysis. Significant correlations of the first principal component were found with the hydrophobicity / "transmembraneity" of the corresponding fragments of the amino acid sequence, with their phenylalanine content, and with the difference between the T- and A-content of the corresponding nucleotide fragments. A similar pattern was found earlier by other authors for other genes. It is very likely that this pattern is of a more general nature.
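The abstract's remark about random integer weights can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the weighting step, in which each object receives the count of times it is drawn in n draws with replacement, and a weighted Pearson correlation is recomputed for each set of weights. It does not reproduce how the anchor bootstrap handles the metric-space structure. The function names (`weighted_corr`, `bootstrap_weights`, `bootstrap_corr`, `percentile_interval`) and the toy data are hypothetical and chosen for illustration only.

```python
import math
import random


def weighted_corr(x, y, w):
    """Pearson correlation of x and y under non-negative weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    if sxx == 0 or syy == 0:
        return float("nan")  # degenerate replicate: no variance left
    return sxy / math.sqrt(sxx * syy)


def bootstrap_weights(n, rng):
    """Integer weights as in the classical bootstrap: counts of n draws with replacement."""
    w = [0] * n
    for _ in range(n):
        w[rng.randrange(n)] += 1
    return w


def bootstrap_corr(x, y, n_boot=2000, seed=0):
    """Bootstrap distribution of the weighted correlation (degenerate replicates dropped)."""
    rng = random.Random(seed)
    n = len(x)
    reps = []
    for _ in range(n_boot):
        r = weighted_corr(x, y, bootstrap_weights(n, rng))
        if not math.isnan(r):
            reps.append(r)
    return reps


def percentile_interval(reps, alpha=0.05):
    """Simple percentile interval from the bootstrap replicates."""
    s = sorted(reps)
    lo = s[int((alpha / 2) * (len(s) - 1))]
    hi = s[int((1 - alpha / 2) * (len(s) - 1))]
    return lo, hi


if __name__ == "__main__":
    # Made-up numbers standing in for a principal component and a
    # per-fragment numerical characteristic; not data from the article.
    pc1 = [0.1, 0.4, 0.35, 0.8, 0.7, 1.1, 0.9, 1.3, 1.2, 1.6]
    score = [1.0, 1.2, 1.5, 1.9, 1.7, 2.4, 2.0, 2.9, 2.6, 3.1]
    reps = bootstrap_corr(pc1, score, n_boot=1000, seed=1)
    print(percentile_interval(reps))
```

Under this simplified reading, a correlation would be treated as significant if the percentile interval excludes zero. The interval rule is an assumption of the sketch, not a criterion stated in the abstract.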

Original language: English
Pages (from-to): 299-316
Number of pages: 18
Journal: Mathematical Biology and Bioinformatics
Issue number: 2
Publication status: Published - 2021



