Genome databases continue to expand with no change in the basic format of sequence data. The prevalent use of classic alignment-based search tools like BLAST has significantly pushed the limits of genome isolate research. The relatively new frontier of metagenomic research deals with thousands of diverse genomes and places newer demands beyond current homologue search and analysis. Compressing sequence data into a complex form could facilitate a broader range of sequence analyses. To this end, this research explores reorganizing sequence data as complex Markov signatures, also known as Extensible Markov Models. Markov models have been applied successfully in biological sequence analysis through small but important extensions to the original theory of Markov chains. The Extensible Markov Model (EMM) offers a novel quasi-alignment complement to classic alignment-based homologous sequence search methods like BLAST. EMM-based BioInformatic Analysis (EMMBA) incorporates automatic learning, which allows the Markov chain to be created dynamically. Oligonucleotide, or genomic word, frequencies form the core sequence data in alignment-free methods. EMMBA extends the Karlin-Altschul statistics to bring an analogous E-score measure of statistical significance to the quasi-alignment domain. By consolidating a community of sequences into a single searchable profile, the EMM methodology further reduces the search space for classification. Through dynamic generation of the score matrix for each community profile, EMMBA fine-tunes the score assignments. Each evaluation iteratively adjusts the profile score matrix to account for the point probabilities of the query, ensuring that the Karlin-Altschul assumptions are satisfied so that a meaningful statistical significance can be derived. The presence of multiple quasi-alignments resembles the multiple local alignments of BLAST. Quasi-alignments are scored based on a difference distribution of Gumbel scores.
Species signature profiles allow for statistical validation of novel species identification. Working in the EMM transformation space speeds up classification and generates a distance matrix for differentiation. The techniques and metrics presented are validated using microbial 16S rRNA sequence data from NCBI.
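
Two of the building blocks named above, oligonucleotide word frequencies and the Karlin-Altschul expectation value, can be illustrated with a minimal Python sketch. This is not the EMMBA implementation: the function names, the default word length of 3, and the parameters `lam` and `k_const` are assumptions made for illustration, and the way EMMBA derives its score matrix and statistical parameters for each community profile is not reproduced here. The second function only evaluates the standard Karlin-Altschul formula E = K * m * n * exp(-lambda * S).

```python
import math
from collections import Counter


def word_frequencies(sequence, k=3):
    """Relative frequency of each overlapping k-letter word in a sequence.

    The word length k is a free parameter here (assumed default: 3).
    """
    seq = sequence.upper()
    # Slide a window of width k over the sequence, one position at a time.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}  # sequence shorter than k: no words to count
    return {word: n / total for word, n in counts.items()}


def karlin_altschul_e_value(score, query_len, db_len, lam, k_const):
    """Standard Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S).

    lam and k_const depend on the scoring system; the values passed in
    are the caller's assumption, not constants taken from EMMBA.
    """
    return k_const * query_len * db_len * math.exp(-lam * score)
```

For example, `word_frequencies("ACGTACGT", k=3)` counts six overlapping words, of which `ACG` and `CGT` each occur twice. For fixed sequence lengths and parameters, a higher score passed to `karlin_altschul_e_value` yields a smaller expectation value.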


Bioinformatics | Computational Biology