FTGPRED: Gene Identification using Fourier Transformation

Performance of FTGPRED

The performance of the FTGPred programs is compared with that of existing methods. We predicted genes in 11 genomes using the existing methods GeneMarkS, GeneMark.hmm, EasyGene and Glimmer. The sensitivity (TP/(TP+FN)) and specificity (TP/(TP+FP)) of each method were computed against the GenBank annotation described below. The performance of the FTGPred programs was found to be comparable to that of the other methods. In general, no single program consistently outperforms the others on all genomes, nor does any single program perform consistently poorly on all genomes. For example, the FTGPred programs perform better than the other programs (except Glimmer) on the Aeropyrum pernix genome, but have low specificity on the Mycobacterium tuberculosis genome and perform poorly on the Mycoplasma genitalium genome. For all other genomes, the FFT-based programs are reasonably accurate.

In the next step, we compared the predicted genes with the experimentally annotated genes in the GenBank annotation and identified the genes missed by these methods. We then predicted genes using the FFT techniques used in FTGPred. We observed that most of the genes missed by the existing methods were predicted by the FFT-based methods. This demonstrates the advantage of FFT-based methods over knowledge-based methods. The complete report on the 11 genomes for all programs is available below.
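The comparison described above can be sketched as a two-step set operation: first collect the annotated genes that no existing program predicted, then check which of those an FFT-based predictor recovers. This is only a minimal sketch. The `(start, end, strand)` gene representation, the function names and the example coordinates are hypothetical, and a gene is treated as found when a prediction on the same strand shares at least one of its end coordinates, following the matching criterion stated in the evaluation section of this page.

```python
from typing import Dict, List, Tuple

# Hypothetical gene representation for this sketch: (start, end, strand).
Gene = Tuple[int, int, str]


def shares_an_end(a: Gene, b: Gene) -> bool:
    """True if both genes lie on the same strand and share a start or an end."""
    return a[2] == b[2] and (a[0] == b[0] or a[1] == b[1])


def missed_by_all(annotated: List[Gene],
                  predictions_by_program: Dict[str, List[Gene]]) -> List[Gene]:
    """Annotated genes that none of the listed programs predicted."""
    all_predictions = [p for preds in predictions_by_program.values() for p in preds]
    return [g for g in annotated
            if not any(shares_an_end(g, p) for p in all_predictions)]


def recovered_by(missed: List[Gene], fft_predictions: List[Gene]) -> List[Gene]:
    """Subset of the missed genes that the FFT-based predictions do find."""
    return [g for g in missed
            if any(shares_an_end(g, p) for p in fft_predictions)]


# Made-up coordinates, for illustration only.
annotated = [(100, 400, "+"), (500, 800, "-"), (900, 1200, "+")]
existing = {
    "ProgramA": [(100, 400, "+")],
    "ProgramB": [(130, 400, "+")],
}
fft_predictions = [(500, 800, "-"), (1500, 1800, "+")]

missed = missed_by_all(annotated, existing)
recovered = recovered_by(missed, fft_predictions)
```

In this toy input two annotated genes are missed by both hypothetical programs, and one of them is recovered by the FFT-based predictions.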

We do not claim that our FFT-based methods perform better than knowledge-based methods. However, we do claim that our methods can predict new/novel genes that are missed by these knowledge-based methods. This is due to the fact that knowledge-based methods learn from known genes, so their prediction accuracy depends to a certain extent on the similarity between the query sequence and the genes used for training. These methods perform poorly or fail on new/novel genes that have no similarity to existing genes. Moreover, knowledge-based methods are mostly organism specific. FFT-based methods, on the other hand, are ab initio methods, so their performance is not affected by the similarity between the query gene and existing genes. These FFT-based methods are not organism specific and can work on any genome. Users are recommended to use FFT-based methods as a complement to knowledge-based methods for gene prediction. The server is based on our paper Issac et al., 2002, Bioinformatics, vol. 18, issue 1, pp 196-197, where we demonstrate the novelty of our approach.
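To illustrate why a Fourier-based detector needs no training set, the sketch below measures the period-3 signal that protein-coding DNA is commonly reported to show in its Fourier spectrum. It is a minimal sketch under stated assumptions, not the FTGPred implementation: the 0/1 indicator encoding of the four bases, the function names `period3_power` and `scan_windows`, and the default window and step sizes are all hypothetical choices made for this example. For simplicity the Fourier coefficient is evaluated directly at frequency 1/3 rather than through a full FFT.

```python
import cmath
from typing import List, Tuple

# DFT kernel at frequency 1/3: exp(-2*pi*i*j/3) for j = 0, 1, 2.
_TWIDDLE = [cmath.exp(-2j * cmath.pi * j / 3) for j in range(3)]


def period3_power(seq: str) -> float:
    """Spectral power at frequency 1/3, summed over the four bases.

    Each base is encoded as a 0/1 indicator sequence (an assumption of this
    sketch). No gene model or training data is involved.
    """
    seq = seq.upper()
    power = 0.0
    for base in "ACGT":
        # Fourier coefficient of the indicator sequence of this base.
        coeff = sum(_TWIDDLE[i % 3] for i, ch in enumerate(seq) if ch == base)
        power += abs(coeff) ** 2
    return power


def scan_windows(seq: str, window: int = 351,
                 step: int = 51) -> List[Tuple[int, float]]:
    """Slide a window along seq; return (window start, power / window length)."""
    results = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        results.append((start, period3_power(chunk) / max(len(chunk), 1)))
    return results


# A strictly 3-periodic toy sequence versus a 4-periodic one.
periodic = period3_power("GCT" * 50)
non_periodic = period3_power("ACGT" * 30)
```

The 3-periodic toy sequence gives a large value while the 4-periodic one gives a value close to zero. In a real scan, windows with a high normalised power would be candidate coding regions, and the decision threshold would have to be chosen separately.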

The following programs were taken for evaluation against 11 microbial genomes:
  1. EasyGene 1.0  http://www.cbs.dtu.dk/services/EasyGene/.
    The EasyGene 1.0 server produces a list of predicted genes given a sequence of prokaryotic DNA. Each prediction is attributed with a significance score (R-value) indicating how likely it is to be just a non-coding open reading frame rather than a real gene. The user needs only to specify the organism hosting the query sequence.

  2. Glimmer  http://www.tigr.org/software/glimmer/.
    Locally installed program, version Glimmer 2.13, obtained from ftp://ftp.tigr.org/pub/software/Glimmer. Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA. The Glimmer system consists of two main programs. The first of these is the training program, build-imm. This program takes an input set of sequences and builds and outputs the IMM for them. These sequences can be complete genes or just partial ORFs. For a new genome, this training data can consist of those genes with strong database hits as well as very long open reading frames that are statistically almost certain to be genes. The second program is glimmer, which uses this IMM to identify putative genes in an entire genome. Glimmer automatically resolves conflicts between most overlapping genes by choosing one of them. It also identifies genes that are suspected to truly overlap, and flags these for closer inspection by the user. These "suspect" gene candidates have been a very small percentage of the total for all the genomes analyzed thus far.

  3. GeneMarkS  http://opal.biology.gatech.edu/GeneMark/genemarks.cgi.
    The models used by GeneMark.hmm 2.1 are derived in an iterative manner from the input sequence. This program was designed to analyze anonymous prokaryotic genome-sized sequences.

  4. GeneMarkHMM  http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi.
    The GeneMark program assesses the protein-coding potential of a DNA sequence (within a sliding window) by using Markov models of coding and non-coding regions. This approach is sensitive to local variations of coding potential, and the GeneMark graph shows details of the coding potential distribution along a sequence. GeneMark.hmm predicts genes and intergenic regions in a sequence as a whole using hidden Markov models with a hidden state network reflecting the "grammar" of gene organization. The GeneMark.hmm program identifies the most likely parse of the whole sequence into protein-coding genes (with possible introns) and intergenic regions.

NOTE: The descriptions of the different programs were adapted from the web pages of the individual programs.

The following 11 microbial genomes were taken from the NCBI Genomic Database as datasets:
  1. Aeropyrum pernix [Archaeal][GI:NC_000854]. 07-APR-2003. [Annotation].

  2. Bacillus anthracis str Ames [Bacteria][GI:NC_003997]. 30-APR-2003. [Annotation].

  3. Borrelia burgdorferi [Bacteria][GI:NC_001318]. 01-AUG-2003. [Annotation].

  4. Chlamydophila pneumoniae AR39 [Bacteria][GI:NC_002179]. 01-AUG-2003. [Annotation].

  5. Campylobacter jejuni [Bacteria][GI:NC_002163]. 19-MAR-2003. [Annotation].

  6. Escherichia coli O157:H7 [Bacteria][GI:NC_002695]. 04-DEC-2003. [Annotation].

  7. Haemophilus influenzae Rd [Bacteria][GI:NC_000907]. 01-DEC-2003. [Annotation].

  8. Helicobacter pylori 26695 [Bacteria][GI:NC_000915]. 09-DEC-2002. [Annotation].

  9. Helicobacter pylori J99 [Bacteria][GI:NC_000921]. 09-DEC-2002. [Annotation].

  10. Mycobacterium tuberculosis H37Rv [Bacteria][GI:NC_000962]. 31-DEC-2003. [Annotation].

  11. Mycoplasma genitalium [Bacteria][GI:NC_000908]. 10-DEC-2002. [Annotation].


Annotations from the GenBank files were used for evaluation. These annotations were obtained as *.ptt files from ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/"organism"/"gi number".ptt. The following measures were calculated: True Positives (the number of actual ORFs/genes that were correctly predicted by the predictors; a prediction is counted as correct if it correctly predicts at least one of the two ends, 5' or 3', of the ORF/gene); False Negatives (the number of actual ORFs/genes that were missed by the predictors); False Positives (the number of predicted ORFs/genes that do not correspond to any actual ORF/gene); Sensitivity, defined as Sen = True Positives/(True Positives + False Negatives); and Specificity, defined as Spe = True Positives/(True Positives + False Positives).
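The measures defined above can be computed as in the following minimal sketch. It assumes that annotated and predicted genes have already been parsed into `(start, end, strand)` tuples. That representation, the names `ends_match` and `evaluate`, and the example coordinates are hypothetical and not part of the original evaluation scripts. On a given strand, matching the start or the end coordinate corresponds to matching the 5' or the 3' end of the gene.

```python
from typing import Dict, List, Tuple

# Hypothetical gene representation for this sketch: (start, end, strand).
Gene = Tuple[int, int, str]


def ends_match(actual: Gene, predicted: Gene) -> bool:
    """A prediction is correct if it hits at least one end on the same strand."""
    same_strand = actual[2] == predicted[2]
    return same_strand and (actual[0] == predicted[0] or actual[1] == predicted[1])


def evaluate(annotated: List[Gene], predicted: List[Gene]) -> Dict[str, float]:
    """Return TP, FN, FP, sensitivity and specificity as defined on this page."""
    # True positives: annotated genes hit by at least one prediction.
    tp = sum(1 for g in annotated if any(ends_match(g, p) for p in predicted))
    # False negatives: annotated genes that no prediction hits.
    fn = len(annotated) - tp
    # False positives: predictions that hit no annotated gene.
    fp = sum(1 for p in predicted if not any(ends_match(g, p) for g in annotated))
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    spe = tp / (tp + fp) if (tp + fp) else 0.0
    return {"TP": tp, "FN": fn, "FP": fp, "Sen": sen, "Spe": spe}


# Made-up coordinates, for illustration only.
annotated = [(100, 400, "+"), (500, 800, "-"), (900, 1200, "+")]
predicted = [(100, 400, "+"), (530, 800, "-"), (2000, 2300, "+")]
scores = evaluate(annotated, predicted)
```

In this toy input the second prediction has a different start but the same end as the annotated gene, so it still counts as a true positive under the "at least one end" rule.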

  1. Performance of individual programs on various genomes
  2. Performance of FTGPRED algorithms on GENEs/ORFs missed by other programs