
Computational Tools for Genome Annotation

Sequencing techniques are advancing rapidly, and the number of sequenced genomes is increasing exponentially as a result. One of the major challenges in contemporary science is annotating the available sequence data. Annotation identifies the coding regions in a genome and their physical locations. It also provides the number and spatial distribution of repeat regions, along with evolutionary information about whole genomes.

Several computational tools have been developed to cut down the time and expense of experimental annotation. Computational resources at CRDD are classified into the following categories:

Servers integrated at CRDD

Server | Description
FTG | A web server for locating probable protein-coding regions in a nucleotide sequence using a Fourier transform approach (Issac, B., Singh, H., Kaur, H. and Raghava, G.P.S. (2002) Bioinformatics 18:196).
EGPred | A server for predicting genes (protein-coding regions, including introns and exons) in eukaryotic genomes using similarity-aided (double) and consensus ab initio methods (Issac B, Raghava GP. (2004) Genome Res. 14(9):1756-66).
FTGPred | A web server for predicting genes in a DNA sequence.
GWBLAST | A genome-wide BLAST server. It allows users to search their sequences against sequenced genomes and annotated proteomes, and integrates various tools for analysing BLAST search results.
SVMgene | A support vector machine based approach for identifying protein-coding regions in human genomic DNA.
SRF | Spectral Repeat Finder (SRF) is a program that finds repeats by analysing the power spectrum of a given DNA sequence. A repeat here means the repeated occurrence of a segment of N nucleotides within a DNA sequence. SRF is an ab initio technique: no prior assumptions need to be made about the repeat length, its fidelity, or whether the repeats are in tandem (Sharma D, Issac B, Raghava GP, Ramaswamy R. (2004) Bioinformatics 20(9):1405-12).
GWFASTA | Genome-wide sequence similarity search using FASTA. It allows users to search their sequences against sequenced genomes and their product proteomes, and integrates various tools for analysing FASTA search results (Issac, B. and Raghava, G.P.S. (2002) Biotechniques 33:548-56).
GeneBench | A suite of datasets and tools for evaluating gene prediction methods.
MyPattern | MyPattern Finder is a program for detecting a motif in a DNA sequence using either an exact search method (Option A (1.0)) or an alignment technique (Option B (1.0)).
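
Fourier-based tools such as FTG and SRF rely on the fact that periodic signals in DNA show up as peaks in the power spectrum; protein-coding regions in particular tend to show a peak at a period of three nucleotides. The sketch below illustrates that general idea only. It is not the implementation used by FTG or SRF, and the function names, the window size and the step size are assumptions chosen for illustration.

```python
import cmath


def period3_power(seq: str) -> float:
    """Power at frequency 1/3, summed over the four base indicator sequences.

    Each base (A, C, G, T) is treated as a 0/1 indicator signal, and the
    squared magnitude of its Fourier coefficient at period 3 is accumulated.
    The result is normalised by the sequence length (an illustrative choice).
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * pos / 3)
            for pos, nt in enumerate(seq)
            if nt == base
        )
        total += abs(coeff) ** 2
    return total / n


def sliding_period3(seq: str, window: int = 30, step: int = 3):
    """Return (start, power) pairs for windows slid along the sequence."""
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Windows with a high value are candidates for closer inspection; a real predictor would also need a threshold calibrated on known coding and non-coding sequences.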


Meta-servers, web-servers and mirroring of web-servers and databases

 

Name | Can be used for | Algorithm | References
GeneMark | Archaea, metagenomes, eukaryotes, viruses, phages, plasmids, EST and cDNA | Hidden Markov model | Besemer J. and Borodovsky M. Nucleic Acids Research, 2005, Vol. 33, Web Server Issue, pp. W451-454
GeneHacker | Microbial genomes | Markov model | Yada T., Hirosawa M. DNA Res., 3, 335-361 (1996); Syst. Mol. Biol. pp. 252-260 (1996); Syst. Mol. Biol. pp. 354-357 (1997).
GeneWalker | Human | Hidden Markov model |
HMMgene (v. 1.1) | Vertebrates and C. elegans | Hidden Markov model | A. Krogh: In Proc. of Fifth Int. Conf. on Intelligent Systems for Molecular Biology, ed. Gaasterland, T. et al., Menlo Park, CA: AAAI Press, 1997, pp. 179-186.
Chemgenome2.0 | Prokaryotes | Ab initio method | Poonam Singhal, B. Jayaram, Surjit B. Dixit and David L. Beveridge. Prokaryotic Gene Finding based on Physicochemical Characteristics of Codons Calculated from Molecular Dynamics Simulations. Biophysical Journal, 2008, Volume 94, Issue 11, 4173-4183.
Softberry Server | Bacteria, viruses and eukaryotes | HMM and similarity-based searches | Solovyev V.V., Salamov A.A., Lawrence C.B. Nucl. Acids Res., 1994, 22, 24, 5156-5163.
Gene ID | Animals, human, plants, fungi, protists | Neural network | Blanco et al., Genome Research 10(4):511-515 (2000).
GenScan | Vertebrates, Arabidopsis, maize | Ab initio method | Burge and Karlin (1998) Curr. Opin. Struct. Biol. 8, 346-354.
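
Several of the servers above are based on hidden Markov models. As a rough illustration of how such a model labels a sequence, the sketch below runs the Viterbi algorithm over a toy two-state model (coding versus non-coding). All probabilities are made-up values for illustration, not parameters from any of the listed tools, and real gene finders use far richer state structures (codon positions, exons, introns, splice sites).

```python
import math

# Hypothetical toy parameters: a GC-rich "coding" state and an
# AT-rich "noncoding" state. Only the symbols A, C, G and T are handled.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq: str) -> list:
    """Return the most probable state label for each nucleotide."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for nt in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][nt])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi` on an AT-rich stretch followed by a GC-rich stretch labels the first part as non-coding and the second as coding, provided the GC-rich stretch is long enough to outweigh the cost of switching states.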

 

Web Interface on Libraries

 

Standalone Software

Name | Can be used for | Algorithm | References
GenomeThreader | Plants | Similarity-based gene prediction program in which additional cDNA/EST and/or protein sequences are used to predict gene structures via spliced alignments. | Gremme et al. Information and Software Technology, 47(15):965-978, 2005
JIGSAW (formerly "Combiner") | Eukaryotes | Combines multiple sources of evidence (output from gene finders, splice site prediction programs and sequence alignments) to predict gene models. | Allen et al. Genome Biology 2007, 7(Suppl):S9; Allen and Salzberg. Bioinformatics 21(18):3596-3603, 2005; Allen et al. Genome Research, 14(1), 2004.
GlimmerHMM | Eukaryotes | GlimmerHMM is based on a Generalized Hidden Markov Model (GHMM). Although the gene finder conforms to the overall mathematical framework of a GHMM, it additionally incorporates splice site models adapted from the GeneSplicer program and a decision tree adapted from GlimmerM. It also uses Interpolated Markov Models for the coding and noncoding models. Currently, GlimmerHMM's GHMM structure includes introns of each phase, intergenic regions, and four types of exons (initial, internal, final, and single). | Majoros et al. Bioinformatics 20, 2878-2879, 2004
GeneZilla | Eukaryotes | GeneZilla is based on the Generalized Hidden Markov Model (GHMM). It evolved out of the ab initio eukaryotic gene finder TIGRscan, which was developed at The Institute for Genomic Research. | GeneZilla (formerly "TIGRscan") is briefly described in: Majoros W, et al. (2004) Bioinformatics 20, 2878-2879. The novel decoding algorithm used by GeneZilla is described in: Majoros W. et al. (2005) BMC Bioinformatics 5:616.
Twinscan/N-SCAN (Ver 4.1.2) | Twinscan: mammals, Caenorhabditis (worm), dicot plants, and Cryptococci. N-SCAN: human and Drosophila | TWINSCAN extends the probability model of GENSCAN, allowing it to exploit homology between two related genomes. Separate probability models are used for conservation in exons, introns, splice sites, and UTRs, reflecting the differences among their patterns of evolutionary conservation. N-SCAN (a.k.a. TWINSCAN 3.0) models the phylogenetic relationships between the aligned genome sequences, context-dependent substitution rates, and insertions and deletions. N-SCAN has been used to generate predictions for the entire human genome and the genome of the fruit fly Drosophila melanogaster. | TWINSCAN: Korf I, Flicek P, et al. Bioinformatics. 2001;17 Suppl 1:S140-8. N-SCAN: Gross and Brent. J Comput Biol. 2006 Mar;13(2):379-93.
Manatee | Prokaryotic and eukaryotic genomes | Manatee is a web-based gene evaluation and genome annotation tool that can view, modify, and store annotation for prokaryotic and eukaryotic genomes. The Manatee interface allows biologists to quickly identify genes and make high-quality functional assignments using a multitude of genome analysis tools. These tools include, but are not limited to, GO classifications, BER and BLAST search data, paralogous families, and annotation suggestions generated from automated analysis. | NA
EvoGene | NA | Alignment of multiple genomic sequences | Pedersen and Hein. Bioinformatics (in press)
CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) | Prokaryotes | CRITICA combines traditional approaches to the problem with a novel comparative analysis. If, in a nucleotide alignment, a pair of ORFs can be found whose conceptual translation products are more conserved than would be expected from the amount of conservation at the nucleotide level, this is evolutionary evidence that the DNA sequences are protein coding. Regions found by this method are used to generate traditional dicodon frequencies for further analysis and to predict probable protein-coding regions. | Badger and Olsen. Molecular Biology and Evolution, 16(4):512-524, 1999.
sgp2 | | sgp2 predicts genes by comparing anonymous genomic sequences from two different species. It combines tblastx, a sequence similarity search program, with geneid, an "ab initio" gene prediction program. | Parra et al. Genome Research 13(1):108-117 (2003)
Phat | Eukaryotes (Homo sapiens, Plasmodium falciparum, Plasmodium vivax) | Phat is an HMM-based gene finder, originally developed for gene finding in Plasmodium falciparum. | Unpublished
EuGène | Eukaryotes | EuGène exploits probabilistic models such as Markov models to discriminate coding from non-coding sequences and true splice sites from false ones (using various mathematical models). | LNCS 2066, pp. 111-125, 2001
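
The comparative signal described for CRITICA (translated products more conserved than the underlying nucleotides) can be illustrated with a small sketch. It is a crude proxy, not CRITICA's actual statistic: it simply subtracts nucleotide identity from protein identity for two ORFs that are assumed to be already aligned, gap-free and in the same reading frame. The function names are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(orf: str) -> str:
    """Translate an in-frame ORF; trailing bases short of a codon are ignored."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, usable, 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def conservation_gap(orf_a: str, orf_b: str) -> float:
    """Protein identity minus nucleotide identity for two aligned ORFs.

    A clearly positive value means the differences are mostly synonymous,
    which is the kind of evidence for protein coding described above.
    """
    protein = identity(translate(orf_a), translate(orf_b))
    nucleotide = identity(orf_a.upper(), orf_b.upper())
    return protein - nucleotide
```

For example, `ATGGCTGAAAAA` and `ATGGCCGAGAAG` differ at three third-codon positions yet both translate to `MAEK`, so the proteins are fully conserved while the nucleotides are not.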