GENEBENCH: EVALUATION OF GENE FINDERS AND DATASET CREATION SERVER

INSTITUTE OF MICROBIAL TECHNOLOGY
BIOINFORMATICS CENTER

DATASETS AVAILABLE IN GENEBENCH SERVER:
Original Sites	Local Site
HMR195 Dataset (Rogic et al., 2001)	download
Burset/Guigo96 Dataset (Burset and Guigo, 1996)	download
Reese/Kulp Human Dataset (Reese and Kulp, 1996)	download
Reese/Kulp Drosophila melanogaster Dataset (Reese and Kulp, 1996)	download
Guigo2000 Dataset (Guigo et al., 2000)	download
Fickett-Tung92 Dataset (Fickett and Tung, 1992)	download
Drosophila AdH Region Dataset (Reese et al., 2000)	download

*Datasets for local use were downloaded on 15 March 2004.

Brief description of datasets available in GENEBENCH server:

HMR195 Dataset:

DNA sequences were extracted from GenBank release 111.0 (April 1999) up to the date of the study (Rogic et al., 2001). The source organisms were H. sapiens, M. musculus, and R. norvegicus. The ratio of human:mouse:rat sequences is 103:82:10. The mean sequence length is 7096 bp. There are 43 single-exon genes and 152 multi-exon genes, with an average of 4.86 exons per gene, a mean exon length of 208 bp, and a mean intron length of 1015 bp. Coding sequence makes up 14% of the dataset, against 46% non-coding intron sequence and 40% intergenic region.
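The HMR195 figures above can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable names are ours and are not part of the GENEBENCH server, and the estimated totals are derived from the rounded means quoted above, so they are approximations rather than values reported by Rogic et al.

```python
# Figures quoted in the HMR195 description above.
species_counts = {"human": 103, "mouse": 82, "rat": 10}
single_exon_genes = 43
multi_exon_genes = 152
mean_length_bp = 7096
mean_exons_per_gene = 4.86
mean_exon_length_bp = 208

# Both totals should match the "195" in the dataset name.
total_sequences = sum(species_counts.values())
total_genes = single_exon_genes + multi_exon_genes

# Rough estimates built from rounded means (assumption: the mean exon
# length applies to the exons counted in the per-gene average).
approx_total_bp = total_sequences * mean_length_bp
approx_coding_bp = total_genes * mean_exons_per_gene * mean_exon_length_bp
approx_coding_fraction = approx_coding_bp / approx_total_bp

print(total_sequences, total_genes)
print(f"approximate coding fraction: {approx_coding_fraction:.1%}")
```

The estimated coding fraction lands close to the 14% quoted above, which suggests the summary statistics are mutually consistent.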
Burset/Guigo96 Dataset:

The DNA sequences were extracted from the vertebrate divisions of GenBank release 85.0 (October 15, 1994), so all source organisms are vertebrates. A total of 570 sequences remained after the clean-up procedure (Burset and Guigo, 1996), totalling 2,892,149 bp. They contain 2649 coding exons, corresponding to 444,498 coding bp (~15%). All sequences contain multi-exon genes.
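The "~15%" coding fraction follows directly from the two base-pair totals given above. A minimal sketch of that calculation, with two further ratios derived from the same counts (the variable names and the derived means are ours, not figures stated by Burset and Guigo):

```python
# Totals quoted in the Burset/Guigo96 description above.
total_bp = 2_892_149
coding_bp = 444_498
coding_exons = 2649
sequences = 570

coding_fraction = coding_bp / total_bp             # share of coding bases
mean_coding_exon_bp = coding_bp / coding_exons     # derived, not from the source
mean_exons_per_sequence = coding_exons / sequences  # derived, not from the source

print(f"coding fraction: {coding_fraction:.1%}")
print(f"mean coding exon length: {mean_coding_exon_bp:.0f} bp")
print(f"coding exons per sequence: {mean_exons_per_sequence:.2f}")
```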
Reese/Kulp Human Dataset:

GENIE gene finding data set, containing a total of 793 unrelated human genes. This data set was used to train the GENIE gene finding system (WWW-access) developed at LBNL and UC Santa Cruz. The last update was done in March 1998 using GenBank v.105. Please see the documentation file for further information, including links to previous versions of this data collection.
Reese/Kulp Dataset of Drosophila melanogaster:

The GENIE set of unrelated D. melanogaster genes. This data set was used to train the GENIE gene finding system developed at LBNL and UCSC. The last update was done in October 1998 using GenBank v.109. Please see the documentation file for further information, including links to previous versions of this data collection. This data set was developed by Martin Reese (LBNL) with help from Uwe Ohler (University of Erlangen), David Kulp (UCSC) and Andrew Gentles (Stanford). It has 416 gene sequences, comprising 275 multi-exon and 141 single-exon genes.
Guigo2000 Dataset:

Two sets of sequences were developed. The first is a typical benchmark set drawn from the EMBL database release 50 (1997), comprising 178 human genomic sequences (h178) that each code for a single complete gene for which both the mRNA and the coding exons are known. The second is a semi-artificial set of 42 genomic sequences in which accurate gene annotation is guaranteed. The h178 set has a G+C content of 50% and an average length of 7169 bp, with 1 gene and 5.1 exons per sequence. The semi-artificial sequences have a G+C content of 40% and an average length of 177160 bp, with 4.1 genes and 21 exons per sequence on average.
Fickett-Tung92 Dataset:

All data were taken from the GenBank collection of human nucleotide sequence data on May 30, 1992. All E. coli sequences were extracted on June 28, 1992. For the primary benchmark, successive non-overlapping windows of length 54 bases were taken from all human genomic sequences. Windows of length 108 and 162 were also obtained. Each set of windows is split in two: one part is used for training and the other for testing accuracy.
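The windowing step described above can be sketched as follows. This is a minimal illustration, not the procedure of Fickett and Tung: the function names are hypothetical, discarding a trailing fragment shorter than the window is an assumption, and the alternating train/test split is also an assumption, since the description above does not say how the split was made.

```python
from typing import List, Sequence, Tuple


def non_overlapping_windows(seq: str, size: int) -> List[str]:
    """Return successive non-overlapping windows of exactly `size` bases.

    Assumption: a trailing fragment shorter than `size` is discarded.
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def split_alternating(windows: Sequence[str]) -> Tuple[list, list]:
    """Assign windows alternately to training and test sets (assumed scheme)."""
    return list(windows[0::2]), list(windows[1::2])


# Toy 240-base sequence; real input would be human genomic sequence.
toy_sequence = "ACGT" * 60
for size in (54, 108, 162):
    windows = non_overlapping_windows(toy_sequence, size)
    train, test = split_alternating(windows)
    print(size, len(windows), len(train), len(test))
```

A sequence shorter than the window size simply yields no windows, so short entries drop out of the benchmark without special handling.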
Drosophila AdH Region Dataset:

This dataset was used for the Genome Annotation Assessment Project (GASP) 2000 in Drosophila (Reese et al., 2000). The Adh region is 2.9 Mb in total and is presently estimated to contain over 200 genes.