INSTITUTE OF MICROBIAL TECHNOLOGY
|DATASETS AVAILABLE IN GENEBENCH SERVER:|
|Original Sites||Local Site|
|HMR195 Dataset (Rogic et al., 2001)||download|
|Burset/Guigo96 Dataset (Burset and Guigo, 1996)||download|
|Reese/Kulp Human Dataset (Reese and Kulp, 1996)||download|
|Reese/Kulp Drosophila melanogaster Dataset (Reese and Kulp, 1996)||download|
|Guigo2000 Dataset (Guigo et al., 2000)||download|
|Fickett-Tung92 Dataset (Fickett and Tung, 1992)||download|
|Drosophila Adh Region Dataset (Reese et al., 2000)||download|
The HMR195 DNA sequences were extracted from GenBank release 111.0 (April 1999), up to the date of the study (Rogic et al., 2001). The source organisms were H. sapiens, M. musculus and R. norvegicus, with a human:mouse:rat sequence ratio of 103:82:10. The mean sequence length is 7096 bp. There are 43 single-exon genes and 152 multi-exon genes, with an average of 4.86 exons per gene, a mean exon length of 208 bp and a mean intron length of 1015 bp. Coding sequence makes up 14% of the data, against 46% non-coding intron sequence and 40% intergenic region.
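The figures quoted above can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable and function names are made up for this example, and the numbers are copied from the description rather than recomputed from the sequence files. The species counts and the gene counts both sum to 195, which is consistent with the dataset name and with one gene per sequence, although the description does not state the latter explicitly.

```python
# Counts quoted in the HMR195 description (not recomputed from the data files).
species_counts = {"H. sapiens": 103, "M. musculus": 82, "R. norvegicus": 10}
single_exon_genes = 43
multi_exon_genes = 152

# Region proportions in percent, as quoted in the description.
region_percent = {"coding": 14, "intron": 46, "intergenic": 40}

total_sequences = sum(species_counts.values())
total_genes = single_exon_genes + multi_exon_genes


def species_fractions(counts):
    """Return each species' share of the dataset as a fraction of 1."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    # Sequence total, gene total, and the sum of the region percentages.
    print(total_sequences, total_genes, sum(region_percent.values()))
    for name, fraction in species_fractions(species_counts).items():
        print(f"{name}: {fraction:.1%}")
```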
The Burset/Guigo96 DNA sequences were extracted from the vertebrate divisions of GenBank release 85.0 (October 15, 1994), so all source organisms are vertebrates. A total of 570 sequences remained after the clean-up procedure (Burset and Guigo, 1996), totalling 2,892,149 bp. They contain 2649 coding exons, corresponding to 444,498 coding bp (~15%). All sequences contain multi-exon genes.
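As a worked check of the "~15%" figure, the coding fraction and a few other per-sequence averages can be derived from the totals quoted above. This is a minimal sketch; the variable names are illustrative, and the derived averages are simple ratios of the quoted totals, not values reported by Burset and Guigo.

```python
# Totals quoted in the Burset/Guigo96 description.
total_bp = 2_892_149
coding_bp = 444_498
n_sequences = 570
n_coding_exons = 2649

# Fraction of all bases that are coding (quoted as roughly 15%).
coding_fraction = coding_bp / total_bp

# Derived averages: plain ratios of the totals above.
mean_sequence_length = total_bp / n_sequences
mean_exons_per_sequence = n_coding_exons / n_sequences
mean_coding_exon_length = coding_bp / n_coding_exons

if __name__ == "__main__":
    print(f"coding fraction:         {coding_fraction:.1%}")
    print(f"mean sequence length:    {mean_sequence_length:.0f} bp")
    print(f"coding exons / sequence: {mean_exons_per_sequence:.2f}")
    print(f"mean coding exon length: {mean_coding_exon_length:.0f} bp")
```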
The GENIE gene-finding data set contains a total of 793 unrelated human genes. This data set was used to train the GENIE gene-finding system (WWW-access) developed at LBNL and UC Santa Cruz. The last update was done in March 1998 using GenBank v.105. Please see the documentation file for further information, including links to previous versions of this data collection.
The GENIE set of unrelated D. melanogaster genes. This data set was used to train the GENIE gene-finding system developed at LBNL and UCSC. The last update was done in October 1998 using GenBank v.109. Please see the documentation file for further information, including links to previous versions of this data collection. The data set was developed by Martin Reese (LBNL) with help from Uwe Ohler (University of Erlangen), David Kulp (UCSC) and Andrew Gentles (Stanford). It has 416 gene sequences, comprising 275 multi-exon and 141 single-exon genes.
For the Guigo2000 dataset, two sets of sequences were developed. The first is a typical benchmark set drawn from EMBL database release 50 (1997): 178 human genomic sequences (h178), each coding for a single complete gene for which both the mRNA and the coding exons are known. The second is a semi-artificial set of 42 genomic sequences in which accurate gene annotation is guaranteed. The h178 set has a G+C content of 50% and an average sequence length of 7169 bp, with 1 gene and 5.1 exons per sequence. The semi-artificial sequences have an average length of 177160 bp, with 4.1 genes and 21 exons per sequence on average, and a G+C content of 40%.
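G+C content, as quoted for both sets, is the fraction of bases that are G or C. A minimal sketch of how it might be computed for a sequence is given below. The function name is hypothetical, and the choice to ignore ambiguous bases such as N is an assumption made for this example; the source does not say how the published figures treated them.

```python
def gc_content(sequence: str) -> float:
    """Return the G+C fraction of a DNA sequence.

    Assumption for this sketch: only unambiguous bases (A, C, G, T) are
    counted, so characters such as N do not affect the result.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / total


if __name__ == "__main__":
    # Toy sequence, not taken from the dataset: 4 of 8 bases are G or C.
    print(gc_content("GGCCAATT"))  # 0.5
```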
For the Fickett-Tung92 dataset, all data were taken from the GenBank collection of human nucleotide sequence data on May 30, 1992. All E. coli sequences were extracted on June 28, 1992. For the primary benchmark, successive non-overlapping windows of length 54 bases were taken from all human genomic sequences. Windows of length 108 and 162 bases were also obtained. Each set of windows was split in two, with one part used for training and the other for testing accuracy.
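The windowing step described above can be sketched as follows. This is an illustration only, not the original procedure: the function names are hypothetical, dropping a trailing partial window is an assumption, and so is the seeded random half-and-half split, since the description does not say how the windows were divided between training and testing.

```python
import random


def non_overlapping_windows(sequence, window_length=54):
    """Yield successive non-overlapping windows of a fixed length.

    Assumption: a trailing fragment shorter than window_length is dropped.
    """
    for start in range(0, len(sequence) - window_length + 1, window_length):
        yield sequence[start:start + window_length]


def split_windows(windows, train_fraction=0.5, seed=0):
    """Shuffle the windows and split them into training and test parts.

    Assumption: a seeded random split; the source does not specify the method.
    """
    windows = list(windows)
    random.Random(seed).shuffle(windows)
    cut = int(len(windows) * train_fraction)
    return windows[:cut], windows[cut:]


if __name__ == "__main__":
    toy_sequence = "ACGT" * 81  # 324 bases, a made-up example sequence
    for length in (54, 108, 162):
        windows = list(non_overlapping_windows(toy_sequence, length))
        train, test = split_windows(windows)
        print(length, len(windows), len(train), len(test))
```

The same two functions cover all three window lengths mentioned in the text; only the `window_length` argument changes.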
The Drosophila Adh region dataset was used for the Genome Annotation Assessment Project (GASP) 2000 in Drosophila (Reese et al., 2000). The total size of the Adh region is 2.9 Mb, and it is presently estimated to contain over 200 genes.