GeneBench: EVALUATION OF GENE FINDERS AND DATASET CREATION SERVER

INSTITUTE OF MICROBIAL TECHNOLOGY
BIOINFORMATICS CENTER

Format of Sequences:

The nucleotide sequence file should be a multi-FASTA file in which each header line begins with ">" followed by the name of the sequence, exactly as it appears in the annotation file.
For example:

>MUMSEQ
ACGTCGGAGCGAGCACTTTGCAGGCACTAGCATCAGCTACGACTACGA
GGCTAGCATCGAGCGAGGCGACGAGCGACATCAGCTACGACTACGACA
AGCTAGCAGCACACTAGCATCCG
>RATSEQ
GTCGACAGCGAGCGAGCGAGCGAGCATCGACTAGCAGCGACGAGCGAC
GCTAGCGAGCGACTACGATCGACACGATCGACTACGATCGATCAGCAT
GCTAGCATCGATCAGCTACGACTAGCTACGACTAGCATCGACTAGAAA
GGGGCATAGGGGGAAAAAAAAAATCGC
>HUMSEQ
GCGCGCGCATGACAGCGAGGCCCCCCCCCATGACTAGCACGACTAGCA
GGGGGATCAGCTAGCAAAAAAAATAGATCGAGCGGCGACGCGTTAGAA
GCTAGCGACGAGCAGCGCCGGCGCGCGCGCGCGCGCGAAGATA
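
As an illustration of the format above, a multi-FASTA file can be read into a name-to-sequence mapping with a few lines of Python. This is only a minimal sketch and not the parser used by the GeneBench server; the function name `read_multi_fasta` is hypothetical, and the sketch assumes that the sequence name is the first word after ">".

```python
def read_multi_fasta(text):
    """Return {sequence_name: sequence} parsed from multi-FASTA text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line without a sequence name")
            # Assumption: the name is the first word after ">".
            name = fields[0]
            records[name] = []
        elif name is not None:
            # Sequence data may span several lines; collect the pieces.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = """>MUMSEQ
ACGT
GGCT
>RATSEQ
GTCG
"""
seqs = read_multi_fasta(example)
```

The names returned here (MUMSEQ, RATSEQ) are the ones that must match the first field of each line in the annotation file.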

Format of annotation file:
The annotation file should list all the exons of a sequence on a single line: the name of the sequence, the total length of the sequence, the start of the first exon, the end of the first exon, the start of the next exon, the end of the next exon, and so on through the last exon. Fields are separated by spaces.
For example:
```
MUMSEQ 5241  10 211 2110 3456 
RATSEQ 11200 2100 3987 4532 5909 6323 9012
HUMSEQ 54125 509 2100 7654 11098 25098 26786 31087 32943
```
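A line in this format can be split into a name, a length, and a list of (start, end) exon pairs. The sketch below is one possible reading of the format, not the server's own code; `parse_annotation_line` is a hypothetical helper, and the check for an even number of coordinates is an added assumption.

```python
def parse_annotation_line(line):
    """Parse 'NAME LENGTH start1 end1 start2 end2 ...' into its parts."""
    fields = line.split()  # tolerates repeated or trailing spaces
    if len(fields) < 2:
        raise ValueError("expected a sequence name and a sequence length")
    name, length = fields[0], int(fields[1])
    coords = [int(value) for value in fields[2:]]
    if len(coords) % 2 != 0:
        raise ValueError("every exon needs both a start and an end")
    # Pair up consecutive coordinates as (start, end).
    exons = list(zip(coords[0::2], coords[1::2]))
    return name, length, exons


name, length, exons = parse_annotation_line("MUMSEQ 5241  10 211 2110 3456 ")
```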
The prediction file has the same format as the annotation file, except that the length of the sequence is not given. If a sequence has no predicted exons, leave the rest of the line after the sequence name empty.
For example:
```
MUMSEQ 10 211 2110 3456 
RATSEQ 
HUMSEQ 509 2100 7654 11098 25098 26786 31087 32943
```
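
A prediction line can be parsed in the same spirit, with the one difference that there is no length field and the exon list may be empty. Again, this is a hedged sketch with a hypothetical function name (`parse_prediction_line`), not the server's implementation.

```python
def parse_prediction_line(line):
    """Parse 'NAME start1 end1 ...'; a bare name means no predicted exons."""
    fields = line.split()
    if not fields:
        raise ValueError("expected at least a sequence name")
    name = fields[0]
    coords = [int(value) for value in fields[1:]]
    if len(coords) % 2 != 0:
        raise ValueError("every exon needs both a start and an end")
    exons = list(zip(coords[0::2], coords[1::2]))
    return name, exons


with_exons = parse_prediction_line("MUMSEQ 10 211 2110 3456 ")
without_exons = parse_prediction_line("RATSEQ ")
```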

Format of ROC file:

The ROC file should have two columns, for the X and Y axes, giving the specificity and sensitivity respectively. The values should be in the range 0 to 1; if your values are percentages (0 to 100%), divide them by 100 before analysis.
For example:

  Specificity           Sensitivity (row not required)
------------------------------------(row not required)
  1.00                      0.00
  0.99                      0.01
  0.95                      0.45
  0.90                      0.80
  0.88                      0.87
  0.70                      0.90
  0.50                      0.95
  0.15                      0.98
  0.05                      0.99
  0.00                      1.00
------------------------------------(row not required)
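
The following sketch shows one way to prepare such a file before upload: it keeps only rows made of two numbers, optionally rescales percentages, and rejects values outside 0 to 1. The function name `read_roc_points` and its `percent` flag are hypothetical and are not part of GeneBench.

```python
def read_roc_points(text, percent=False):
    """Return a list of (specificity, sensitivity) pairs in the range 0 to 1."""
    points = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank lines, rulers and annotated header rows
        try:
            spec, sens = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # skip a plain two-word header row
        if percent:
            # Values given as 0 to 100% are divided by 100.
            spec, sens = spec / 100, sens / 100
        if not (0.0 <= spec <= 1.0 and 0.0 <= sens <= 1.0):
            raise ValueError("value outside the range 0 to 1: " + line)
        points.append((spec, sens))
    return points


fractions = read_roc_points("Specificity Sensitivity\n1.00 0.00\n0.50 0.95\n")
from_percent = read_roc_points("100 0\n50 95\n", percent=True)
```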

Proset Parameters:
The PROSET program requires a list of protein sequences, which are then aligned with a pairwise alignment algorithm on all pairs of proteins.
R Min: the minimum length of mer to be searched for in the pair of sequences being aligned.
R Max: the maximum length of mer to be searched for in the pair of sequences being aligned.
BD Min: the minimum block sequence identity between a pair of aligned sequences; above this value only the longer sequence of the pair is retained.
Number of sequences:
This is the total number of sequences being uploaded for non-homologous dataset creation. The number is required for cross-checking purposes during the pre-alignment processing before dataset creation by GeneBench server.
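Before uploading, the number to enter can be checked against the file itself. The sketch below assumes that the protein sequences are supplied in FASTA format with one ">" header per sequence, which this page does not state explicitly; `check_sequence_count` is a hypothetical helper and does not reproduce the server's own cross-check.

```python
def check_sequence_count(fasta_text, declared_count):
    """Compare the declared number of sequences with the number of headers."""
    # Assumption: each sequence starts with exactly one ">" header line.
    found = sum(1 for line in fasta_text.splitlines() if line.startswith(">"))
    return found == declared_count, found


matches, found = check_sequence_count(">P1\nMKV\n>P2\nMVL\n", 2)
```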
Receiver Operating Characteristic (ROC):
The Receiver Operating Characteristic (ROC) is used as a threshold-independent measure of performance. It is obtained by plotting the sensitivity values (true positive fraction) against the "1 - specificity" values (false positive fraction) for various thresholds. The area under the ROC curve (AUC) is usually taken to be an important index because it provides a single measure of overall predictive accuracy that does not depend on a particular threshold. For a predictor that performs no worse than chance, the value of the AUC is between 0.5 and 1.0. If the value is 0.5, the scores for the two groups do not differ, while a value of 1.0 indicates no overlap in the distributions of the group scores. Typically, values of the AUC will not reach these limits. An AUC of 0.8 means that 80% of the time a randomly selected member of the positive group will have a score greater than a randomly selected member of the negative group.
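
To make the definition concrete, the area can be approximated from a list of (specificity, sensitivity) pairs with the trapezoidal rule. This is a minimal sketch under that assumption; this page does not say which integration method the GeneBench server uses, and `auc_from_roc` is a hypothetical name.

```python
def auc_from_roc(points):
    """Trapezoidal area under sensitivity plotted against 1 - specificity."""
    # Convert each (specificity, sensitivity) pair to
    # (false positive fraction, sensitivity) and order along the X axis.
    curve = sorted((1.0 - spec, sens) for spec, sens in points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Points taken from the ROC file example on this page.
roc_points = [
    (1.00, 0.00), (0.99, 0.01), (0.95, 0.45), (0.90, 0.80), (0.88, 0.87),
    (0.70, 0.90), (0.50, 0.95), (0.15, 0.98), (0.05, 0.99), (0.00, 1.00),
]
print(round(auc_from_roc(roc_points), 3))
```

A diagonal curve running from (specificity 1, sensitivity 0) to (specificity 0, sensitivity 1) gives an area of 0.5 with this function, and a curve that passes through (specificity 1, sensitivity 1) gives 1.0, matching the two limits described above.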