
ALGORITHM

Data Sets
We use the data of Hua and Sun (2001), extracted from SWISS_PROT version 33.0. The dataset consists of complete, non-redundant proteins with less than 90% sequence identity whose function has been experimentally determined. It contains a total of 670 Gram-negative bacterial proteins (255 cellular process proteins, 60 information molecule proteins, 285 metabolism proteins and 70 virulence factor proteins).


SVM
The SVM was implemented using the freely downloadable software package SVM_light, written by Joachims (Joachims 1999). The software enables the user to define a number of parameters and to select from a choice of built-in kernel functions, including a radial basis function (RBF) kernel and a polynomial kernel.
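As an illustration of how feature vectors could be handed to SVM_light, the sketch below writes examples in its sparse input format (`<target> <feature>:<value> ...`, with feature numbers starting at 1 and in increasing order). The function names are hypothetical, and the binary +1/-1 labelling (for example, one functional class against the rest) is an assumption; the text does not say how the four classes were encoded.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVM_light input line.

    label: +1 or -1 (binary target; an assumed encoding).
    features: sequence of floats; feature numbers start at 1.
    Zero-valued features are omitted, which the sparse format allows.
    """
    parts = ["+1" if label > 0 else "-1"]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:.6g}")
    return " ".join(parts)


def write_svmlight_file(path, labels, vectors):
    """Write labelled feature vectors to `path`, one example per line."""
    with open(path, "w") as handle:
        for label, vector in zip(labels, vectors):
            handle.write(to_svmlight_line(label, vector) + "\n")
```

A file written this way can be passed to `svm_learn`, where `-t 2` selects the RBF kernel and `-t 1` the polynomial kernel. The kernel parameters used in this study are not stated here, so none are suggested.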

Evaluation Modules
The performance of the modules constructed in this study was evaluated using a 5-fold cross-validation technique. In 5-fold cross-validation, the relevant dataset is partitioned randomly into five equally sized sets. Training and testing are carried out five times, each time using one distinct set for testing and the remaining four sets for training. To evaluate the performance of the various modules, accuracy and the Matthews correlation coefficient (MCC) were calculated using the following equations:
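In terms of the symbols p(x), n(x), u(x), o(x) and exp(x), the two measures take their usual per-class form. The block below is a reconstruction from those symbol definitions and the standard MCC formula, given as an assumption rather than a quotation:

```latex
\[
\mathrm{Accuracy}(x) = \frac{p(x)}{\mathrm{exp}(x)}
\]
\[
\mathrm{MCC}(x) =
\frac{p(x)\,n(x) - u(x)\,o(x)}
     {\sqrt{\bigl(p(x)+u(x)\bigr)\bigl(p(x)+o(x)\bigr)\bigl(n(x)+u(x)\bigr)\bigl(n(x)+o(x)\bigr)}}
\]
```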


where x can be any functional class (cellular process, information molecule, metabolism or virulence factor), exp(x) is the number of sequences observed in functional class x, p(x) is the number of correctly predicted sequences of class x, n(x) is the number of correctly predicted sequences not of class x, u(x) is the number of under-predicted sequences and o(x) is the number of over-predicted sequences.
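The evaluation procedure can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, the shuffling seed is arbitrary, and u(x) and o(x) are read as false negatives and false positives for class x, which is the usual interpretation of "under-predicted" and "over-predicted".

```python
import math
import random


def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each fold.

    Indices are shuffled once, then dealt into n_folds sets whose sizes
    differ by at most one item.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def class_metrics(observed, predicted, x):
    """Return (accuracy, MCC) for class x from two parallel label lists."""
    pairs = list(zip(observed, predicted))
    p = sum(1 for obs, pred in pairs if obs == x and pred == x)  # correct, class x
    n = sum(1 for obs, pred in pairs if obs != x and pred != x)  # correct, not class x
    u = sum(1 for obs, pred in pairs if obs == x and pred != x)  # under-predicted (assumed: missed x)
    o = sum(1 for obs, pred in pairs if obs != x and pred == x)  # over-predicted (assumed: wrongly called x)
    exp_x = p + u  # number of sequences observed in class x
    accuracy = p / exp_x if exp_x else 0.0
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    mcc = (p * n - u * o) / denom if denom else 0.0
    return accuracy, mcc
```

In a full run, a classifier would be trained on each `train` set and its predictions on the matching `test` set would be passed to `class_metrics` once per functional class.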

Prediction Approaches
Ab-initio Patterns : Peptides of four residues (4-mers) were generated from the training-set sequences (excluding the testing sequences) of the four functional classes, and the unique peptides among these 4-mers were extracted. The frequency of each unique 4-mer in each of the four functional classes was then counted. The 4-mers were categorized by frequency and termed patterns if they occurred frequently enough: peptides with a frequency greater than 5 for cellular process proteins, greater than 2 for information molecules, greater than 5 for metabolism proteins and greater than 3 for virulence factors were termed patterns and were used in the present study.
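The pattern extraction described above can be sketched roughly as below. Several details are assumptions: 4-mers are taken from overlapping windows, "frequency" is read as the total number of occurrences within a class's training sequences (not the number of sequences containing the peptide), and the dictionary keys and function names are illustrative only.

```python
from collections import Counter

# Frequency cut-offs quoted in the text; the key names are illustrative.
THRESHOLDS = {"cellular": 5, "information": 2, "metabolism": 5, "virulence": 3}


def count_kmers(sequences, k=4):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def select_patterns(training_sets, thresholds=THRESHOLDS, k=4):
    """Return, per class, the k-mers whose count exceeds that class's cut-off.

    training_sets maps a class name to its list of training sequences.
    """
    patterns = {}
    for name, sequences in training_sets.items():
        counts = count_kmers(sequences, k)
        patterns[name] = {kmer for kmer, c in counts.items() if c > thresholds[name]}
    return patterns
```

Because the patterns are derived from training sequences only, this selection would be repeated inside each cross-validation fold.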

Amino acid composition : Amino acid composition is the fraction of each amino acid in a protein.
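A minimal sketch of this feature, assuming the 20 standard amino acids in one-letter code; the ordering of the output vector and the handling of non-standard residues (counted in the length but given no entry of their own) are choices made here for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return 20 fractions, one per standard amino acid, in AMINO_ACIDS order."""
    length = len(sequence)
    if length == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]
```

For the sequence `"AACD"`, the entries for A, C and D are 0.5, 0.25 and 0.25, and all others are zero.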

Dipeptide composition : Dipeptide composition was used to encapsulate global information about each protein sequence, giving a fixed pattern length of 400 (20 × 20). This representation captures information about the amino acid composition along with the local order of amino acids.
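The 400-dimensional vector can be computed along the following lines. The normalisation (each dipeptide count divided by the total number of overlapping dipeptides, i.e. sequence length minus one) is the usual definition and is assumed here; the names and the ordering of the 400 entries are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 ordered pairs of standard amino acids: 400 entries.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return 400 fractions: count of each dipeptide / number of dipeptides."""
    total = len(sequence) - 1
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = sequence[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[dp] / total for dp in DIPEPTIDES]
```

Unlike amino acid composition, the vector distinguishes `"AC"` from `"CA"`, which is how it retains local order information.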


