HLP: A webserver for predicting half-life of peptides in intestine like environment




Algorithm Used

In this study we used the data of Gorris et al. (2009), which contain peptides whose half-life was experimentally determined in a crude intestinal protease preparation, for training and testing our half-life prediction models. The data are of two types: (a) 189 peptide sequences (10mer) and (b) 186 peptide sequences (16mer). The following table provides links to details about the dataset, methods, algorithms and formulae used during SVM and WEKA based model development:

Statistics of Dataset Used in Model Development
Performance of Models Used on HLP webserver
Procedure for Model Development
Calculation of Physicochemical Properties for Peptides
Scanning of Protein to Generate Peptides
Input Features for SVMlight and WEKA
Formulae Used to Evaluate Performance of Models
Results Summary




Statistics of Dataset Used in Model Development

The following table shows statistics of the peptide sequences used during the development of the half-life prediction models. A total of 189 peptides were used to develop the 10mer model and 186 peptides were used to develop the 16mer model.

Half-life Prediction Model    Unique Sequences
10mer                         189
16mer                         186




Performance of Models Used on HLP webserver

(a) Performance of SVM based Models Used on HLP webserver

Model Type   Model   Input Feature            Residues Used   Total Attributes   R      R2     MAE
SVM based    10mer   Dipeptide Composition    All residues    400                0.68   0.46   1.44
SVM based    16mer   Amino Acid Composition   All residues    20                 0.91   0.82   0.18

(b) Performance of WEKA based Models Used on HLP webserver

Model Type   Model   Input Feature           Residues Used                    Total Attributes   R      R2     MAE
WEKA based   10mer   Dipeptide Composition   EK, EL, GD, GF, IE, KP, PG, YL   8                  0.70   0.46   1.22
WEKA based   16mer   Dipeptide Composition   CG, GD, GF                       3                  0.98   0.96   0.06

For more details on the results, please refer to the manuscript.







Overview of Procedure for SVM and WEKA based Model Development

First, 10mer and 16mer peptide sequences, together with their half-life values (in seconds), were extracted from Table S2 of the Supporting Information provided by Gorris et al. (2009). The peptide sequences (10mer / 16mer) were used to calculate amino acid, dipeptide and tripeptide percentage compositions and binary patterns, which served as input vectors for the SVMlight and WEKA software. To avoid any mismatch between a peptide sequence and its half-life, each half-life value was kept linked to its sequence throughout the calculation. These input vectors were used one at a time to develop various SVM and WEKA based models. Five-fold cross-validation was used to evaluate the performance of the developed models in terms of Pearson's correlation coefficient (R), coefficient of determination (R2) and mean absolute error (MAE). The best performing 10mer and 16mer half-life prediction models are used on the HLP webserver. The whole process is shown in the figure below:

Figure 1: Flow chart showing methods used for developing peptide half-life prediction models.


Detailed Description of Procedure for SVM based Model Development

Figure 2 shows in detail the steps followed during the development of the best performing SVM based 10mer and 16mer half-life prediction models.
(A) Preparation of Input files for svm_learn:
To develop these SVM based models, peptide sequences (10mer / 16mer) were passed to our in-house Perl scripts, which calculate the input features (amino acid, dipeptide and tripeptide percentage composition, and binary pattern) for SVMlight's svm_learn binary (the program used to develop SVM based models). As an example, we take the file containing amino acid composition and half-life values. This file is randomly divided into five equal parts, Set 1 to Set 5.

Figure 2: Flow chart showing methods used for developing SVM based peptide half-life prediction models.



(B) (I) Using "svm_learn" and "svm_classify" to develop SVM based models:
According to the SVMlight documentation, svm_learn can be run in three modes: classification (-z c), regression (-z r) and preference ranking (-z p). It supports linear (-t 0), polynomial (-t 1), radial basis function (-t 2), sigmoid (-t 3) and user-defined (-t 4, from kernel.h) kernels. Users may run "svm_learn -" on their Linux terminal to list the available options. In the present study, svm_learn was run in regression mode (-z r) with the RBF kernel (-t 2) and the following parameter values for this kernel: g = 1, 0.1, 0.01, 0.001, 0.0001; c = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; j = 1, 2, 3, 4, 5. For each parameter combination, svm_learn builds Model 1 to Model 5 from the training sets (green colored sets). An example command to generate Model 1 (using parameters g=1, c=1, j=1) is:
svm_learn -z r -t 2 -g 1 -c 1 -j 1 Training_set_1 Model1
(II) Using "svm_classify" to predict half-life of peptides:
As depicted in the figure, svm_classify applies Model 1 to Model 5 to the corresponding Test sets (red colored sets) to produce the files Predicted 1 to Predicted 5, which contain the predicted half-life of the peptides in the respective Test set. Extracting the half-lives from the Test sets gives the files Actual 1 to Actual 5, which contain the actual half-life of the same peptides. The actual and predicted half-life files are pasted side by side into a single file called RESULTS. Thus, RESULTS contains tab-separated actual and predicted half-life values for Test sets 1 to 5. Our in-house Perl script uses the RESULTS file to calculate Pearson's correlation coefficient (R), coefficient of determination (R2) and mean absolute error (MAE) between the actual and predicted half-life values. An example command to run svm_classify on Model1 is:
svm_classify Test_set_1 Model1 Result_1
(C) The entire process described in step (B) is repeated for all possible combinations of parameters, giving one set of R, R2 and MAE values per combination. The models with the highest R and R2 and the lowest MAE are used on the HLP webserver.
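
Steps (A) to (C) can be sketched in Python as follows. This is a minimal illustration rather than the in-house Perl pipeline: the function names are hypothetical, the shuffle seed is arbitrary, and the code only builds the command lines quoted above (it does not run SVMlight). File names follow the Training_set_1 / Test_set_1 / Model1 / Result_1 pattern of the example commands.

```python
import itertools
import random

# Parameter values quoted in the text for the RBF kernel.
GAMMAS = [1, 0.1, 0.01, 0.001, 0.0001]
COSTS = list(range(1, 11))   # c = 1 .. 10
JS = list(range(1, 6))       # j = 1 .. 5


def five_fold_split(records, seed=0):
    """Shuffle the records and deal them into five near-equal sets (Set 1 to Set 5)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::5] for i in range(5)]


def training_and_test(folds, test_index):
    """Hold out one set for testing and merge the other four for training."""
    train = [r for i, fold in enumerate(folds) if i != test_index for r in fold]
    return train, folds[test_index]


def parameter_grid():
    """Every (g, c, j) combination from the value lists above."""
    return list(itertools.product(GAMMAS, COSTS, JS))


def commands_for(g, c, j, fold_number):
    """Build the svm_learn and svm_classify command lines for one fold."""
    learn = (f"svm_learn -z r -t 2 -g {g} -c {c} -j {j} "
             f"Training_set_{fold_number} Model{fold_number}")
    classify = (f"svm_classify Test_set_{fold_number} "
                f"Model{fold_number} Result_{fold_number}")
    return learn, classify
```

With these value lists the grid holds 5 × 10 × 5 = 250 parameter combinations, each of which is trained and tested on all five folds before R, R2 and MAE are computed.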





Calculation of Physicochemical Properties for Peptides

Each physicochemical property of a user-submitted, protein-derived or mutant peptide is calculated as the sum of the values of that property for the individual amino acids, from the first residue to the last residue of the peptide. The only exception is the isoelectric point (pI), for which the average of the isoelectric point values of the individual amino acids is taken. The sources of the per-residue property values used in these calculations are given below:

Sr. No.  Physicochemical Property                      Reference
1        HPLC parameter                                Parker et al., 1986
2        Hydrophobicity (kJ/mol)                       Prabhakaran, M. (1990), Biochem. J. 269, 691-696
3        pKa                                           Lide, D. R., Handbook of Chemistry and Physics, 72nd Edition, CRC Press, Boca Raton, FL, 1991
4        pKb                                           Lide, D. R., Handbook of Chemistry and Physics, 72nd Edition, CRC Press, Boca Raton, FL, 1991
5        Residue volume                                Goldsack-Chalifoux, 1973
6        Molecular weight                              Fasman, 1976
7        Isoelectric point (pI)                        Zimmerman et al., 1968
8        Surface accessibility                         Janin et al., J. Mol. Biol. 125, 357 (1978)
9        Flexibility                                   Bhaskaran and Ponnuswamy, 1988
10       Charge                                        example scale
11       Polarity                                      Ponnuswamy et al., Biochim. Biophys. Acta 623, 301 (1980)
12       Relative mutability                           In "Atlas of Protein Sequence and Structure", Vol. 5, Suppl. 3 (Dayhoff, M.O., ed.), National Biomedical Research Foundation, Washington, D.C., pp. 345-352
13       Free energy of solution in water (kcal/mole)  Charton, M. and Charton, B.I. (1982): The structural dependence of amino acid hydrophobicity parameters, J. Theor. Biol. 99, 629-644
14       Optical rotation                              "Handbook of Biochemistry and Molecular Biology", 3rd ed., Proteins - Volume 1, CRC Press, Cleveland (1976)
15       Entropy of formation                          In "Handbook of Biochemistry", 2nd ed. (Sober, H.A., ed.), Chemical Rubber Co., Cleveland, Ohio, pp. B60-B61 (1970)
16       Heat capacity                                 In "Handbook of Biochemistry", 2nd ed. (Sober, H.A., ed.), Chemical Rubber Co., Cleveland, Ohio, pp. B60-B61 (1970)
17       Relative stability                            Zhou, H. and Zhou, Y. (2004): Quantifying the effect of burial of amino acid residues on protein stability, Proteins 54, 315-322
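
The sum-versus-average rule described above can be sketched as follows. The per-residue numbers in this example are placeholders chosen for easy arithmetic; they are not taken from any of the references in the table, and the names TOY_SCALES, AVERAGED_PROPERTIES and peptide_property are hypothetical.

```python
# Placeholder per-residue scales for illustration only.
# These numbers are NOT the published values from the references above.
TOY_SCALES = {
    "toy_property": {"A": 1.0, "G": 0.5, "K": 2.0},
    "pI": {"A": 6.0, "G": 6.0, "K": 9.0},
}

# Properties that are averaged over residues instead of summed.
AVERAGED_PROPERTIES = {"pI"}


def peptide_property(peptide, name, scales=TOY_SCALES):
    """Sum the per-residue values over the peptide; average them for pI."""
    values = [scales[name][residue] for residue in peptide]
    total = sum(values)
    if name in AVERAGED_PROPERTIES:
        return total / len(values)
    return total
```

For the toy peptide "AGK", the summed property is 1.0 + 0.5 + 2.0 = 3.5, while pI is the mean of the three placeholder values, 7.0.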



































Scanning of Protein to Generate Peptides

Figure 3: Diagram showing generation of overlapping peptides (10mer) from a protein sequence
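
The scanning shown in Figure 3 can be sketched as a sliding window. The step of one residue between consecutive peptides is an assumption (the figure only states that the peptides overlap), and the function name is hypothetical.

```python
def overlapping_peptides(protein, length=10, step=1):
    """Return every window of `length` residues, moving `step` residues at a time.

    A protein shorter than `length` yields an empty list.
    """
    protein = protein.strip().upper()
    last_start = len(protein) - length
    return [protein[i:i + length] for i in range(0, last_start + 1, step)]
```

A protein of L residues gives L - 9 overlapping 10mers with a step of one, so the 12-residue sequence "ACDEFGHIKLMN" yields three peptides.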











Input Features for SVMlight and WEKA

Amino acid composition and dipeptide composition were used as input features for SVM and WEKA based model development. The amino acid composition is the percentage of each amino acid in a peptide; it converts a peptide sequence into a vector of 20 dimensions. The dipeptide composition is the percentage of each pair of adjacent amino acids in a peptide. It converts a peptide sequence into a vector of 400 dimensions and captures the properties of neighboring amino acids. The tripeptide composition is the percentage of each run of three adjacent amino acids in a peptide. It converts a peptide sequence into a vector of 8000 dimensions and likewise captures the properties of neighboring amino acids.

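\[ \text{Amino acid } (i) \text{ composition (\%)} = \frac{\text{Total number of amino acid } (i) \text{ in peptide sequence}}{\text{Total number of amino acids in peptide sequence}} \times 100 \]

\[ \text{Dipeptide } (i, i+1) \text{ composition (\%)} = \frac{\text{Total number of dipeptide } (i, i+1) \text{ in peptide sequence}}{\text{Total number of all possible dipeptides in peptide sequence}} \times 100 \]

\[ \text{Tripeptide } (i, i+1, i+2) \text{ composition (\%)} = \frac{\text{Total number of tripeptide } (i, i+1, i+2) \text{ in peptide sequence}}{\text{Total number of all possible tripeptides in peptide sequence}} \times 100 \]

Where (i) is any amino acid, and (i+1) and (i+2) are the residues that immediately follow it in the dipeptide or tripeptide.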
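
The three compositions differ only in window size, so they can be sketched with one function. This is an illustrative sketch, not the in-house Perl script: the function name is hypothetical, and the alphabetical ordering of the one-letter codes is an assumption about how the vector is laid out.

```python
from itertools import product

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(peptide, k):
    """Percentage composition of all 20**k k-mers, returned in a fixed order.

    k=1 gives amino acid (20 values), k=2 dipeptide (400 values) and
    k=3 tripeptide (8000 values) composition.
    """
    windows = [peptide[i:i + k] for i in range(len(peptide) - k + 1)]
    if not windows:
        raise ValueError("peptide is shorter than k")
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(keys, 0)
    for window in windows:
        if window not in counts:
            raise ValueError(f"non-standard residue in {window!r}")
        counts[window] += 1
    # The denominator is the number of k-mers actually present in the peptide.
    return [100.0 * counts[key] / len(windows) for key in keys]
```

A 10mer contains 10 amino acids, 9 dipeptides and 8 tripeptides, so those are the denominators used for the three feature types.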

Binary Pattern: Each amino acid of a peptide is represented by a binary pattern of length 20, in which a 1 marks the amino acid present at that position and 0s mark the absence of the other 19 amino acids (e.g. Ala is represented by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0). Thus, a peptide of length N is represented by a vector of dimension N × 20.
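
A minimal sketch of this encoding is given below. Placing Ala first matches the example above; the order of the remaining 19 residues (alphabetical by one-letter code) is an assumption, and the function name is hypothetical.

```python
# Ala first, as in the example above; the rest of the ordering is assumed.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_pattern(peptide):
    """Encode a peptide of length N as a flat vector of N x 20 zeros and ones."""
    vector = []
    for residue in peptide:
        row = [0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(residue)] = 1  # ValueError for non-standard residues
        vector.extend(row)
    return vector
```

A 10mer therefore becomes a 200-dimensional vector and a 16mer a 320-dimensional vector, each containing exactly one 1 per residue.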

SVM

The SVM was implemented using the freely downloadable software package SVMlight, written by Joachims (Joachims 1999). The software lets the user set a number of parameters and choose among several built-in kernel functions, including a radial basis function (RBF) kernel and a polynomial kernel.

WEKA

In this study, the IBk and DecisionTable algorithms implemented in the WEKA package were also used to predict the half-life of 10mer and 16mer peptides, respectively.



Evaluation of SVM and WEKA based Models Performance

The performance of the models constructed in this study was evaluated using 5-fold cross-validation. The relevant dataset was partitioned randomly into five equally sized sets. Training and testing were carried out five times, each time using one distinct set for testing and the remaining four sets for training. The performance of the models was computed using the formulae for Pearson's correlation coefficient (R), coefficient of determination (R2) and mean absolute error (MAE) given below:
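
In standard notation, and assuming the conventional definitions of these three measures, the formulae can be written as follows. The form given for R2 is an assumption inferred from the description of SD.

```latex
R = \frac{N \sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}
         {\sqrt{\left[ N \sum_{i=1}^{N} x_i^2 - \left( \sum_{i=1}^{N} x_i \right)^2 \right]
                \left[ N \sum_{i=1}^{N} y_i^2 - \left( \sum_{i=1}^{N} y_i \right)^2 \right]}}

R^2 = 1 - \frac{\sum_{i=1}^{N} (x_i - y_i)^2}{SD}

MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - x_i \right|
```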











Where yi and xi represent the predicted and actual half-life values for the ith peptide, and N is the total number of peptides. SD is the sum of the squared deviations between the activities of the test set and the mean activity of the training peptides.
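
The three measures can be computed from paired actual and predicted values as in the sketch below. It assumes the conventional definitions of R and MAE and takes R2 as one minus the ratio of the squared prediction error to SD; the function names and the training_mean argument are hypothetical, and this is not the in-house Perl script.

```python
import math


def pearson_r(actual, predicted):
    """Pearson's correlation coefficient between actual (x) and predicted (y)."""
    n = len(actual)
    sum_x, sum_y = sum(actual), sum(predicted)
    sum_xy = sum(x * y for x, y in zip(actual, predicted))
    sum_xx = sum(x * x for x in actual)
    sum_yy = sum(y * y for y in predicted)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


def r_squared(actual, predicted, training_mean):
    """1 - (squared prediction error) / SD, with SD measured around the training mean."""
    squared_error = sum((x - y) ** 2 for x, y in zip(actual, predicted))
    sd = sum((x - training_mean) ** 2 for x in actual)
    return 1 - squared_error / sd


def mean_absolute_error(actual, predicted):
    """Average absolute difference between predicted and actual values."""
    return sum(abs(y - x) for x, y in zip(actual, predicted)) / len(actual)
```

Under this definition R2 is not simply the square of R, and it can be negative when the predictions are worse than using the training mean.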





Results Summary

In conclusion, SVM based models developed on the 10mer dataset achieved a maximum R/R2 of 0.57/0.32, 0.68/0.46 and 0.69/0.47 using amino acid, dipeptide and tripeptide composition, respectively. The models developed on the 16mer dataset showed a maximum R/R2 of 0.91/0.82, 0.90/0.39 and 0.90/0.31 using amino acid, dipeptide and tripeptide composition, respectively. Furthermore, models developed with WEKA on selected dipeptides achieved a correlation (R) of 0.70 and 0.98 on the 10mer and 16mer datasets, respectively.







Copyright © 2014 Institute of Microbial Technology, Chandigarh, India.    All rights reserved, designed by Bioinformatics Centre