CellPPD: Designing of Cell Penetrating Peptides

About Datasets

We took 843 experimentally validated CPPs from the CPPsite database (Gautam et al., 2012) and generated three datasets from these peptides, as follows:
CPPsite-1
This dataset includes 708 CPPs with either high or low cell-penetrating efficiency, taken from the CPPsite database. We removed all peptides containing non-natural or D-amino acids. Since we did not find experimentally proven non-CPPs in the literature, we generated an equal number of peptides randomly from SwissProt proteins and treated them as negatives.
CPPsite-2
The CPPsite database contains peptides with penetrating efficiencies ranging from low to high. This dataset includes 187 highly efficient CPPs. Negative peptides were generated randomly from SwissProt proteins.
CPPsite-3
This dataset includes the same high-efficiency CPPs as CPPsite-2, with low-efficiency CPPs taken as the negative peptides. We developed it because it allows us to discriminate between CPPs with high and low penetration efficiency.
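The random negative peptides used in CPPsite-1 and CPPsite-2 could be generated along the following lines. This is only a minimal sketch: the function name, the toy protein strings and the choice to match the lengths of the positive peptides are assumptions made for illustration, since the text above does not describe how fragment lengths were chosen.

```python
import random

def sample_negative_peptides(proteins, lengths, seed=0):
    """Draw one random fragment per requested length from the given proteins."""
    rng = random.Random(seed)
    negatives = []
    for length in lengths:
        # Only proteins at least as long as the fragment can supply it.
        candidates = [p for p in proteins if len(p) >= length]
        if not candidates:
            raise ValueError(f"no protein is long enough for length {length}")
        protein = rng.choice(candidates)
        start = rng.randrange(len(protein) - length + 1)
        negatives.append(protein[start:start + length])
    return negatives

# Toy stand-ins (hypothetical) for SwissProt sequences and for the
# lengths of the positive CPPs; real data would be read from files.
toy_proteins = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS",
]
positive_lengths = [8, 12, 16]
negatives = sample_negative_peptides(toy_proteins, positive_lengths)
```

Fixing the seed makes the negative set reproducible, which helps when the same dataset has to be regenerated for later comparisons.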


Support Vector Machine

In the present study, the SVM classifier from the freely available SVM_light package was used. The package is powerful as well as user-friendly, and it allows us to adjust the parameters and kernel functions (linear, polynomial, RBF and sigmoid).
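SVM_light reads training and test examples as plain-text lines of the form `<target> <index>:<value> ...`, with feature indices starting at 1 and in increasing order. The helper below is a sketch of how a feature vector might be written in that format; the function name and the file names in the comments are hypothetical, and the four-decimal rounding is an arbitrary choice.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' (1-based indices)."""
    parts = [f"{label:+d}"]
    for index, value in enumerate(features, start=1):
        if value != 0:  # zero-valued features may be omitted in the sparse format
            parts.append(f"{index}:{value:.4f}")
    return " ".join(parts)

line = to_svmlight_line(+1, [0.25, 0.0, 0.5])

# A file of such lines can then be passed to the SVM_light programs, e.g.
#   svm_learn -t 2 train.dat model        (-t: 0 linear, 1 polynomial, 2 RBF, 3 sigmoid)
#   svm_classify test.dat model predictions
# where train.dat, test.dat, model and predictions are placeholder file names.
```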

Evaluation of Performance

A five-fold cross-validation technique was used. The dataset is divided into five sets; four sets are used for training and the remaining one is used for testing, and the process is repeated five times so that each set is used once for testing. The performance of the different SVM modules was evaluated by calculating accuracy and the Matthews correlation coefficient (MCC).
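The evaluation scheme can be sketched as follows. The fold-splitting strategy (shuffle, then deal indices round-robin into five folds) and the function names are assumptions for illustration; the accuracy and MCC formulas are the standard ones computed from the confusion-matrix counts TP, TN, FP and FN, with labels taken to be +1 for CPPs and -1 for non-CPPs.

```python
import math
import random

def five_fold_indices(n_examples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each example is tested exactly once."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

def accuracy_and_mcc(true_labels, predicted_labels):
    """Compute accuracy and MCC for labels coded as +1 (CPP) and -1 (non-CPP)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    accuracy = (tp + tn) / len(pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the confusion matrix is empty;
    # returning 0.0 in that case is a common convention, used here by assumption.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc
```

In a full run, each (train, test) pair would be used to train one SVM model and score the held-out set, and the metrics would then be computed on the pooled or averaged predictions.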

Input features for SVM

In this study, we used various features as SVM input for the prediction of CPPs.

1. Amino Acid Composition: Amino Acid Composition is the fraction of each amino acid present in a peptide. It gives a vector of 20 values, one for each amino acid, which is used as SVM input.
2. Dipeptide Composition: Dipeptide Composition is the fraction of each dipeptide (AA, AC, AD and so on). It captures the composition as well as the local order of the residues present in the peptide. It gives a vector of 20x20 (400) values.
3. Binary Profile Pattern: In a Binary Profile Pattern, each amino acid is represented by a binary vector of 20 values. For a peptide of length n, this gives a vector of nx20 binary values, which is used as SVM input (as shown in the figure below).



4. Physico-chemical Properties: Physico-chemical properties of each amino acid, such as hydrophobicity, hydrophilicity, charge and pI, were used as input features for the prediction of CPPs. We obtained the physico-chemical property values of each amino acid from the AAindex web server and used Perl programs to calculate the physico-chemical properties of each peptide.
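The first three feature types above can be computed directly from a peptide sequence. The following is a minimal sketch, not the original Perl code: the function names and the alphabetical ordering of the 20 amino acids are assumptions, and the physico-chemical features are left out because they depend on AAindex values that are not reproduced here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def amino_acid_composition(peptide):
    """Return 20 values: the fraction of each amino acid in the peptide."""
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]

def dipeptide_composition(peptide):
    """Return 400 values: the fraction of each overlapping dipeptide."""
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    if not pairs:  # a single-residue peptide has no dipeptides
        return [0.0] * len(DIPEPTIDES)
    return [pairs.count(dp) / len(pairs) for dp in DIPEPTIDES]

def binary_profile(peptide):
    """Return n x 20 values: one 20-value one-hot block per residue."""
    profile = []
    for residue in peptide:
        profile.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return profile

aac = amino_acid_composition("ACAC")
dpc = dipeptide_composition("ACAC")
bpp = binary_profile("ACAC")
```

Because the binary profile grows with peptide length, peptides would need a common length (for example by taking a fixed number of terminal residues) before being used as SVM input; that step is not shown here.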

Hybrid Method

We observed that a number of motifs are present in CPPs, so we used this motif information for the prediction of CPPs. CPP motifs were discovered with the MEME software, and query sequences were then searched against the CPP motif list with the MAST software. If a motif hit is found for a peptide, its SVM score is increased by 5, so it will be predicted as a CPP irrespective of the SVM threshold. This approach increases the reliability of our prediction method.
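The score adjustment described above can be sketched as follows. The function names and the default threshold of 0.0 are assumptions; whether a peptide has a motif hit would come from parsing the MAST output, which is not shown. Note that the bonus only guarantees a positive prediction when the chosen threshold is less than 5 above the raw SVM score.

```python
MOTIF_BONUS = 5.0  # score increase for a motif hit, as stated in the text

def hybrid_score(svm_score, has_motif_hit):
    """Add the motif bonus to the SVM score when a CPP motif was found."""
    return svm_score + MOTIF_BONUS if has_motif_hit else svm_score

def predict_cpp(svm_score, has_motif_hit, threshold=0.0):
    """Predict CPP when the hybrid score exceeds the (assumed) threshold."""
    return hybrid_score(svm_score, has_motif_hit) > threshold
```

For example, a peptide with a raw SVM score of -1.5 would be rejected at a threshold of 0.0 without a motif hit, but accepted with one, since its hybrid score becomes 3.5.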