Algorithms used in AHTpin
Datasets
The peptides were derived mainly from the AHTPDB, BIOPEP and ACEpepDB databases. We developed models on separate datasets, grouped according to peptide length, as shown in the figure below.
Descriptors Used
Amino Acid Composition: The percentage content of each amino acid in a peptide/protein, represented by a vector of 20 values (one for each of the 20 natural amino acids) and calculated by the formula
Composition of amino acid (i) = Frequency of amino acid (i) / Length of the protein
where i can be any natural amino acid.
Atomic Composition: Analogous to the amino acid composition, the atomic composition is the proportion of each atom type that constitutes the natural amino acids, namely carbon (C), hydrogen (H), nitrogen (N), oxygen (O) and sulphur (S). For instance, the carbon composition is calculated as the ratio of the number of carbon atoms in a peptide to the total number of atoms in the peptide.
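As an illustration of these two descriptors, the following Python sketch computes both for a peptide given in one-letter code. It is not the AHTpin implementation. The function names are hypothetical, the composition is returned as a fraction (multiply by 100 for a percentage), and the atomic composition assumes that one water molecule is lost per peptide bond, which the description above does not specify.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Atom counts (C, H, N, O, S) of each free amino acid.
FREE_AA_ATOMS = {
    "A": (3, 7, 1, 2, 0),  "R": (6, 14, 4, 2, 0),
    "N": (4, 8, 2, 3, 0),  "D": (4, 7, 1, 4, 0),
    "C": (3, 7, 1, 2, 1),  "Q": (5, 10, 2, 3, 0),
    "E": (5, 9, 1, 4, 0),  "G": (2, 5, 1, 2, 0),
    "H": (6, 9, 3, 2, 0),  "I": (6, 13, 1, 2, 0),
    "L": (6, 13, 1, 2, 0), "K": (6, 14, 2, 2, 0),
    "M": (5, 11, 1, 2, 1), "F": (9, 11, 1, 2, 0),
    "P": (5, 9, 1, 2, 0),  "S": (3, 7, 1, 3, 0),
    "T": (4, 9, 1, 3, 0),  "W": (11, 12, 2, 2, 0),
    "Y": (9, 11, 1, 3, 0), "V": (5, 11, 1, 2, 0),
}


def amino_acid_composition(peptide):
    """Return the 20 residue fractions, ordered as in AMINO_ACIDS."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    counts = Counter(peptide)
    return [counts[aa] / len(peptide) for aa in AMINO_ACIDS]


def atomic_composition(peptide):
    """Return the fraction of C, H, N, O and S atoms in the peptide."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    totals = [0, 0, 0, 0, 0]  # C, H, N, O, S
    for residue in peptide:
        for k, n_atoms in enumerate(FREE_AA_ATOMS[residue]):
            totals[k] += n_atoms
    # Assumption: each peptide bond releases one water molecule (2 H, 1 O).
    bonds = len(peptide) - 1
    totals[1] -= 2 * bonds
    totals[3] -= bonds
    total_atoms = sum(totals)
    return dict(zip("CHNOS", (t / total_atoms for t in totals)))
```

For the tripeptide IPP, `amino_acid_composition("IPP")` gives 1/3 for isoleucine and 2/3 for proline, with all other entries zero. If water loss should be ignored, the two lines that adjust `totals` can be removed.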
Chemical Descriptors: A total of 15,537 descriptors, including 1D, 2D and 3D descriptors and 10 different types of binary fingerprints, were calculated using the PaDEL software. Only a minimal subset of the most important descriptors was used for model building.
Machine Learning Technique
Support vector machines (SVMs) were trained with the selected sequence and atomic features. SVM allows a choice of several parameters and kernels. The SVMlight software package (available at http://svmlight.joachims.org/) was used to construct SVM classifiers and regression models. In this study, we used the radial basis function (RBF) kernel:
$$k(x, y) = \exp\left(-\gamma \lVert x - y \rVert^{2}\right)$$
where x and y are two data vectors, and γ is a training parameter.
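To make the kernel concrete, the sketch below evaluates the RBF kernel for two feature vectors and formats a feature vector as one line of SVMlight input. The function names `rbf_kernel` and `to_svmlight_line` are hypothetical helpers for illustration and are not part of SVMlight.

```python
import math


def rbf_kernel(x, y, gamma):
    """Compute exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def to_svmlight_line(target, features):
    """Format one example as '<target> <index>:<value> ...'.

    SVMlight feature indices start at 1 and must be in increasing order;
    zero-valued features are omitted here.
    """
    pairs = " ".join(
        f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0
    )
    return f"{target:g} {pairs}".rstrip()
```

Identical vectors give a kernel value of 1, and the value decays towards 0 as the vectors move apart, at a rate set by γ. In SVMlight, the RBF kernel is selected with `svm_learn -t 2`, γ is set with `-g`, and regression mode is selected with `-z r`. The parameter values used for AHTpin are not stated here.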
Evaluation of Regression models
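To evaluate the performance of our regression models, we used Pearson's correlation coefficient (R). All models were evaluated using the five-fold cross-validation technique.
$$R = \frac{n\sum_{i=1}^{n} E_i^{act} E_i^{pred} - \sum_{i=1}^{n} E_i^{act} \sum_{i=1}^{n} E_i^{pred}}{\sqrt{\left[n\sum_{i=1}^{n}\left(E_i^{act}\right)^{2} - \left(\sum_{i=1}^{n} E_i^{act}\right)^{2}\right]\left[n\sum_{i=1}^{n}\left(E_i^{pred}\right)^{2} - \left(\sum_{i=1}^{n} E_i^{pred}\right)^{2}\right]}}$$
where n is the size of the test set, and E_i^pred and E_i^act are the predicted and actual IC50 values, respectively.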
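Pearson's R can be computed directly from paired actual and predicted values. The following is a minimal sketch using the standard computational form of the coefficient. The function name `pearson_r` is hypothetical and this is not the AHTpin evaluation script.

```python
import math


def pearson_r(actual, predicted):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two series of equal length with n >= 2")
    sum_a = sum(actual)
    sum_p = sum(predicted)
    sum_ap = sum(a * p for a, p in zip(actual, predicted))
    sum_aa = sum(a * a for a in actual)
    sum_pp = sum(p * p for p in predicted)
    numerator = n * sum_ap - sum_a * sum_p
    denominator = math.sqrt(
        (n * sum_aa - sum_a ** 2) * (n * sum_pp - sum_p ** 2)
    )
    if denominator == 0:
        raise ValueError("R is undefined when either series is constant")
    return numerator / denominator
```

Under five-fold cross-validation, this function would be applied to the predictions made on each held-out fold, or to the pooled predictions from all five folds.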
Evaluation of Classification models
To evaluate the performance of our classification models, we used Matthews correlation coefficient (MCC). All models were evaluated using the five-fold cross-validation technique.
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, FP, TN and FN stand for true positives, false positives, true negatives and false negatives, respectively.
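The MCC can be computed from the four confusion-matrix counts alone. The sketch below uses the standard definition. The function name `mcc` is hypothetical, and returning 0 when the denominator is zero is a common convention assumed here, not something stated for AHTpin.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        # Assumption: treat the undefined case (an empty row or column
        # in the confusion matrix) as no correlation.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction correct).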