MHC2Pred: SVM-based method for prediction of MHC class II binders

MHC2Pred Information

General Information-: Helper T cell epitopes are a subset of MHC class II ligands and play a decisive role in the initiation and maintenance of the immune response. Experimental identification of such ligands is arduous, time-consuming and not economically feasible. Reliable computational methods for their prediction may therefore reduce the cost and the number of wet-lab experiments needed to identify these peptides. Predicting MHC class II binding peptides is more difficult than predicting MHC class I binding peptides because of their variable size: MHC class II binding peptides are 10-40 amino acids long, with a binding core of 9 amino acids that contains the primary anchor residues. Prediction of MHC class II binders therefore requires an additional step that locates the 9-amino-acid binding core within ligands of variable length. Numerous reasonably validated mathematical models for predicting the core and the binding properties of these peptides are available in the literature. The methods for core prediction are based on genetic programming, discriminant analysis and matrix optimization techniques. The methods for binder prediction are based on motifs, quantitative matrices and artificial neural networks. Most of these methods are available for the HLA-DRB1*0401 allele, which is associated with the autoimmune disease rheumatoid arthritis. They are not able to predict peptides binding to many other MHC alleles, or promiscuous MHC binders. In this study, an attempt has been made to develop a highly accurate prediction method for a large number of MHC class II alleles. A matrix optimization technique has been used to detect the binding core of peptides, and a support vector machine (SVM) has then been used to discriminate between binding and non-binding peptides. The overall accuracy of the method is >78%, which is better than that of all existing methods in the literature.

Algorithm for development of prediction method-: The binders and non-binders for all alleles have been obtained from the MHCBN and JenPep databases (Bhasin et al., 2003; Blythe et al., 2002). All peptides with an IC50 value less than 500 nM have been considered binders, and peptides with an IC50 value greater than 500 nM have been considered non-binders. Peptides shorter than 9 amino acids have been removed from the dataset. The 9-amino-acid binding core has been extracted from binders of variable length, without considering MHC binding motifs, using the Matrix Optimization Techniques (MOT) package (Singh and Raghava, unpublished).
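The labelling rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to build MHC2Pred: the function name label_peptides, the (sequence, IC50) record layout and the constant names are assumptions made for the example. The text does not say how a peptide with an IC50 of exactly 500 nM is treated, so the sketch leaves such peptides out.

```python
from typing import List, Tuple

IC50_THRESHOLD_NM = 500.0  # cutoff stated in the text
MIN_LENGTH = 9             # peptides shorter than the binding core are dropped


def label_peptides(records: List[Tuple[str, float]]) -> List[Tuple[str, int]]:
    """Label (sequence, IC50 in nM) records as binders (+1) or non-binders (-1).

    Hypothetical helper: the record layout is an assumption for this sketch.
    """
    labelled = []
    for sequence, ic50_nm in records:
        if len(sequence) < MIN_LENGTH:
            continue  # too short to contain a 9-residue core
        if ic50_nm < IC50_THRESHOLD_NM:
            labelled.append((sequence, +1))
        elif ic50_nm > IC50_THRESHOLD_NM:
            labelled.append((sequence, -1))
        # IC50 exactly at the threshold is not covered by the text; skipped here.
    return labelled
```

Extracting the 9-residue core from each labelled binder is a separate step (the MOT package in the text) and is not reproduced here.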

For development of the MHC binder prediction method, the machine learning technique SVM (Joachims, 1999; Cristianini and Shawe-Taylor, 2000) has been used. The SVM has been trained on a binary encoding of the amino acid sequence: each amino acid of a 9-mer peptide is represented by a 20-dimensional vector, so each peptide of nine amino acids is represented by a vector of 180 dimensions (9 x 20). Binders are labelled +1 and non-binders -1. A suitable kernel for classifying the data has been chosen by conducting experiments with every kernel type, i.e. RBF, polynomial, linear and sigmoid. The kernel parameters and the regularization parameter C have been optimized by systematically varying the parameters and evaluating prediction performance. The overall architecture of the SVM-based method is shown in the figure below.
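The binary input representation can be illustrated with a short Python sketch. The function name encode_9mer and the alphabetical ordering of the 20 amino acids are assumptions made for this example; the text only states that each residue maps to a 20-dimensional vector and each 9-mer to 180 dimensions.

```python
from typing import List

# Ordering of the 20 standard amino acids is an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_9mer(peptide: str) -> List[int]:
    """Return the 180-dimensional binary vector for a 9-residue peptide."""
    if len(peptide) != 9:
        raise ValueError("expected a peptide of exactly 9 residues")
    vector: List[int] = []
    for residue in peptide:
        one_hot = [0] * len(AMINO_ACIDS)
        # str.index raises ValueError for a non-standard residue.
        one_hot[AMINO_ACIDS.index(residue)] = 1
        vector.extend(one_hot)
    return vector
```

Each block of 20 positions contains exactly one 1, so an encoded 9-mer always has nine non-zero entries. Vectors of this form, paired with the +1/-1 labels, are what an SVM package would take as training input.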

Cross-validation and Performance measures-: The main goal of a machine learning approach is to obtain good classification performance on unseen data. Therefore, the performance of the method for all alleles has been evaluated using 5-fold cross-validation (Kaur and Raghava, 2003). In 5-fold cross-validation, the dataset is randomly divided into five equal-sized subsets. The method is trained five times, each time using one distinct subset for testing and the remaining four subsets for training. The final performance of the method is obtained by averaging over the five test sets. The performance of the method has been measured through threshold-dependent parameters: sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV) and accuracy.
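The evaluation procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code behind the reported results: the function names five_fold_splits and performance are hypothetical, and the measures use the standard confusion-matrix definitions (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), PPV = TP/(TP+FP), NPV = TN/(TN+FN), accuracy = (TP+TN)/total), which the text names but does not spell out.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def five_fold_splits(n_samples: int, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the five folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division of the dataset
    # Fold sizes differ by at most one when n_samples is not a multiple of 5.
    folds = [indices[k::5] for k in range(5)]
    for k in range(5):
        test = folds[k]
        train = [i for j in range(5) if j != k for i in folds[j]]
        yield train, test


def performance(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute threshold-dependent measures from +1/-1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)

    def ratio(numerator: int, denominator: int) -> float:
        # Undefined measures (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }
```

In use, a classifier would be trained on each train split, its predictions on the matching test split passed to performance, and the five resulting dictionaries averaged measure by measure.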