CTLPred: An SVM & ANN Based CTL Epitope Prediction Method

Introduction:-

The identification of peptides that can stimulate cytotoxic T lymphocytes (CTLs) is one of the major challenges in subunit vaccine design. Most existing epitope prediction methods are based on the identification of MHC-binding peptides, but not all MHC binders act as T cell epitopes. There is therefore a need for a highly accurate method that predicts CTL epitopes directly rather than MHC binders. Here, two machine learning techniques, artificial neural networks (ANN) and support vector machines (SVM), were applied to recent, high-quality data on CTL epitopes and non-epitopes to develop such a method. First, an ANN-based method was developed to classify a large dataset of CTL epitopes (1137) and non-epitopes (1134); a feed-forward backpropagation network achieved 72.2% accuracy. Next, an SVM model was generated by applying SVM_LIGHT to the same dataset. The optimized SVM classified the data with 75.2% accuracy, which is better than the ANN. The accuracy of both methods was estimated through leave-one-out cross-validation (LOOCV) at a cutoff score where sensitivity and specificity are nearly equal. Finally, the two prediction methods were combined to exploit their full potential in classifying the data. The consensus and combined predictions improved specificity and sensitivity, respectively, along with accuracy. The best accuracies obtained by the consensus and combined prediction approaches are 77.6% and 75.8%, respectively, each higher than that of either individual method. The overall architecture of the prediction method is shown in the figure below.

The overall architecture of CTLPred.
The method is divided into three parts:
1) Data extraction and preprocessing.
2) Training and testing of methods.
3) Consensus and combined prediction approaches.
Here E means epitopes, NE means non-epitopes, LOOCV means leave-one-out cross-validation, SNNS means Stuttgart Neural Network Simulator, SVM means support vector machine, PE means predicted epitopes, PNE means predicted non-epitopes, Cnp means consensus prediction and Cbp means combined prediction.

Source of data-: The accuracy of a prediction method depends on the quality and quantity of its data. All data on CTL epitopes and non-epitopes were obtained from MHCBN version 1.1, a comprehensive database of MHC binding and non-binding peptides. The database consists of nearly 19500 MHC binders and non-binders and 7000 epitopes.

Artificial Neural Network-: Artificial neural networks (ANNs) are crude electronic models based on the structure of the brain. A network can consist of anywhere from a few to a few billion neurons connected in many different ways. ANNs attempt to model these biological structures in both architecture and operation. The basic computational element (model neuron) of a neural network is often called a node or unit. It receives input from other units, or perhaps from an external source. Each input has an associated weight w, which can be modified. The unit computes some function f of the weighted sum of its inputs.

Its output, in turn, can serve as input to other units. The weighted sum is called the net input to unit i, often written net_i; that is, net_i is the sum over j of w_ij * y_j, where y_j is the output of unit j. Note that w_ij refers to the weight from unit j to unit i (not the other way around). The function f is the unit's activation function. In the simplest case, f is the identity function, and the unit's output is just its net input. This is called a linear unit.
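
The unit described above can be illustrated with a minimal Python sketch. It computes the net input as a weighted sum and applies an activation function f. The function names, the sigmoid alternative and the example numbers are illustrative assumptions and are not taken from CTLPred.

```python
import math

def net_input(weights, inputs):
    """Weighted sum of the inputs to one unit: net_i = sum_j w_ij * y_j."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * y for w, y in zip(weights, inputs))

def identity(x):
    """Activation of a linear unit: the output equals the net input."""
    return x

def sigmoid(x):
    """A common non-linear activation, shown only for contrast."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs, activation=identity):
    """Output of a unit: the activation function applied to the net input."""
    return activation(net_input(weights, inputs))

# Hypothetical weights w_ij into unit i and outputs y_j of the sending units.
w_i = [0.5, -1.0, 2.0]
y = [1.0, 2.0, 0.5]

linear_out = unit_output(w_i, y)              # linear unit: output = net input
squashed_out = unit_output(w_i, y, sigmoid)   # same net input, sigmoid unit
```

With the identity activation the unit simply returns its net input; swapping in another activation changes only the last step.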

Basic Architecture-: The basic architecture of an artificial neural network is shown in the figure below.

There are many types of networks, ranging from simple networks (perceptrons) to complex self-organising networks (Kohonen networks). Similarly, neural networks use many different kinds of learning rules, the most common being the delta rule. The delta rule is often used by the most common class of ANNs, called 'backpropagational neural networks' (BPNNs). Backpropagation is an abbreviation for the backwards propagation of error. With the delta rule, as with other types of backpropagation, 'learning' is a supervised process that occurs with each cycle or 'epoch' (i.e. each time the network is presented with a new input pattern) through a forward activation flow of outputs and the backwards error propagation of weight adjustments. More simply, when a neural network is initially presented with a pattern, it makes a random 'guess' as to what it might be. It then sees how far its answer was from the actual one and makes an appropriate adjustment to its connection weights. The network architecture most commonly trained in this way is the feed-forward neural network.
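
The guess-and-adjust cycle described above can be sketched with the delta rule for a single linear unit. This is a simplified illustration only: it is not the multi-layer backpropagation network used in CTLPred, and the toy patterns, learning rate and function names are assumptions.

```python
def predict(weights, x):
    """Linear unit: the output is the weighted sum of the inputs."""
    return sum(w * xi for w, xi in zip(weights, x))

def delta_rule_epoch(weights, patterns, learning_rate):
    """One pass over all patterns, adjusting the weights after each pattern.

    Delta rule: w_j <- w_j + learning_rate * (target - output) * x_j
    Returns the updated weights and the summed squared error of the pass.
    """
    total_error = 0.0
    for x, target in patterns:
        error = target - predict(weights, x)       # forward pass, then error
        total_error += error ** 2
        weights = [w + learning_rate * error * xi  # weight adjustment
                   for w, xi in zip(weights, x)]
    return weights, total_error

# Hypothetical toy patterns generated by target = 2*x1 - 1*x2.
patterns = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
            ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]

weights = [0.0, 0.0]   # arbitrary starting "guess"
errors = []
for _ in range(200):
    weights, err = delta_rule_epoch(weights, patterns, learning_rate=0.1)
    errors.append(err)
```

On this toy data the summed squared error shrinks from pass to pass and the weights approach the values (2, -1) used to generate the targets.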

Support Vector Machine-: The support vector machine (SVM) has become an increasingly popular tool for the classification of linear or non-linear data. SVMs have recently been used for the classification of microarray data. The approach is systematic, reproducible and motivated by statistical learning theory. SVM-based methods are also known as kernel-based methods, as they use kernel substitution to classify the data. To understand the power and elegance of the SVM approach, one must grasp three key ideas: margins, duality and kernels. Detailed coverage of these concepts can be found in the books "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods" and "Statistical Learning Theory".
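
Kernel substitution can be illustrated with the dual-form SVM decision function, in which the dot product between a support vector and a query point is replaced by a kernel. The Python sketch below is not SVM_LIGHT and performs no training: the support vectors, coefficients (alphas), bias and gamma value are hand-picked, hypothetical values chosen only to show the mechanics.

```python
import math

def linear_kernel(u, v):
    """Ordinary dot product: gives a linear decision boundary."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, labels, alphas, bias, kernel):
    """Dual-form SVM score: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return bias + sum(a * y * kernel(sv, x)
                      for sv, y, a in zip(support_vectors, labels, alphas))

def classify(x, support_vectors, labels, alphas, bias, kernel):
    """Label +1 if the score is non-negative, otherwise -1."""
    score = decision_value(x, support_vectors, labels, alphas, bias, kernel)
    return 1 if score >= 0 else -1

# Hypothetical XOR-style points: not separable by a straight line.
support_vectors = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
labels = [-1, -1, 1, 1]
alphas = [1.0, 1.0, 1.0, 1.0]   # hand-picked, not the result of training
bias = 0.0

rbf_calls = [classify(p, support_vectors, labels, alphas, bias, rbf_kernel)
             for p in support_vectors]
linear_calls = [classify(p, support_vectors, labels, alphas, bias, linear_kernel)
                for p in support_vectors]
```

With the RBF kernel these coefficients reproduce the four labels, whereas with the plain dot product the same coefficients give a score of zero at every point. Only the kernel function changes between the two calls.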

Support Vector Machine::- http://www.support-vector.net/

SVM-Light Support Vector Machine::- http://svmlight.joachims.org/

Kernel based methods::- http://kernel-machines.org

LIBSVM::- http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Advances in Kernel Methods::- http://kernel-machines.org/nips97/book.html

SVM links::- http://vision.ai.uiuc.edu/mhyang/svm.html

Results of Prediction-: Evaluating the accuracy of a prediction method is necessary to estimate its performance. The leave-one-out cross-validation (LOOCV) procedure was employed to estimate the accuracy of the prediction methods. In LOOCV, one peptide is removed from the training data, the method is trained on the remaining data, and it is then tested on the removed peptide. It is regarded as the most accurate way to estimate the performance of a method when the training data are limited. The performance of the method is evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and accuracy. The results were obtained by averaging over all test subsets, whose number equals the number of examples in the training data.
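
The LOOCV procedure and the five performance measures can be sketched as follows. The nearest-centroid classifier and the one-dimensional toy data are stand-ins chosen so that the example is self-contained; they are assumptions and do not represent the ANN or SVM models of CTLPred.

```python
def leave_one_out(samples, labels, train, predict):
    """Hold out each example once, train on the rest, predict the held-out one."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions

def ratio(numerator, denominator):
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def performance(actual, predicted):
    """Sensitivity, specificity, PPV, NPV and accuracy from 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

# Stand-in classifier (an assumption): assign the class with the nearer mean.
def train_centroids(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def predict_centroid(model, x):
    pos_mean, neg_mean = model
    return 1 if abs(x - pos_mean) <= abs(x - neg_mean) else 0

# Hypothetical one-dimensional features: 1 = epitope, 0 = non-epitope.
features = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

predictions = leave_one_out(features, labels, train_centroids, predict_centroid)
metrics = performance(labels, predictions)
```

Each example is predicted by a model that never saw it, so the number of train/test rounds equals the number of examples.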
The trained network achieved 72.2% accuracy at a cutoff score where sensitivity and specificity were nearly equal. This cutoff score was taken as the default cutoff score. The SVM was then applied to the same training data. The two techniques were applied both to determine which of them classifies the data better and to develop a method based on a combination of these two machine-learning techniques. The SVM achieved 75.2% accuracy at a cutoff score where sensitivity and specificity were 73.8% and 77%, respectively. The SVM model thus classified the data with ~3% higher accuracy than the ANN. These results indicate that, on this dataset, the SVM classifies the data more accurately than the ANN.
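
Choosing a cutoff score at which sensitivity and specificity are nearly equal can be sketched as a scan over candidate cutoffs. The scores, labels and candidate list below are hypothetical, and the selection rule (smallest absolute difference between the two measures) is one reasonable reading of the criterion, not necessarily the exact procedure used for CTLPred.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Scores >= cutoff are called epitopes (1); lower scores non-epitopes (0).

    Assumes both classes are present in `labels`.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in pairs if y == 1 and s < cutoff)
    tn = sum(1 for s, y in pairs if y == 0 and s < cutoff)
    fp = sum(1 for s, y in pairs if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def balanced_cutoff(scores, labels, candidates):
    """Candidate cutoff with the smallest |sensitivity - specificity|."""
    def imbalance(cutoff):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        return abs(sens - spec)
    return min(candidates, key=imbalance)

# Hypothetical classifier scores: 1 = epitope, 0 = non-epitope.
scores = [0.95, 0.85, 0.70, 0.55, 0.40, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

cutoff = balanced_cutoff(scores, labels, candidates)
sens, spec = sensitivity_specificity(scores, labels, cutoff)
```

Raising the cutoff trades sensitivity for specificity, so the balanced point sits where the two measures cross.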
The SVM- and ANN-based prediction methods were then combined to establish the upper limits of sensitivity, specificity and accuracy. In consensus prediction, the accuracy achieved was 77.6%, which is about 5 percentage points higher than the ANN prediction (72.2%) and ~2 points higher than the SVM-based prediction (75.2%). The specificity of the consensus prediction approach is nearly 7% higher than the specificity of ANN and SVM at a cutoff score where sensitivity and specificity are nearly equal. In the combined method, the accuracy of prediction was ~3% higher than the ANN prediction and marginally higher than the SVM-based prediction.
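
The two ways of merging the ANN and SVM outputs can be sketched as below. The page does not spell out the rules, so this sketch assumes that consensus prediction calls a peptide an epitope only when both methods do, and that combined prediction calls it an epitope when either method does. That reading is consistent with the reported gains in specificity and sensitivity respectively, but it remains an assumption, as do the toy calls.

```python
def consensus_prediction(ann_calls, svm_calls):
    """Assumed rule: epitope only if BOTH methods predict an epitope."""
    return [a and s for a, s in zip(ann_calls, svm_calls)]

def combined_prediction(ann_calls, svm_calls):
    """Assumed rule: epitope if EITHER method predicts an epitope."""
    return [a or s for a, s in zip(ann_calls, svm_calls)]

def sensitivity(actual, predicted):
    """Fraction of true epitopes that were predicted as epitopes."""
    hits = [p for a, p in zip(actual, predicted) if a]
    return sum(hits) / len(hits)

def specificity(actual, predicted):
    """Fraction of true non-epitopes that were predicted as non-epitopes."""
    rejections = [not p for a, p in zip(actual, predicted) if not a]
    return sum(rejections) / len(rejections)

# Hypothetical calls for four epitopes followed by four non-epitopes.
actual = [True, True, True, True, False, False, False, False]
ann = [True, True, False, True, False, True, False, False]
svm = [True, False, True, True, False, False, True, False]

consensus = consensus_prediction(ann, svm)
combined = combined_prediction(ann, svm)

summary = {
    "ANN": (sensitivity(actual, ann), specificity(actual, ann)),
    "SVM": (sensitivity(actual, svm), specificity(actual, svm)),
    "consensus": (sensitivity(actual, consensus), specificity(actual, consensus)),
    "combined": (sensitivity(actual, combined), specificity(actual, combined)),
}
```

Under these assumed rules the consensus call can never have lower specificity than either method alone, and the combined call can never have lower sensitivity.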

Comparison with other Prediction methods-: For comparison, we implemented two T cell epitope prediction methods, AMPHI and EpiMer, on our dataset. The prediction accuracy of the AMPHI method was 53% at a cutoff score where sensitivity and specificity are nearly equal, which is little better than random prediction. EpiMer classified the data with 62% accuracy. This demonstrates that our SVM- and ANN-based prediction methods are better at CTL epitope prediction than these previously developed methods.

References-:
Adams, H.P. and Koziol, J.A. (1995) Prediction of binding to MHC class I molecules. J. Immunol. Methods, 185, 181-90.

Bhasin, M. and Raghava, G.P.S. (2003) A Hybrid Approach for the Prediction of MHC Class I Restricted T cell epitopes (submitted).

Bhasin, M., Singh, H. and Raghava, G.P.S. (2003) MHCBN: A comprehensive database of MHC binding and non-binding peptides. Bioinformatics, 19, 666-667.

Brown, M., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares, M. Jr. and Haussler, D. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97, 262-267.

Brunak, S. and Buus, S. (2000) Identifying cytotoxic T cell epitopes from genomic and proteomic information: "The human MHC project". Rev. Immunogenet., 2, 477-91.

Brusic, V., Rudy, G. and Harrison, L.C. (1994) Prediction of MHC binding peptides by using artificial neural networks. In: Complex mechanism of adaptation, 253-260, IOS Press, Amsterdam.

Buus, S. (1999) Description and prediction of peptide-MHC binding: the 'human MHC project'. Curr. Opin. Immunol., 11, 209-13. Review.

Cornette, J.L., Margalit, H., Delisi, C. and Berzofsky, J.A. (1993) The amphipathic helix as a structural feature involved in T cell recognition. In: The Amphipathic Helix (Ed. Epand, R.M.). CRC Press, Boca Raton.

Cristianini, N. and Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, The Edinburgh Building, Cambridge, CB2 2RU, UK.

De Groot, A.S., Sbai, H., Aubin, C.S., McMurry, J. and Martin, W. (2002) Immuno-informatics: Mining genomes for vaccine components. Immunol. Cell Biol., 80, 255-69.

DeLisi, C. and Berzofsky, J.A. (1985) T-cell antigenic sites tend to be amphipathic structures. Proc. Natl. Acad. Sci. USA, 82, 7048-52.

Ding, C. and Dubchak, I. (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349-358.

Donnes, P. and Elofsson, A. (2002) Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics, 3, 25.

Doytchinova, I.A. and Flower, D.R. (2001) Toward the quantitative prediction of T-cell epitopes: CoMFA and CoMSIA studies of peptides with affinity for the class I MHC molecule HLA-A*0201. J. Med. Chem., 44, 3572-81.

Gulukota, K., Sidney, J., Sette, A. and DeLisi, C. (1997) Two complementary methods for predicting peptides binding major histocompatibility complex molecules. J. Mol. Biol., 267, 1258-67.

Hagmann, M. (2000) Computer aided vaccine design. Science, 290, 80-82.

Hammerling, G.J., Vogt, A.B. and Kropshofer, H. (1999) Antigen processing and presentation - towards the millennium. Immunol. Rev., 172, 5-11.

Hertz, J.A., Palmer, R.G. and Krogh, A.S. (1991) Introduction to the theory of neural computation. Addison-Wesley, Redwood City.

Joachims, T. (1999) Making large-scale SVM learning practical. In: Scholkopf, B., Burges, C. and Smola, A. (eds) Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, Massachusetts; London, England.

Long, E.O. and Jacobson, S. (1989) Pathways of viral antigen processing and presentation to CTL: defined by the mode of virus entry? Immunol. Today, 10, 45-8.

Meister, G.E., Roberts, C.G., Berzofsky, J.A. and De Groot, A.S. (1995) Two novel T cell epitope prediction algorithms based on MHC-binding motifs; comparison of predicted and published epitopes from Mycobacterium tuberculosis and HIV protein sequences. Vaccine, 13, 581-91.

Mouritsen, S., Meldal, M., Ruud-Hansen, J. and Werdelin, O. (1991) T-helper-cell determinants in protein antigens are preferentially located in cysteine-rich antigen segments resistant to proteolytic cleavage by cathepsin B, L, and D. Scand. J. Immunol., 34, 421-31.

Parker, K.C., Bednarek, M.A. and Coligan, J.E. (1994) Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. J. Immunol., 152, 163-75.

Rammensee, H.G., Friede, T. and Stevanovic, S. (1995) MHC ligands and peptide motifs: first listing. Immunogenetics, 41, 178-228. Review.

Rothbard, J.B. and Taylor, W.R. (1988) A sequence pattern common to T cell epitopes. EMBO J., 7, 93-100.

Ruppert, J., Sidney, J., Celis, E., Kubo, R.T., Grey, H.M. and Sette, A. (1993) Prominent role of secondary anchor residues in peptide binding to HLA-A2.1 molecules. Cell, 74, 929-37.

Schueler-Furman, O., Altuvia, Y., Sette, A. and Margalit, H. (2000) Structure-based prediction of binding peptides to MHC class I molecules: application to a broad range of MHC alleles. Protein Sci., 9, 1838-46.

Spouge, J.L., Guy, H.R., Cornette, J.L., Margalit, H., Cease, K., Berzofsky, J.A. and Delisi, C. (1987) Strong conformational propensities enhance T cell antigenicity. J. Immunol., 138, 204-212.

Stern, L.J., Brown, J.H., Jardetzky, T.S., Gorga, J.C., Urban, R.G., Strominger, J.L. and Wiley, D.C. (1994) Crystal structure of the human class II MHC protein HLA-DR1 complexed with an influenza virus peptide. Nature, 368, 215-221.

Stille, C.J., Thomas, L.J., Reyes, V.E. and Humphreys, R.E. (1987) Hydrophobic strip of helix algorithm for selection of T cell-presented peptides. Molec. Immunol., 24, 1021-1027.

Vapnik, V.N. (1998) The nature of Statistical Learning Theory. Wiley, New York.

Watts, C. and Powis, S. (1999) Pathways of antigen processing and presentation. Rev. Immunogenet., 1, 60-74. Review.

Zavaljevski, N., Stevens, F.J. and Reifman, J. (2002) Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions.

Zell, A. and Mamier, G. (1997) Stuttgart Neural Network Simulator version 4.2. University of Stuttgart.

Zhao, Y., Gran, B., Pinilla, C., Markovic-Plese, S., Hemmer, B., Tzou, A., Whitney, L.W., Biddison, W.E., Martin, R. and Simon, R. (2001) Combinatorial peptide libraries and biometric score matrices permit the quantitative analysis of specific and degenerate interactions between clonotypic TCR and MHC peptide ligands. J. Immunol., 167, 2130-41.
