AlgPred: PREDICTION OF ALLERGENIC PROTEINS AND MAPPING OF IgE EPITOPES

Sudipto Saha and G. P. S. Raghava*

Institute of Microbial Technology, Sector-39A, Chandigarh, India

Running Title: Allergens prediction

Address correspondence to: Dr. G. P. S. Raghava, Scientist, Bioinformatics Centre, Institute of Microbial Technology, Sector 39A, Chandigarh, INDIA. Phone: +91-11-26907444; Fax: +91-172-2690632; E-mail: raghava@iiitd.ac.in

Available at: http://webs.iiitd.edu.in/raghava/algpred

Supplemental data

Dataset
The protein sequence sets used in this study are publicly available at http://www.slv.se/templatesSLV/SLV_Page____9343.asp (Bjorklund et al., 2005). The dataset contains 578 experimental allergen and 700 non-allergen protein sequences derived from food. IgE epitopes were obtained from the SDAP and Bcipep databases.

The performance of SVM module based on dipeptide composition

The following table shows the performance of the SVM-based method using dipeptide composition. An RBF kernel was used with the parameters g=100, c=1 and j=1.

Table S1: Performance of SVM based method using dipeptide composition

Threshold	Sensitivity	Specificity	Accuracy	PPV	NPV	MCC
1.0	0.0957	1.0000	0.5922	1.0000	0.5742	0.2346
0.8	0.1791	1.0000	0.6298	1.0000	0.5978	0.3275
0.6	0.2696	0.9886	0.6643	0.9509	0.6229	0.3852
0.4	0.3913	0.9686	0.7082	0.9109	0.6602	0.4538
0.2	0.5461	0.9429	0.7639	0.8870	0.7174	0.5441
0.0	0.7043	0.9100	0.8173	0.8654	0.7903	0.6353
-0.2	0.8278	0.8500	0.8400	0.8193	0.8586	0.6786
-0.4	0.8922	0.7743	0.8275	0.7645	0.8988	0.6657
-0.6	0.9374	0.6771	0.7945	0.7046	0.9312	0.6259
-0.8	0.9600	0.5657	0.7435	0.6449	0.9474	0.5589
-1.0	0.9757	0.4471	0.6855	0.5918	0.9601	0.4840
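
The dipeptide composition used as SVM input above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the usual definition (the fraction of each of the 400 possible residue pairs among all overlapping pairs in the sequence), and the names `AMINO_ACIDS`, `DIPEPTIDES` and `dipeptide_composition` are hypothetical. The source does not state whether values were expressed as fractions or percentages, or how non-standard residues were handled; here such pairs are simply skipped.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order (hypothetical helper name).
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element list of dipeptide fractions for one protein."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Toy sequence, invented for illustration only.
features = dipeptide_composition("ACACA")
print(len(features), features[DIPEPTIDES.index("AC")])
```

Each protein is thus mapped to a fixed-length vector of 400 values regardless of its length, which is what allows sequences of different sizes to be fed to the same SVM.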

The performance of Hybrid approach using SVM based on dipeptide composition and IgE based approach

The following table shows the performance of the hybrid approach, which combines the SVM module based on dipeptide composition with the IgE epitope-based approach (PID865). A protein is assigned as an allergen if it is predicted to be an allergen by the IgE method (PID865) and also has an SVM score ≥ -0.5. When a protein has no similarity to known IgE epitopes, it is assigned as allergen or non-allergen using the SVM approach alone. The table shows that, at higher thresholds, the hybrid approach achieves better sensitivity than the SVM alone without losing much specificity (for example, at a threshold of 1.0, sensitivity rises from 0.0957 in Table S1 to 0.2296, while specificity falls only from 1.0000 to 0.9943).

Table S2: The performance of hybrid approach, which combines SVM based approach using dipeptide composition and IgE epitope based approach (PID865).

Threshold	Sensitivity	Specificity	Accuracy	PPV	NPV	MCC
1.0	0.2296	0.9943	0.6494	0.9706	0.6116	0.3613
0.8	0.3043	0.9943	0.6831	0.9777	0.6356	0.4283
0.6	0.3878	0.9829	0.7145	0.9489	0.6622	0.4764
0.4	0.4922	0.9643	0.7514	0.9188	0.6988	0.5314
0.2	0.6104	0.9386	0.7906	0.8909	0.7466	0.5922
0.0	0.7322	0.9057	0.8275	0.8645	0.8056	0.6544
-0.2	0.8365	0.8457	0.8416	0.8166	0.8642	0.6823
-0.4	0.8957	0.7743	0.8290	0.7652	0.9018	0.6693
-0.6	0.9374	0.6771	0.7945	0.7046	0.9312	0.6259
-0.8	0.9600	0.5657	0.7435	0.6449	0.9474	0.5589
-1.0	0.9757	0.4471	0.6855	0.5918	0.9601	0.4840
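
The hybrid decision rule described before Table S2 can be sketched as below. This is one reading of the rule, not the authors' implementation: the function name `hybrid_predict` and the constant `IGE_SVM_CUTOFF` are hypothetical, and the handling of a protein that has an IgE epitope hit but an SVM score below -0.5 (here it falls back to the SVM threshold) is an assumption, since the text does not spell that case out.

```python
IGE_SVM_CUTOFF = -0.5  # SVM cutoff stated in the text for proteins with an IgE epitope hit


def hybrid_predict(svm_score, has_ige_hit, threshold):
    """Return True if the protein is called an allergen.

    svm_score:   SVM decision value for the protein
    has_ige_hit: True if the IgE epitope method (PID865) predicts an allergen
    threshold:   SVM threshold, as in the first column of the table
    """
    if has_ige_hit and svm_score >= IGE_SVM_CUTOFF:
        return True
    # Assumption: in every other case the SVM alone decides.
    return svm_score >= threshold


# Toy values, invented for illustration only.
print(hybrid_predict(-0.3, True, 0.4))   # epitope hit rescues a low SVM score
print(hybrid_predict(-0.3, False, 0.4))  # no epitope hit, SVM alone decides
```

Under this reading the hybrid rule reduces to the SVM alone whenever the threshold is at or below -0.5, which agrees with the identical rows for thresholds -0.6, -0.8 and -1.0 in Tables S1 and S2.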

The performance of hybrid approach, which combines SVM based approach using amino acid composition and motif based approach (MEME ev 0.1).

The following table shows the performance of the hybrid approach that combines the SVM-based approach using amino acid composition with the motif-based approach (MEME ev 0.1). No improvement was observed.

Table S3: The performance of hybrid approach, which combines SVM based approach using amino acid composition and motif based approach (MEME ev 0.1).

Threshold	Sensitivity	Specificity	Accuracy	PPV	NPV	MCC
1.0	0.4504	0.8986	0.6965	0.7848	0.6663	0.3973
0.8	0.5270	0.8871	0.7247	0.7932	0.6962	0.4507
0.6	0.6087	0.8757	0.7553	0.8009	0.7324	0.5088
0.4	0.6661	0.8571	0.7710	0.7930	0.7585	0.5378
0.2	0.7235	0.8443	0.7898	0.7924	0.7891	0.5752
0.0	0.7948	0.8186	0.8078	0.7825	0.8304	0.6139
-0.2	0.8748	0.7900	0.8282	0.7738	0.8862	0.6632
-0.4	0.9130	0.7557	0.8267	0.7543	0.9152	0.6699
-0.6	0.9252	0.7000	0.8016	0.7170	0.9211	0.6324
-0.8	0.9565	0.6414	0.7835	0.6866	0.9493	0.6175
-1.0	0.9670	0.5657	0.7467	0.6465	0.9565	0.5677
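
The columns reported in Tables S1 to S3 can be computed from a confusion matrix at each threshold, as in the following sketch. It uses the standard definitions of sensitivity, specificity, accuracy, PPV, NPV and MCC, which the source does not write out, and it assumes that a protein is called an allergen when its score is at or above the threshold (consistent with sensitivity rising as the threshold is lowered). The function name `evaluate` and the toy scores are hypothetical and are not taken from the study.

```python
import math


def evaluate(scores, labels, threshold):
    """Compute threshold-dependent metrics for binary predictions.

    scores: decision values; labels: 1 for allergen, 0 for non-allergen.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_allergen = score >= threshold  # assumed decision rule
        if predicted_allergen and label == 1:
            tp += 1
        elif predicted_allergen and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Toy scores and labels, invented purely for illustration.
scores = [0.9, 0.3, -0.1, -0.7, 0.5, -0.2, -0.6, -0.9]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
for threshold in (0.0, -0.65):
    print(threshold, evaluate(scores, labels, threshold))
```

Lowering the threshold moves predictions from the non-allergen to the allergen class, so sensitivity and NPV tend to rise while specificity and PPV tend to fall, which is the trend seen down each table.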

The performance of various methods on Independent dataset

The following table shows the performance of various approaches on an independent dataset of 664 allergens obtained from Li et al., 2004, and on a dataset of 323 proteins, which excludes all proteins found in Bjorklund et al., 2005. Here, SVMc = SVM based on amino acid composition; SVMd = SVM based on dipeptide composition; * = PID865.

Table S4: The performance of various approaches on independent dataset

Methods	Sensitivity or percent coverage (664 allergens)	Sensitivity or percent coverage (323 allergens)
SVMc	88.25	84.21
SVMd	89.46	84.82
IgE epitope*	14.78	10.83
Mast	12.65	12.38
ARPs BLAST (e-value 0.001)	83.58	66.56
SVMc + IgE epitope (PID865)	88.86	84.83
SVMd + IgE epitope	90.06	85.14
SVMc + IgE epitope + Mast	89.61	85.76
SVMd + IgE epitope + Mast	90.66	85.76
SVMc + IgE epitope + Mast + BLAST	96.84	93.5
SVMd + IgE epitope + Mast + BLAST	96.08	92.26

Fig. S1 ROC plot of SVM amino acid and combined approach

Fig. S2 ROC plot of SVM dipeptide and combined approach

Development of IgE epitope prediction method

A total of 183 epitopes were collected from SDAP and Bcipep. We examined the length and frequency of all epitopes and found that epitope length varied from 4 to 22 residues. The most frequent length was 10 residues, with 67 epitopes. We therefore tried to develop a method based on these 67 epitopes and obtained 67 random 10-mer peptides from the non-allergen database (obtained from consumed commodities such as rice, apple, milk, etc.).
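
Drawing random 10-mer peptides from non-allergen proteins could look like the sketch below. The source does not say how the windows were chosen, so the details here are assumptions: a protein is picked uniformly at random, then a start position is picked uniformly within it, and duplicates are not filtered. The function name `sample_random_peptides` and the toy sequences are hypothetical.

```python
import random


def sample_random_peptides(proteins, n_peptides=67, length=10, seed=0):
    """Draw fixed-length peptides at random positions from the given proteins."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    eligible = [p for p in proteins if len(p) >= length]
    if not eligible:
        raise ValueError("no protein is long enough to yield a peptide")
    peptides = []
    for _ in range(n_peptides):
        protein = rng.choice(eligible)
        start = rng.randrange(len(protein) - length + 1)
        peptides.append(protein[start:start + length])
    return peptides


# Toy sequences, invented for illustration only.
toy_proteins = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS",
    "SHORT",  # too short for a 10-mer, so it is ignored
]
negatives = sample_random_peptides(toy_proteins)
print(len(negatives), negatives[0])
```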

We tried an SVM method based on amino acid composition and sequence information, and used five-fold cross-validation (in each round, four sets were used for training and one for testing).
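
The five-fold cross-validation scheme can be sketched as follows. This is a generic illustration rather than the authors' procedure: the function name `five_fold_splits`, the random shuffling and the fixed seed are assumptions, and the item count of 134 simply reflects the 67 epitopes plus 67 random peptides mentioned above.

```python
import random


def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random partition
    # Deal the shuffled indices into n_folds nearly equal sets.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# 67 epitopes + 67 random peptides = 134 examples.
splits = list(five_fold_splits(134))
for train, test in splits:
    print(len(train), len(test))
```

Every example is used for testing exactly once and for training in the other four rounds, so the reported performance is an average over five held-out sets.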