Algorithm of CancerDP

Datasets-

The data in our study comprise genomic data and drug sensitivity data, both acquired from the CCLE database. The genomic data include sequencing of 1667 important genes, expression of 17627 genes, and copy number variation of 21217 genes. Drug sensitivity information for 504 cell lines was available in the browse data section of CCLE. It consists of IC50 values (μM) for 24 established anticancer drugs across these 504 cell lines.

Mutation dataset-

The mutations in a gene were provided in IUPAC nomenclature as a mutation annotation file (MAF). Since IUPAC mutation annotations cannot be used directly as machine learning input features, we converted them into a binary format. A mutated gene and a normal (non-mutated) gene were represented as ‘1’ and ‘0’, respectively, and this binary status of every gene was used as machine learning input. For example, a mutated gene G1 and a normal gene G2 were represented as ‘1’ and ‘0’, respectively.
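The binary encoding described above can be sketched in a few lines of Python. This is only an illustration: the function name `encode_mutations`, the gene names, and the idea that a cell line's mutated genes have already been extracted from the MAF into a plain list are assumptions, not part of the CancerDP code.

```python
def encode_mutations(all_genes, mutated_genes):
    """Return a list of 0/1 flags aligned with all_genes.

    all_genes:     ordered list of every gene in the feature set (hypothetical)
    mutated_genes: genes reported as mutated for one cell line (hypothetical)
    """
    mutated = set(mutated_genes)
    # 1 = mutated gene, 0 = normal (non-mutated) gene
    return [1 if gene in mutated else 0 for gene in all_genes]


genes = ["G1", "G2", "G3"]
mutated_in_cell_line = ["G1"]  # assumed to be parsed from the MAF beforehand
vector = encode_mutations(genes, mutated_in_cell_line)
print(vector)  # [1, 0, 0]
```

Keeping the gene order fixed across all cell lines means that position i of every vector always refers to the same gene.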

Variation dataset-

Similar to the mutation data, we explored the variation data, which include general variations found in the human genome.

Expression dataset-

The expression data were taken as RMA-normalized expression values of 17627 genes, as available in the CCLE database for 488 cell lines.

CNV dataset-

The CNV dataset contains normalized data in which each CNV value is given as the log2 of the ratio of the copy number of a gene in the cancer sample to that in the normal sample.
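As a rough illustration of how such a log2 ratio behaves, the sketch below computes it for a few copy numbers. The function name `cnv_log2_ratio` and the default normal copy number of 2 (a diploid reference) are assumptions made for this example; the CCLE values are already normalized and are not recomputed by CancerDP in this way.

```python
import math


def cnv_log2_ratio(cancer_copies, normal_copies=2.0):
    """log2 of (copy number in cancer / copy number in normal).

    normal_copies defaults to 2, an assumed diploid reference.
    """
    if cancer_copies <= 0 or normal_copies <= 0:
        raise ValueError("copy numbers must be positive")
    return math.log2(cancer_copies / normal_copies)


print(cnv_log2_ratio(4))  # 1.0  (gain: twice the normal copy number)
print(cnv_log2_ratio(2))  # 0.0  (no change)
print(cnv_log2_ratio(1))  # -1.0 (loss: half the normal copy number)
```

Positive values therefore indicate amplification, negative values indicate deletion, and a value near zero indicates an unchanged copy number.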

Machine Learning Methods-

We have incorporated different machine learning algorithms such as SVMlight, SMO, KNN, ANN, ISO-Reg, Linear-reg and Grid-search. Our webserver models are based on SVMlight, which performed better than the other algorithms. These models require an input vector of fixed length, i.e. for prediction, all the input features must be present.
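The fixed-length requirement can be illustrated with a small Python sketch that assembles an input vector in a fixed feature order and refuses to proceed when a feature is missing. The function name `build_feature_vector` and the feature names are hypothetical and serve only to show the constraint; they are not taken from the webserver.

```python
def build_feature_vector(feature_order, values):
    """Return feature values as a list in the fixed order the model expects.

    feature_order: feature names in the order used for training (hypothetical)
    values:        dict mapping feature name -> value for one cell line
    """
    missing = [name for name in feature_order if name not in values]
    if missing:
        # A fixed-length model cannot predict from an incomplete vector.
        raise ValueError("missing input features: " + ", ".join(missing))
    return [float(values[name]) for name in feature_order]


feature_order = ["G1_mutation", "G2_expression", "G3_cnv"]
cell_line = {"G1_mutation": 1, "G2_expression": 7.5, "G3_cnv": -0.4}
print(build_feature_vector(feature_order, cell_line))  # [1.0, 7.5, -0.4]
```

Calling the function with a dictionary that lacks any of the listed features raises a `ValueError` instead of producing a shorter vector.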

Probabilistic Methods-

In contrast to the machine learning based models, probability based models do not require any fixed number of input features. The alteration of any genetic feature, such as the mutation, expression or CNV of any gene, may contribute to drug resistance or sensitivity. The probability of such alterations can be examined by this method.
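The text does not spell out how the probability is computed. One simple reading, shown here purely as an assumption, is a conditional frequency: among the cell lines that carry a given alteration, the fraction that are resistant to the drug. The function name `resistance_probability`, the record layout, and the toy data are all hypothetical.

```python
def resistance_probability(records, gene):
    """Estimate P(resistant | gene altered) as a simple frequency.

    records: list of (altered_genes, is_resistant) pairs, one per cell line,
             where altered_genes is a set of gene names (hypothetical layout).
    Returns None when no cell line carries an alteration in the gene.
    """
    outcomes = [is_resistant for altered, is_resistant in records if gene in altered]
    if not outcomes:
        return None
    resistant = sum(1 for flag in outcomes if flag)
    return resistant / len(outcomes)


# Toy data, invented for illustration only.
records = [
    ({"G1"}, True),
    ({"G1", "G2"}, True),
    ({"G2"}, False),
    ({"G1"}, False),
]
print(resistance_probability(records, "G2"))  # 0.5
print(resistance_probability(records, "G3"))  # None
```

Because each gene is evaluated separately, an estimate of this kind can be produced from whatever alterations are known for a sample, without a complete fixed-length input vector.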