Description of Algorithms
- Generate Hypothetical or Random Genome
- Generate Mutated Genome
- Create Simulated Short Reads for a genome
- Evaluate an Assembler
Create Hypothetical or Random Genome
- We create an array of nucleotides with length equal to the genome length provided by the user.
- A random number is then generated in each pass of a 'for loop' (limit equal to the given genome length), and the nucleotide corresponding to that random number is picked.
- In the last step, we build a string of nucleotides by appending the randomly picked nucleotides one after another.
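The steps above can be sketched in Python as follows. This is a minimal illustration rather than the tool's actual code: the names `random_genome`, `NUCLEOTIDES` and the optional `seed` parameter are assumptions added for the example.

```python
import random

# The four nucleotides that random indices are mapped onto.
NUCLEOTIDES = ["A", "T", "G", "C"]


def random_genome(length, seed=None):
    """Return a random nucleotide string of the requested length."""
    rng = random.Random(seed)  # seed is optional, for reproducible output
    picked = []
    for _ in range(length):  # loop limit equals the genome length
        index = rng.randrange(len(NUCLEOTIDES))  # random number 0..3
        picked.append(NUCLEOTIDES[index])  # nucleotide for that number
    # Join the picked nucleotides into one genome string.
    return "".join(picked)
```

For example, `random_genome(1000, seed=42)` returns a 1000-base string that is identical on every run because the seed is fixed.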
Creation of mutated genome
- We create two arrays: one of the four bases (A, T, G and C) and the other of the genome provided by the user.
- The number of mutation steps, 'mutvalue', is calculated as (Genome length * Percentage mutation value (%)) / 100, and a random number is generated from the genome provided, i.e. a position in the genome.
- Another random number is generated from the array of four nucleotides.
- The nucleotides corresponding to both random numbers are then picked.
- In a 'for loop' (limit <= mutvalue), the base picked from the genome array is replaced by the base picked from the array of four bases.
- In this way, a genome can be mutated.
- > Limitation: The same random number can be generated in more than one step, so the base at a particular position may be changed again and again. As a result, the genome might not be 100% mutated, i.e. the achieved mutation level can be lower than the value requested.
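The procedure and its limitation can be sketched in Python as follows. This is an illustrative reading of the steps above, not the tool's actual code: the names `mutate_genome`, `count_differences` and the `seed` parameter are assumptions, and the sketch treats 'mutvalue' as the number of loop iterations.

```python
import random

BASES = ["A", "T", "G", "C"]


def mutate_genome(genome, percent, seed=None):
    """Replace randomly chosen positions with randomly chosen bases."""
    rng = random.Random(seed)
    sequence = list(genome)  # genome as an array of bases
    # mutvalue = (Genome length * Percentage mutation value) / 100
    mutvalue = int(len(sequence) * percent / 100)
    for _ in range(mutvalue):
        position = rng.randrange(len(sequence))  # random genome position
        base = BASES[rng.randrange(len(BASES))]  # random replacement base
        # The same position can be drawn more than once, and the drawn
        # base can equal the existing one, so fewer than mutvalue
        # positions may actually differ afterwards.
        sequence[position] = base
    return "".join(sequence)


def count_differences(original, mutated):
    """Count positions where the two equal-length genomes differ."""
    return sum(1 for a, b in zip(original, mutated) if a != b)
```

Comparing `count_differences(genome, mutate_genome(genome, 10))` with 10% of the genome length shows how far the achieved mutation level falls below the requested one.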
Creation of simulated reads from random genome
- In the 'while loop', the genome file is opened and the whole genome is treated as a string.
- A variable 'numread' is calculated, where 'numread' = (Genome length * Coverage) / Read size.
- In a 'for loop' (limit <= numread), we generate a random number and cut a substring starting at the position equal to that random number. This is the strategy used to make a fragment, as in Solexa technology.
- For single-end reads, we cut a substring of length equal to the read length provided by the user from that fragment, inside this for loop.
- For paired-end reads, a substring of length equal to the read length is cut from the opposite end of that fragment. We then replace its nucleotides with their complementary nucleotides and reverse the read.
- In this way, we generate both Solexa single-end and paired-end reads.
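A minimal Python sketch of this read simulation is given below. It is an illustration under stated assumptions, not the tool's actual code: the description does not say how long a fragment is, so `fragment_size` is a hypothetical parameter, as are the names `simulate_reads`, `reverse_complement` and `seed`. The genome is passed in as a string, so reading the file is left out.

```python
import random

# Translation table mapping each base to its complementary base.
COMPLEMENT = str.maketrans("ATGC", "TACG")


def reverse_complement(read):
    """Complement every nucleotide, then reverse the read."""
    return read.translate(COMPLEMENT)[::-1]


def simulate_reads(genome, coverage, read_size, fragment_size,
                   paired=False, seed=None):
    """Cut random fragments from the genome and take reads from them."""
    if not 0 < read_size <= fragment_size <= len(genome):
        raise ValueError("need 0 < read_size <= fragment_size <= genome length")
    rng = random.Random(seed)
    # numread = (Genome length * Coverage) / Read size
    numread = int(len(genome) * coverage / read_size)
    reads = []
    for _ in range(numread):
        # Random start position, chosen so the fragment fits in the genome.
        start = rng.randrange(len(genome) - fragment_size + 1)
        fragment = genome[start:start + fragment_size]
        forward = fragment[:read_size]  # read from one end of the fragment
        if paired:
            # Mate from the opposite end, complemented and reversed.
            mate = reverse_complement(fragment[-read_size:])
            reads.append((forward, mate))
        else:
            reads.append(forward)
    return reads
```

With `paired=False` the function returns a list of read strings; with `paired=True` it returns a list of `(forward, mate)` tuples, one per fragment.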
Algorithm for evaluation of genome assembly
- N50 contig length = The contig length such that 50% of the de novo assembled genome lies in contigs of this size or larger.
- Genome covered (%) = Total genome covered (nucleotides) * 100 / Total reference genome size
- Contig matches (%) = Total nucleotides of contigs matching the reference genome * 100 / Total contig size sum
- Error rate of contigs (%) = (Total mismatches + Total query gaps + Total hit gaps + Total N's) * 100 / Total contig size sum
- Error rate of total assembly (%) = (Total unalignable base count + Total unalignable contig base count + Total mismatches) * 100 / Total contig size sum
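The formulas above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: all function and parameter names are assumptions, and the input counts are expected to come from an alignment of the contigs against the reference genome, which is not shown here.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):  # longest first
        running += length
        if running * 2 >= total:  # reached 50% of the assembled bases
            return length
    return 0  # no contigs given


def percent(numerator, denominator):
    """Shared helper: numerator * 100 / denominator."""
    return numerator * 100 / denominator


def genome_covered(covered_bases, reference_size):
    return percent(covered_bases, reference_size)


def contig_matches(matched_bases, contig_size_sum):
    return percent(matched_bases, contig_size_sum)


def error_rate_contig(mismatches, query_gaps, hit_gaps, n_count,
                      contig_size_sum):
    return percent(mismatches + query_gaps + hit_gaps + n_count,
                   contig_size_sum)


def error_rate_assembly(unaligned_bases, unaligned_contig_bases,
                        mismatches, contig_size_sum):
    return percent(unaligned_bases + unaligned_contig_bases + mismatches,
                   contig_size_sum)
```

As a worked example, contigs of lengths 2, 3, 4, 5 and 6 sum to 20 bases. Taking the longest first, 6 + 5 = 11 is the first running total to reach half of 20, so `n50([2, 3, 4, 5, 6])` returns 5.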