OSDD-LINUX

Next Generation Sequencing (NGS) data analysis

In the era of Next Generation Sequencing (NGS) technology, it is easy to sequence the whole genome, exome and transcriptome of an organism. However, analysis of the data produced by these technologies poses several challenges, because the high-throughput data come in the form of short reads and contain several artifacts. We have developed several modules for the analysis of NGS data generated by sequencing of whole genomes, transcriptomes and human exomes.

Automated pipeline for whole genome assembly and annotation
We have developed an automated pipeline for the assembly and annotation of microbial genomes. The user provides the paths of the input sequencing read files and the parameters in the configuration file for the pipeline. The pipeline works in three steps: (i) filtering of genome sequencing data, (ii) genome assembly of the filtered reads, (iii) genome annotation of the assembled genome.

USAGE: assemb_anno.pl -i (Configuration file) -o (Output directory name)
Example Command: ./assemb_anno.pl -i Configuration_file -o my_out
-i Configuration_file
-o Output Directory
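
The filtering step can be illustrated with a minimal Python sketch. It is not the code used by assemb_anno.pl: the function names, the thresholds (mean quality 20, minimum length 30) and the Phred+33 quality encoding are all assumptions chosen for illustration.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 is assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(records, min_mean_quality=20.0, min_length=30):
    """Keep (header, sequence, quality) records that pass both thresholds.

    The thresholds are hypothetical defaults, not values taken from the pipeline.
    """
    kept = []
    for header, seq, qual in records:
        if len(seq) >= min_length and mean_phred(qual) >= min_mean_quality:
            kept.append((header, seq, qual))
    return kept


if __name__ == "__main__":
    reads = [
        ("read1", "ACGT" * 10, "I" * 40),  # long, high quality
        ("read2", "ACGT" * 10, "#" * 40),  # long, low quality
        ("read3", "ACGT", "IIII"),         # too short
    ]
    for header, _, _ in filter_reads(reads):
        print(header)
```

Reads that survive a filter of this kind would then be passed to the assembly step.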

Benchmarking of Genome assemblers (GenomeABC server)
Recently, several algorithms have been developed for assembling whole genomes from short reads. A number of these algorithms are freely available for public use in the form of software packages such as Velvet, SOAPdenovo, ABySS, Euler-sr, Edena and SSAKE. At present, it is difficult for users to choose an appropriate assembler for their genomes because existing genome assemblers have not been benchmarked. We have developed the GenomeABC software for the benchmarking of assembled genomes. It includes three modules: (i) Benchmarking of genome assemblers, (ii) Generation of artificial genome and simulated reads, (iii) Generation of mutated genome and the corresponding simulated reads.

(i) Benchmarking of genome assemblers
This is the major module of GenomeABC, which allows users to evaluate their assemblers. To use this module, the user should provide a reference genome and the contigs generated by their assembler. The module compares the contigs with the reference genome in order to evaluate the performance of the assembler. BLAT is used to map the contigs onto the reference genome.

USAGE: benchmarking_new_assembled_genome.pl -c (fasta format contig file) -r (fasta format reference genome file) -o (output file name)
Example Command: ./benchmarking_new_assembled_genome.pl -c contigs.fasta -r ref.fasta -o out.txt
-c Contig file in FASTA format
-r Reference genome file in FASTA format
-o Output file name
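
The kind of statistic such a comparison can yield is sketched below in Python. The metrics actually reported by benchmarking_new_assembled_genome.pl are not listed here, so N50 and the fraction of the reference covered by aligned contigs are assumptions, as are the function names. The alignment intervals are assumed to be half-open (start, end) positions on the reference, for example as parsed from BLAT output.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def covered_fraction(intervals, genome_length):
    """Fraction of the reference covered by (start, end) alignment intervals.

    Overlapping intervals are merged first so that no base is counted twice.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
    print(covered_fraction([(0, 50), (40, 100), (150, 200)], 200))
```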

(ii) Generation of artificial genome and simulated reads
This module allows users to generate an artificial genome. The user specifies the genome size and the percentage of each nucleotide (A, T, G and C), and the module creates a genome of that size and composition. The module also allows users to generate simulated short reads (single-end or paired-end) from the artificial genome, given the read length, insert length and coverage.

USAGE: make_genome.pl -s (Genome size; use 5000000 for 5 Mb) -a (A %, e.g. 25%) -t (T %, e.g. 25%) -g (G %, e.g. 25%) -c (C %, e.g. 25%) -l (Read length) -i (Insert length) -v (Coverage) -y (Type of reads) -o (Output directory)
-s Size of the genome to be created.
-a Percentage of A in the genome.
-t Percentage of T in the genome.
-g Percentage of G in the genome.
-c Percentage of C in the genome.
-l Read length.
-i Insert length.
-v Coverage.
-y Type of reads (single-end (1) or paired-end (2)).
-o Output directory name.
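
How the options above could drive genome and read generation is sketched below in Python. This is an illustration under stated assumptions, not the implementation of make_genome.pl: bases are drawn independently with the given percentages as weights (so the realised composition is only approximately the requested one), only single-end reads are simulated, and the number of reads is taken as genome size × coverage / read length. All function and parameter names are hypothetical.

```python
import random


def make_genome(size, a=25, t=25, g=25, c=25, seed=None):
    """Return a random genome of `size` bases with the given base percentages."""
    if a + t + g + c != 100:
        raise ValueError("base percentages must sum to 100")
    rng = random.Random(seed)
    # Each base is sampled independently, so the composition is approximate.
    return "".join(rng.choices("ATGC", weights=[a, t, g, c], k=size))


def simulate_single_end_reads(genome, read_length, coverage, seed=None):
    """Sample error-free single-end reads uniformly along the genome."""
    rng = random.Random(seed)
    n_reads = (len(genome) * coverage) // read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        reads.append(genome[start:start + read_length])
    return reads


if __name__ == "__main__":
    genome = make_genome(1000, a=30, t=30, g=20, c=20, seed=1)
    reads = simulate_single_end_reads(genome, read_length=50, coverage=10, seed=2)
    print(len(genome), len(reads))
```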

(iii) Generation of mutated genome and simulated reads
This module allows users to mutate a genome. The user should upload a reference genome and specify the percentage of nucleotides to be mutated. The module randomly mutates the requested percentage of positions in the reference genome. It also allows users to generate simulated short reads (single-end or paired-end). This module is useful for evaluating assemblers that assemble genomes with the help of a similar reference genome.

USAGE: make_mut_genome.pl -i (Input genome fasta file) -m (Percentage of mutation) -l (Read length) -f (Insert length) -c (Coverage) -y (Type of reads) -o (Output file)
-i Input genome file.
-m Percentage of mutation.
-l Read length.
-f Insert length.
-c Coverage.
-y Type of reads (single-end (1) or paired-end (2)).
-o Output directory name.
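
Random mutation of a given percentage of positions can be sketched in Python as follows. This is a minimal illustration, not the code of make_mut_genome.pl: it assumes that only substitutions are introduced (no insertions or deletions), that each chosen position receives a base different from the original, and that the number of mutated positions is the rounded percentage of the genome length. The function name is hypothetical.

```python
import random


def mutate_genome(genome, percent, seed=None):
    """Substitute `percent` % of positions; return (mutated genome, positions)."""
    rng = random.Random(seed)
    n_mutations = round(len(genome) * percent / 100)
    # Sample distinct positions so that no site is mutated twice.
    positions = rng.sample(range(len(genome)), n_mutations)
    bases = list(genome)
    for pos in positions:
        alternatives = [b for b in "ACGT" if b != bases[pos].upper()]
        bases[pos] = rng.choice(alternatives)
    return "".join(bases), sorted(positions)


if __name__ == "__main__":
    reference = "ACGT" * 250
    mutated, positions = mutate_genome(reference, percent=5, seed=0)
    print(len(positions), "positions mutated")
```

Returning the mutated positions makes it possible to check afterwards whether an assembler or variant caller recovers them.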

Variation detection in normal-tumor paired data
We have developed a pipeline for the identification of SNPs and somatic variations in normal-tumor paired sequencing data. The user should provide sequencing data from a tumor sample and a normal tissue sample of the same individual, so that both data sets can be compared and SNPs and somatic variations identified. The pipeline works in several steps using different freely available tools: (i) filtering of sequencing data, (ii) alignment of filtered reads to the human genome, (iii) variation detection in the normal-tumor samples, (iv) mapping of somatic variations at gene level.

USAGE: variation_detect.pl -i (Configuration file) -o (Output directory name)
Example Command: ./variation_detect.pl -i Configuration_file -o my_out
-i Configuration_file
-o Output Directory
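
The logic of the last two steps, comparing the normal and tumor calls and mapping somatic variations to genes, can be sketched in Python. This is an illustration under assumptions, not the method used by variation_detect.pl: a variant is represented as a (chromosome, position, reference base, alternate base) tuple, a variant seen only in the tumor is treated as somatic, a variant seen in both samples is labelled germline, and genes are given as hypothetical name-to-interval entries with inclusive coordinates.

```python
def classify_variants(normal, tumor):
    """Split tumor variants into those shared with normal and tumor-only ones."""
    normal_set, tumor_set = set(normal), set(tumor)
    return {
        "germline": sorted(tumor_set & normal_set),  # present in both samples
        "somatic": sorted(tumor_set - normal_set),   # present in tumor only
    }


def genes_with_somatic_variants(somatic, genes):
    """Map each gene name to the somatic variants inside its interval."""
    hits = {}
    for chrom, pos, ref, alt in somatic:
        for name, (g_chrom, g_start, g_end) in genes.items():
            if chrom == g_chrom and g_start <= pos <= g_end:
                hits.setdefault(name, []).append((chrom, pos, ref, alt))
    return hits


if __name__ == "__main__":
    normal = [("chr1", 100, "A", "G"), ("chr2", 500, "C", "T")]
    tumor = [("chr1", 100, "A", "G"), ("chr1", 250, "G", "A"), ("chr3", 42, "T", "C")]
    genes = {"GENE_A": ("chr1", 200, 300)}  # hypothetical gene interval
    result = classify_variants(normal, tumor)
    print(result["somatic"])
    print(genes_with_somatic_variants(result["somatic"], genes))
```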