Next Generation Sequencing (NGS) software packages
We have developed an automated pipeline for the assembly and annotation of microbial genomes. The user provides the paths of the input sequencing read files and the parameters in the configuration file of the pipeline. The pipeline works in three steps: (i) filtering of genome sequencing data, (ii) genome assembly of the filtered reads, and (iii) annotation of the assembled genome.
USAGE: assemb_anno.pl -i (Configuration file) -o (Output directory name)
Recently, several algorithms have been developed for assembling whole genomes from short reads. A number of them are freely available for public use as software packages, such as Velvet, SOAPdenovo, ABySS, Euler-sr, Edena and SSAKE. At present, it is difficult for users to choose an appropriate assembler for their genomes because existing genome assemblers have not been benchmarked. We have developed the GenomeABC software for the benchmarking of assembled genomes. It includes three modules: (i) benchmarking of genome assemblies, (ii) generation of an artificial genome and simulated reads, and (iii) generation of a mutated genome and the corresponding simulated reads.
(i) Benchmarking of genome assemblies
We have developed a pipeline for the identification of SNPs and somatic variations in normal-tumor paired sequencing data. The user should provide sequencing data from a tumor sample and a normal tissue sample of the same individual, so that both data sets can be compared and SNPs and somatic variations identified. The pipeline works in several steps, using different freely available tools: (i) filtering of sequencing data, (ii) alignment of filtered reads to the human genome, (iii) variation detection in the normal-tumor samples, and (iv) mapping of somatic variations at the gene level.
USAGE: variation_detect.pl -i (Configuration file) -o (Output directory name)
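The idea behind the variation detection step can be illustrated with a minimal sketch. It assumes that variants have already been called separately for the tumor and normal samples and are represented as hypothetical (chromosome, position, reference base, alternate base) tuples. Real somatic callers typically apply statistical tests on read counts, which this sketch does not attempt to reproduce.

```python
def somatic_variants(tumor_variants, normal_variants):
    """Return variants seen in the tumor sample but not in the normal sample.

    Each variant is a (chrom, pos, ref, alt) tuple; this representation is
    an assumption made for illustration only.
    """
    return sorted(set(tumor_variants) - set(normal_variants))


def shared_variants(tumor_variants, normal_variants):
    """Return variants present in both samples (candidate germline SNPs)."""
    return sorted(set(tumor_variants) & set(normal_variants))


# Hypothetical example calls
tumor = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")]
normal = [("chr1", 1000, "A", "G")]
```

Here `somatic_variants(tumor, normal)` keeps only the chr2 variant, while `shared_variants(tumor, normal)` keeps the chr1 variant found in both samples.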
Software packages (.deb) for genome assembly and annotation Installation instructions
This pipeline has been developed for whole genome assembly and annotation of microbes (bacterial and fungal genomes). It uses a wide variety of software for this purpose and runs in three main steps.
(1) Filter the raw sequencing data
The first step is to filter the raw sequencing reads, keeping high quality bases and removing reads contaminated with vector and adaptor sequences.
For this purpose, the NGS-QC toolkit is integrated into the pipeline. BioPerl is required for this software to work.
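As an illustration of quality filtering, the following minimal sketch keeps only reads whose mean Phred quality reaches a threshold. The function names, the threshold of 20 and the Phred+33 encoding are assumptions made for this example; the sketch does not reproduce the criteria used by the NGS-QC toolkit and does not handle vector or adaptor removal.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def filter_reads(records, min_mean_quality=20):
    """Keep (name, sequence, quality) records with a high enough mean quality.

    Records with an empty quality string are dropped.
    """
    return [
        (name, seq, qual)
        for name, seq, qual in records
        if qual and mean_phred(qual) >= min_mean_quality
    ]


# Hypothetical reads: 'I' encodes Phred 40 and '!' encodes Phred 0
records = [("read1", "ACGT", "IIII"), ("read2", "ACGT", "!!!!")]
```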
(2) Genome assembly of filtered data
The filtered reads are then assembled with user defined parameters (i.e. hash lengths, K). The assembly results are provided to the user for selecting the best result. Velvet and SOAPdenovo are used for genome assembly at this step.
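One common way to compare assemblies produced with different hash lengths is the N50 statistic. The sketch below is only an illustration of that metric; the text does not state which criterion the pipeline or the user applies, and the contig lengths are made up.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    N50 is the length of the contig at which the running total of the
    contigs, sorted from longest to shortest, reaches half the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


def best_assembly(assemblies):
    """Return the hash length whose contigs have the highest N50."""
    return max(assemblies, key=lambda k: n50(assemblies[k]))


# Hypothetical contig lengths obtained with three hash lengths
assemblies = {
    21: [100, 200, 300, 400],
    31: [500, 300, 200],
    41: [900, 50, 50],
}
```

N50 alone can favour misassembled contigs, so in practice it is usually read together with other measures such as the number of contigs and the total assembly size.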
(3) Whole genome annotation
The best genome assembly set is then used for genome annotation. Prokka and MAKER have been integrated for the annotation of bacterial and fungal genomes, respectively. The genome assembly set and the annotated genome files are produced as the output of this pipeline.
Dependencies:-
Several BioPerl libraries need to be installed for Prokka and MAKER to function fully.
The user should be aware of the dependencies of the integrated software.
A standalone version of the GenomeABC server has been developed for analysing assembled genomes and benchmarking assemblers. It is a set of simple Perl scripts that users can run easily. BLAT and BioPerl are required to run the GenomeABC software.
This pipeline uses several software packages.
(2) BWA has been integrated for the alignment of filtered reads to the human reference genome.
(4) Finally, VarScan.v2.3.5 detects the somatic variations and SNPs in the given sequencing data.
The user should have all of this software installed to run this pipeline.
Debian packages of all of this software can be downloaded from the OSDDlinux website (http://osddlinux.osdd.net/ngs.php).
Users can download the .deb packages from the OSDDlinux page (http://osddlinux.osdd.net/ngs.php). To install these packages, the user should have the OSDDlinux operating system with the /gpsr/software directory. To install a package, the user simply needs to execute the command:-
sudo dpkg -i package.deb
The software is installed automatically in the /gpsr/software/ directory, and the executable files can be called from the /gpsr/local/bin directory.
Example:- sudo dpkg -i maq.deb
Installation location : /gpsr/software/
Executable present: /gpsr/local/bin
In the era of Next Generation Sequencing (NGS) technology, it is easy to sequence the whole genome, exome and transcriptome of an organism. However, the analysis of the data produced by these technologies poses several challenges, because the high throughput data come in the form of short reads and contain several artifacts. We have developed several modules for the analysis of NGS data generated by sequencing whole genomes, transcriptomes and human exomes.
Example Command: ./assemb_anno.pl -i Configuration_file -o my_out
-i Configuration_file
-o Output Directory
This is a major module of GenomeABC, which allows users to evaluate their assemblers. To use this module, the user should provide a reference genome and the contigs generated by their assemblers. The module compares the contigs with the reference genome in order to evaluate the performance of the assemblers. BLAT is used to map the contigs onto the reference genome.
USAGE: benchmarking_new_assembled_genome.pl -c (fasta format contig file) -r (fasta format reference genome file) -o (output file name)
Example Command: ./benchmarking_new_assembled_genome.pl -c contigs.fasta -r ref.fasta -o out.txt
-c Contig file in FASTA format
-r Reference genome file in FASTA format
-o Output file name
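One measure that such a comparison can report is the percentage of the reference genome covered by mapped contigs. The sketch below assumes that the alignments have already been parsed into hypothetical (start, end) intervals on the reference, 0-based and half-open. It does not reproduce the metrics actually computed by benchmarking_new_assembled_genome.pl, which are not listed here.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def genome_coverage(intervals, genome_length):
    """Percentage of the reference covered by at least one alignment."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return 100.0 * covered / genome_length


# Hypothetical alignments of three contigs on a 1000 bp reference
alignments = [(0, 100), (50, 150), (300, 400)]
```

Merging the intervals first ensures that regions covered by more than one contig are counted only once.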
(ii) Generation of artificial genome and simulated reads
This module allows users to generate an artificial genome. The user specifies the genome size and the percentage of each nucleotide (A, T, G and C) in the genome. The module also generates simulated short reads (single-end or paired-end reads) from the artificial genome, using the given read length, insert length and coverage.
USAGE: make_genome.pl -s (Genome Size (Put 5000000 for 5-Mb)) -a (A % (i.e. 25%)) -t (T % (i.e. 25%)) -g (G % (i.e. 25%)) -c (C % (i.e. 25%)) -l (Read length) -i (Insert length) -v (Coverage) -y (Type of reads) -o (Out directory)
-s Size of the genome to be created.
-a Percentage of A in the genome.
-t Percentage of T in the genome.
-g Percentage of G in the genome.
-c Percentage of C in the genome.
-l Read length.
-i Insert length.
-v Coverage.
-y Type of reads (single end (1) or paired end (2)).
-o Output directory name.
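The options above can be illustrated with a minimal sketch. It treats the base percentages as sampling probabilities, so the realised composition is only approximately the requested one, and it covers single-end reads only. The function names and the seed parameter are assumptions for this example; this is not the implementation of make_genome.pl.

```python
import random


def make_genome(size, a, t, g, c, seed=None):
    """Generate a random genome of `size` bases with the given base percentages."""
    if abs(a + t + g + c - 100) > 1e-9:
        raise ValueError("base percentages must sum to 100")
    rng = random.Random(seed)
    return "".join(rng.choices("ATGC", weights=[a, t, g, c], k=size))


def simulate_single_end_reads(genome, read_length, coverage, seed=None):
    """Sample error-free single-end reads at roughly the requested coverage."""
    if read_length > len(genome):
        raise ValueError("read length exceeds genome size")
    rng = random.Random(seed)
    # Number of reads needed so that total read bases = coverage * genome size
    n_reads = (coverage * len(genome)) // read_length
    last_start = len(genome) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)
        reads.append(genome[start:start + read_length])
    return reads


# Hypothetical run: 1 kb genome, 50 bp reads, 10x coverage
genome = make_genome(1000, 25, 25, 25, 25, seed=1)
reads = simulate_single_end_reads(genome, 50, 10, seed=1)
```

Paired-end simulation would additionally draw a second read from the opposite strand at the given insert length.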
(iii) Generation of mutated genome and simulated reads
This module allows users to mutate a genome. The user should upload a reference genome and specify the percentage of nucleotides to be mutated. The module randomly mutates the desired number of positions (% of mutation) in the reference genome. It also allows users to generate simulated short reads (single-end or paired-end reads). This module is useful for evaluating assemblers that assemble genomes based on similar reference genomes.
USAGE: make_mut_genome.pl -i (Input genome fasta file) -m (Percentage of mutation) -l (Read length) -f (Insert length) -c (Coverage) -y (Type of reads) -o (Output file)
-i Input genome file.
-m Percentage of mutation.
-l Read length.
-f Insert length.
-c Coverage.
-y Type of reads (single end (1) or paired end (2)).
-o Output directory name.
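The random mutation step can be sketched as follows. This is a minimal illustration that assumes an upper-case DNA string and substitutions only (no insertions or deletions); the function name and the seed parameter are hypothetical, and the behaviour of make_mut_genome.pl may differ.

```python
import random


def mutate_genome(genome, percent, seed=None):
    """Substitute `percent` % of the positions in `genome` with a different base."""
    rng = random.Random(seed)
    n_mutations = round(len(genome) * percent / 100)
    # Choose distinct positions so that each one is mutated exactly once
    positions = rng.sample(range(len(genome)), n_mutations)
    bases = list(genome)
    for pos in positions:
        alternatives = [b for b in "ACGT" if b != bases[pos]]
        bases[pos] = rng.choice(alternatives)
    return "".join(bases)


# Hypothetical run: mutate 2% of a 1000 bp reference
reference = "ACGT" * 250
mutated = mutate_genome(reference, 2, seed=0)
```

Because every chosen position receives a base different from the original, the number of mismatches between `reference` and `mutated` equals the number of mutated positions.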
Example Command: ./variation_detect.pl -i Configuration_file -o my_out
-i Configuration_file
-o Output Directory
Program | Purpose | Usage
ABySS | Genome assembler | Command line
Amos | Genome assembler | Command line
Artemis | Genome Viewer | Graphical user interface
Augustus | Gene prediction | Command line
Amphora | Phylogenomic Inference Pipeline for Bacterial and Archaeal Sequences | Command line
Annovar | Variation prediction | Command line
Blat | Alignment tool, faster than BLAST | Command line
Blast | Alignment tool | Command line
Brig | Genome Viewer | Graphical user interface
Celera | Genome assembler | Command line
Chimerascan | Chimeric transcripts detector | Command line
Cufflinks | Transcript assembly, differential expression, and differential regulation for RNA-Seq | Command line
Edena | Genome assembler | Command line
EVM | Gene prediction | Command line
FastQC | Filter NGS data i.e. Short reads | Graphical user interface
FastXQC | Filter NGS data i.e. Short reads | Command line
Genemark | Gene prediction | Command line
Genosets | Comparative Genomics visualization | Graphical user interface
Glimmer | Gene prediction | Command line
IGV | Genome Viewer | Graphical user interface
JSpecies | Genome comparison | Graphical user interface
ALLPATHS-LG | Genome assembler | Command line
Maker | Genome annotation pipeline, Eukaryotes | Command line
Maq | Short reads aligner | Command line
Mauve | Genome Viewer | Command line
Mummer | Genome comparison | Command line
Mira | Genome assembler | Command line
NGS-QC toolkit | Filter NGS data i.e. Short reads | Command line
Pasha | Parallelized Short Read Assembly | Command line
Ray | Genome assembler | Command line
RNAmmer | RNA prediction | Command line
SOAPdenovo | Genome assembler | Command line
SOAP-aligner | Short reads aligner | Command line
Spades | Genome assembler | Command line
Tablet | Genome alignment viewer | Graphical user interface
Tophat | A spliced read mapper for RNA-Seq | Command line
Vaast | Variation prediction | Command line
VCFtools | Variation prediction | Command line
Newbler | Genome assembler | Command line
(1) The first step is to filter the raw sequencing reads, keeping high quality bases and removing reads contaminated with vector and adaptor sequences.
For this purpose, the NGS-QC toolkit has been integrated into the pipeline.
(3) In the next step, SAMtools processes the alignment files.
Dependencies:- The user should have all the software mentioned above in the default path, i.e. /gpsr/local/bin, to run this pipeline.
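Before running the pipeline, the user may want to confirm that the required executables are present in the default path. The following is a minimal sketch of such a check; the executable names passed to it (for example "bwa" or "samtools") are assumptions, since the exact file names installed by the packages are not listed here.

```python
import os


def missing_tools(tool_names, bin_dir="/gpsr/local/bin"):
    """Return the names that are not executable files inside `bin_dir`."""
    missing = []
    for name in tool_names:
        path = os.path.join(bin_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing


# Hypothetical executable names; adjust them to the installed packages
required = ["bwa", "samtools"]
```

Calling `missing_tools(required)` returns an empty list when every listed executable is found in /gpsr/local/bin.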