gatk_basic

A simple shell script that runs the GATK Best Practices pipeline in an Ubuntu environment.

This is a basic variant calling and annotation pipeline for beginners. It uses the GATK HaplotypeCaller for variant calling and SnpEff for variant annotation, and it is written as a single shell script to help people new to variant calling and bioinformatics. It covers basic variant calling and annotation only. It is written for educational purposes and must not be used for clinical samples.

There are two scripts included in this repository. The scripts can be found in the gatk_basic/scripts folder.

GATK_pipeline.sh script runs the pipeline for variant calling and variant annotation step by step.

geneReport.sh gets the missense variants in a particular gene.

Tools Required:
BWA, Samtools, Picard, Tabix, GATK, SnpEff

To Run:

Install the required tools.
Create a directory (> mkdir scripts) and place the shell scripts (GATK_pipeline.sh and geneReport.sh) and the "header.txt" file inside it.
Place the samples inside a directory (> mkdir samples), or specify the path to the sample files.
Provide the sample names manually inside the script. For single-end reads, or a single file with collapsed paired-end reads, use sample="read.fq". For paired-end reads, use sample="read_1.fq read_2.fq".
Download the reference genome that matches the coordinates of your data and provide its path.
Specify the paths of the jar files for the Picard and GATK tools.
Select the SnpEff database that corresponds to your data (e.g. hg38, hg19).
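
The sample setting above distinguishes single-end from paired-end input by how many file names the variable holds. A minimal sketch of that check follows. It assumes a hypothetical helper named read_layout, which is not part of GATK_pipeline.sh, and file names that contain no spaces.

```shell
# Sketch only: read_layout is a hypothetical helper, not part of GATK_pipeline.sh.
# It reports whether a sample string names one file (single end) or two (paired end).
read_layout() {
    # Intentionally unquoted so the value splits on whitespace into $1, $2, ...
    set -- $1
    case $# in
        1) echo "single" ;;
        2) echo "paired" ;;
        *) echo "invalid" ;;
    esac
}

sample="read_1.fq read_2.fq"
echo "Layout for '$sample': $(read_layout "$sample")"
```

A check like this could run before alignment so that a mistyped sample value is caught early.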

Tools Required and Installation guide (Linux-Ubuntu):

Open the Terminal program on Ubuntu and copy and paste these commands in the same sequence.

BWA (Alignment tool)

> sudo apt-get install bwa

Samtools (variant calling and sequence manipulation)

> sudo apt-get install samtools

Picard (required JAVA v1.9) (sequence data manipulation and cleaning)

> wget https://github.com/broadinstitute/picard/releases/download/2.18.1/picard.jar
(Place this jar file in a folder of your choice so that you can reuse it whenever you run Picard. Note: Picard is used to collect quality metrics from the sequence data and also to mark PCR duplicates.)

Tabix (data transformation; the apt package does not need Java)

> sudo apt-get install tabix

GATK (required JAVA v1.9) (variant calling)

Download the jar file from (https://software.broadinstitute.org/gatk/download/archive)

SnpEff & SnpSift (required JAVA v1.9) (variant annotation)

1. Download the jar file from (http://snpeff.sourceforge.net/download_donate.html)

2. Build the database for the reference genome from UCSC (RefSeq IDs: NM_*, NP_*)
  * Please follow the instructions at the given link
  (http://snpeff.sourceforge.net/SnpEff_manual.html#databases) - Option 3: Building a database from RefSeq table from UCSC

FastQC (fast raw data quality control)

> sudo apt-get install fastqc

XML parser for RefSeq protein ID annotation (parses XML fetched from NCBI to extract NP IDs)

> sudo apt-get install libxml2-utils
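
After installing, it can help to confirm that the command-line tools are on the PATH before starting a run. The sketch below assumes a hypothetical helper named missing_tools. It checks only the executables installed through apt plus java (xmllint is the command provided by libxml2-utils). The Picard, GATK and SnpEff jar files are not checked, because their locations depend on where you saved them.

```shell
# Sketch only: missing_tools is a hypothetical helper, not part of this repository.
# It prints the name of every given command that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools bwa samtools tabix fastqc xmllint java)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "All command-line tools found."
fi
```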

Links to download annotation source files.

Reference genome --> http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

Knownsites for GATK-Recalibration --> ftp://[email protected]/bundle/hg38/dbsnp_138.hg38.vcf.gz

Giving permissions to the scripts to be run by users:

chmod +x GATK_pipeline.sh

chmod +x geneReport.sh

Scripts notes:

--> GATK_pipeline.sh performs QC, aligns the samples to the reference, calls variants with the GATK HaplotypeCaller, and annotates the variants for functional effects with SnpEff.

--> geneReport.sh searches for your gene of interest in the annotated VCF file produced by GATK_pipeline.sh.

--> Run GATK_pipeline.sh first. It produces a set of result files. Give the annotated VCF file (gene.ann.vcf) and the gene of interest (one gene, or several as a comma-separated list) as input to geneReport.sh.
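
Because several genes can be given as one comma-separated value, the list has to be split before each gene is looked up. The following is a minimal sketch of one way to do that. The helper name split_genes and the example gene names are assumptions for illustration and are not taken from geneReport.sh.

```shell
# Sketch only: split_genes is a hypothetical helper, not taken from geneReport.sh.
# It prints one gene name per line, trimming spaces and dropping empty entries.
split_genes() {
    printf '%s\n' "$1" | tr ',' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

genes="BRCA1, TP53,EGFR"   # example input only
split_genes "$genes" | while read -r gene; do
    echo "Would report on gene: $gene"
done
```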

Variant Calling:

./GATK_pipeline.sh

OR

bash GATK_pipeline.sh

Gene wise Report:

NOTE: for this step, make sure the file header.txt is in the same folder as the script

./geneReport.sh

OR

bash geneReport.sh

This script produces gene.txt, gene.report.txt, and gene.missense.txt. The file gene.missense.txt contains only the missense variants in the given gene.
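
To illustrate what a missense filter on a SnpEff-annotated VCF can look like, here is a minimal sketch. It is not the logic of geneReport.sh, which may work differently. It assumes the standard SnpEff ANN field, where each annotation is a "|"-separated record with the effect in the second position and the gene name in the fourth. The function name missense_in_gene is hypothetical.

```shell
# Sketch only: missense_in_gene is a hypothetical helper, not the code of geneReport.sh.
# Usage: missense_in_gene GENE annotated.vcf
# Prints every VCF record with at least one missense annotation for GENE.
missense_in_gene() {
    awk -F'\t' -v gene="$1" '
        /^#/ { next }                                  # skip VCF header lines
        {
            n = split($8, info, ";")                   # column 8 is INFO
            for (i = 1; i <= n; i++) {
                if (info[i] !~ /^ANN=/) continue
                # Annotations for one variant are separated by commas.
                m = split(substr(info[i], 5), anns, ",")
                for (j = 1; j <= m; j++) {
                    split(anns[j], f, "[|]")
                    # f[2] is the effect, f[4] is the gene name.
                    if (f[2] ~ /missense_variant/ && f[4] == gene) { print; next }
                }
            }
        }
    ' "$2"
}
```

For example, `missense_in_gene BRCA1 gene.ann.vcf` would print the matching records to the terminal. BRCA1 is only an example gene name.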
