Skip to content

Genotyping Structural Variants

Glenn Hickey edited this page May 10, 2019 · 12 revisions

SV Genotyping with toil-vg

toil-vg can be used to estimate the genotypes of a given set of variants from short read alignment data from a single sample. The process is as follows:

  • Create and index graph from VCF using toil-vg construct
  • Map reads to graph using toil-vg map
  • Compute genotypes for each variant in the input VCF using toil-vg call

Chromosome 22 HG002 GIAB example

This example shows how to genotype chromosome 22 from HG002 using the GIAB VCF. It can be run locally on most systems. It takes roughly 40 minutes overall on my 8-core I7 desktop. It is straightforward to specify additional chromosomes on the command lines, though this is best done on the cloud. Some of the Toil options required for doing so are explained in the toil-vg README. Example scripts for whole-genome SV genotyping on AWS can be found here.

The sample we are genotyping, HG002, is in the original VCF. This is not a requirement, as any sample can be genotyped against any VCF. Using HG002 allows us to easily compute the accuracy (of vg versus the GIAB calls) in the final step, however.

In general it is best to represent insertions and deletions explicitly in the input VCF if possible. vg supports symbolic alleles (enabled with --handle_svs in toil-vg construct), but it often doesn't work in practice as the VCF spec is so loose, and the interface is not presently enabled in toil-vg call. This is not an issue for GIAB as it does not use symbolic alleles.

We begin by constructing and indexing the GIAB SV chr22 graph. GCSA indexing requires several gigs of scratch space, so the --workDir . option below my need to be adjusted accordingly:

rm -rf ./jobStore ; toil-vg construct ./jobStore . --vcf ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz --fasta ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz --regions 22 --out_name GIAB --xg_index --gcsa_index --realTimeLogging --pangenome --flat_alts --handle_svs --alt_path_gam_index --validate --normalize --merge_graphs --gcsa_index_cores $(nproc) --workDir .

Next, we extract 30X chromosome 22 reads from the GIAB BAM.

samtools view ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/HG002.hs37d5.300x.bam  22 -s 1.10 -h -O BAM > HG002_chr22_30X.bam 

Then we map these reads to the graph to produce a GAM.

rm -rf jobStore ; toil-vg map ./jobStore HG002 ./GIAB.xg ./GIAB.gcsa . --bam_input_reads ./HG002_chr22_30X.bam --interleaved --alignment_cores $(nproc) --single_reads_chunk --realTimeLogging

We can then genotype the VCF. Note that the VCF we genotype (specified below with --genotype_vcf) must be the same VCF, or a subset thereof, that was used to construct the graph. Running the command below without specifying the --genotype_vcf or --alt_path_gam options will call small variants from the reads in addition to SVs in the graph.

rm -rf jobStore ; toil-vg call ./jobStore ./GIAB.xg HG002 . --chroms 22 --gams HG002_default.gam --genotype_vcf ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz --alt_path_gam ./GIAB_alts.gam --realTimeLogging --call_chunk_cores $(nproc)

The output genotypes will be in HG002.vcf.gz. They can be compared back to the original GIAB genotypes as follows

wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz
bcftools view HG002_SVs_Tier1_v0.6.vcf.gz -t 22 -O z > giab_chr22.vcf.gz
tabix -f -p vcf giab_chr22.vcf.gz
rm -rf jobStore ; toil-vg vcfeval ./jobStore . --vcfeval_baseline ./giab_chr22.vcf.gz --call_vcf ./HG002.vcf.gz --genotype_eval --sveval --vcfeval_sample HG002 --realTimeLogging

The evaluation results will be in sv_accuracy.tsv and sv_evaluation.tar.gz

cat sv_accuracy.tsv
type	TP	TP.baseline	FP	FN	precision	recall	F1
Total	536	549	387	300	0.5807	0.6466	0.6119
INS	278	283	214	161	0.565	0.6374	0.599
DEL	258	266	173	139	0.5986	0.6568	0.6264