
Commit

Vep vcf output format
dmaziec committed Oct 12, 2020
1 parent 1e00b33 commit 5684640
Showing 1 changed file with 2 additions and 3 deletions.
5 changes: 2 additions & 3 deletions core/src/main/scala/org/bdgenomics/cannoli/Vep.scala
@@ -83,12 +83,11 @@ class Vep(
.setExecutable(args.executable)
.add("--format")
.add("vcf")
.add("--vcf")

@heuermh, Oct 26, 2020:

Are both --format vcf and --vcf required?

@dmaziec (author, owner), Oct 26, 2020:

We should specify --format vcf when we pass input data via STDIN; otherwise the following error is encountered:
Cannot detect format from STDIN - specify format with --format [format] at /usr/local/share/ensembl-vep-100.2-0/modules/Bio/EnsEMBL/VEP/Parser.pm line 369.
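To illustrate, here is a minimal Scala sketch (not Cannoli's actual builder; the flag set is abbreviated) of why the explicit --format vcf matters when input arrives on STDIN:

```scala
// Sketch only: Cannoli pipes VCF records into VEP's STDIN, so VEP has no
// file name to sniff the input format from and needs --format vcf up front.
val args = Seq(
  "vep",
  "--format", "vcf",        // describes the INPUT stream (required for STDIN)
  "--vcf",                  // independent flag: write VCF on OUTPUT
  "--output_file", "STDOUT"
)
println(args.mkString(" "))
```

--format describes the input stream while --vcf selects the output format, which is why both flags are needed.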

@heuermh, Oct 26, 2020:

Ah, that makes sense, thank you!

.add("--output_file")
.add("STDOUT")
.add("--vcf_info_field")
.add("ANN")

@heuermh, Oct 26, 2020:

Is this now the default, or is there another reason for removing this?

@dmaziec (author, owner), Oct 26, 2020:

Unfortunately, keeping this flag leads to another failure:

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
	at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
	... 31 more
Caused by: java.lang.NumberFormatException: For input string: "7264-7275"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:580)
	at java.lang.Integer.parseInt(Integer.java:615)
	at org.bdgenomics.adam.converters.TranscriptEffectConverter$.parseFraction(TranscriptEffectConverter.scala:99)
	at org.bdgenomics.adam.converters.TranscriptEffectConverter$.parseTranscriptEffect(TranscriptEffectConverter.scala:143)
	at org.bdgenomics.adam.converters.TranscriptEffectConverter$$anonfun$parseAnn$1.apply(TranscriptEffectConverter.scala:183)
	at org.bdgenomics.adam.converters.TranscriptEffectConverter$$anonfun$parseAnn$1.apply(TranscriptEffectConverter.scala:183)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.bdgenomics.adam.converters.TranscriptEffectConverter$.parseAnn(TranscriptEffectConverter.scala:183)
	at org.bdgenomics.adam.converters.TranscriptEffectConverter$.org$bdgenomics$adam$converters$TranscriptEffectConverter$$parseAndFilter$1(TranscriptEffectConverter.scala:205)
	at org.bdgenomics.adam.converters.TranscriptEffectConverter$$anonfun$convertToTranscriptEffects$1.apply(TranscriptEffectConverter.scala:217)
	at org.bdgenomics.adam.converters.TranscriptEffectConverter$$anonfun$convertToTranscriptEffects$1.apply(TranscriptEffectConverter.scala:217)
	at scala.Option.flatMap(Option.scala:171)
	at org.bdgenomics.adam.converters.TranscriptEffectConverter$.convertToTranscriptEffects(TranscriptEffectConverter.scala:217)
	at org.bdgenomics.adam.converters.VariantContextConverter.formatTranscriptEffects(VariantContextConverter.scala:700)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$26.apply(VariantContextConverter.scala:718)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$26.apply(VariantContextConverter.scala:718)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$74$$anonfun$apply$79.apply(VariantContextConverter.scala:1722)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$74$$anonfun$apply$79.apply(VariantContextConverter.scala:1722)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$75.apply(VariantContextConverter.scala:1727)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$75.apply(VariantContextConverter.scala:1725)
	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
	at scala.collection.immutable.List.foldLeft(List.scala:84)
	at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$convert$1(VariantContextConverter.scala:1724)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeVariantFormatFn$1.apply(VariantContextConverter.scala:1759)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeVariantFormatFn$1.apply(VariantContextConverter.scala:1759)
	at org.bdgenomics.adam.converters.VariantContextConverter.convert(VariantContextConverter.scala:354)
	at org.bdgenomics.adam.rdd.variant.VCFOutFormatter.convertIterator$1(VCFOutFormatter.scala:123)
	at org.bdgenomics.adam.rdd.variant.VCFOutFormatter.read(VCFOutFormatter.scala:129)
	at org.bdgenomics.adam.rdd.OutFormatterRunner.<init>(OutFormatter.scala:32)
	at org.bdgenomics.adam.rdd.GenomicDataset$$anonfun$15.apply(GenomicDataset.scala:868)
	at org.bdgenomics.adam.rdd.GenomicDataset$$anonfun$15.apply(GenomicDataset.scala:830)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

@heuermh, Oct 26, 2020:

This I have seen before -- despite being party to the VCF ANN specification, VEP must be doing something different than snpEff. I will investigate further!

@dmaziec (author, owner), Oct 26, 2020:

From my quick research: the error above reports
java.lang.NumberFormatException: For input string: "7264-7275"
thrown from
at org.bdgenomics.adam.converters.TranscriptEffectConverter$.parseFraction(TranscriptEffectConverter.scala:99)

I ran VEP locally in Docker to inspect the output: ANN=CCCCCCCCGTG|3_prime_UTR_variant|MODIFIER|ARID1A|ENSG00000117713|Transcript|ENST00000324856|protein_coding|20/20||||7264-7275|||||||1||HGNC|HGNC:11110,CCCCCCCCGTG|downstream_gene_variant|MODIFIER|ARID1A|ENSG00000117713|

The field 7264-7275 sits at position 12, where parseFraction expects a value that can be split on "/" (like the 20/20 earlier in the record).

So VEP's output does not meet that expectation.
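A minimal Scala sketch (simplified; this is not ADAM's actual implementation) of why a snpEff-style rank value such as "20/20" parses cleanly while VEP's position range "7264-7275" throws the exception seen in the stack trace:

```scala
// Sketch of a fraction parser like ADAM's parseFraction: split on "/" and
// parse each side as an integer. VEP emits "7264-7275" (a cDNA position
// range) in this slot, and "7264-7275".toInt throws NumberFormatException.
def parseFraction(s: String): (Int, Option[Int]) = {
  val parts = s.split("/")
  val numerator = parts(0).trim.toInt // throws for "7264-7275"
  val denominator = if (parts.length > 1) Some(parts(1).trim.toInt) else None
  (numerator, denominator)
}

println(parseFraction("20/20")) // snpEff-style rank/total parses cleanly
```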

@heuermh, Oct 26, 2020:

Thanks! That is similar to what I found; hopefully bigdatagenomics/adam#2278 will address it properly.

.add("--terms")
.add("so")
.add("SO")

@heuermh, Oct 26, 2020:

In bigdatagenomics#251 I removed --terms SO because it is the default. Can these also be removed here?

@dmaziec (author, owner), Oct 26, 2020:

Yes, I will remove this.

.add("--no_stats")
.add("--offline")
.add("--dir_cache")
