Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RNA-seq and DNA-seq data were mapped to different reference genome #56

Open
YiweiNiu opened this issue Sep 6, 2022 · 7 comments
Open

Comments

@YiweiNiu
Copy link

YiweiNiu commented Sep 6, 2022

Hi,

I am using JACUSA2 to detect RNA editing events with paired RNA-seq and DNA-seq data.

However, DNA-seq and RNA-seq data were mapped to different reference genomes. The reference genomes were both hg38 with the same names for major chromosomes, but the reference genome for DNA-seq has some alternative scaffolds and decoy sequences.

Here is the reference genome for DNA-seq: GRCh38_full_analysis_set_plus_decoy_hla.fa, and the reference genome for RNA-seq: Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC92.fasta.

First, I encountered the error: "Sequence Dictionaries of BAM files do not match"

INFO   	00:00:00  Computing overlap between sequence records.
htsjdk.samtools.SAMException: Sequence Dictionaries of BAM files do not match /home2/niuyw/project/MEI2/gatkout/HG00315/HG00315.final.bam and /home2/niuyw/Data_processing/STAR/Geuvadis/HG00315.sortedByCoord.q255.rmDup.bam
	at lib.util.AbstractMethod.getSAMSequenceRecords(AbstractMethod.java:247)
	at lib.util.AbstractMethod.initCoordinateProvider(AbstractMethod.java:189)
	at lib.cli.CLI.processArgs(CLI.java:308)
	at lib.util.AbstractTool.run(AbstractTool.java:48)
	at jacusa.JACUSA.main(JACUSA.java:96)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)
--------------------------------------------------------------------------------
JACUSA2 Version: 2.0.2-RC call-2 -p 4 -a B,D,H:condition=1,I,S,Y -s -f B -q 35 -r /home2/niuyw/Data_processing/JACUSA2/Geuvadis/HG00315/HG00315.txt /home2/niuyw/project/MEI2/gatkout/HG00315/HG00315.final.bam /home2/niuyw/Data_processing/STAR/Geuvadis/HG00315.sortedByCoord.q255.rmDup.bam
--------------------------------------------------------------------------------
java.lang.NullPointerException
	at lib.worker.WorkerDispatcher.hasNext(WorkerDispatcher.java:60)
	at lib.worker.WorkerDispatcher.run(WorkerDispatcher.java:64)
	at lib.util.AbstractTool.run(AbstractTool.java:60)
	at jacusa.JACUSA.main(JACUSA.java:96)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)

Then I replaced the RNA-seq BAM header using the one from the DNA-seq BAM, but the output was empty. Here is log info:

Ignoring SAM validation error: ERROR: Record 1, Read name ERR188023.16833774, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 2, Read name ERR188023.9783723, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 3, Read name ERR188023.20088878, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.16833774, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 4, Read name ERR188023.22658761, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.9783723, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 5, Read name ERR188023.9065204, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.20088878, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 6, Read name ERR188023.20059829, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.22658761, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 7, Read name ERR188023.16299520, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.9065204, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 8, Read name ERR188023.20056934, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.20059829, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 9, Read name ERR188023.14789090, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.16299520, RG ID on SAMRecord not found in header: rg1
...

The RNA-seq BAM I used was HG00315.star.GRCh38.Aligned.sortedByCoord.markduplicates.bam, and the DNA-seq BAM I used was HG00315.final.cram (I used samtools to convert CRAM to BAM).

I wonder if there is any workaround for this situation. Thanks a lot for your help.

Bests,
Yiwei

@piechottam
Copy link
Contributor

Hi, I know this problem very well - unfortunately there is no way around comparing the sequence directories of BAM files.

Like you suggested, I would replace the sequence header AND remove scaffolds that are NOT part of the new sequence directory.

It seems that by doing so, you replaced the entire header INCLUDING Read Group IDs.
You will have to merge the existing Read Group IDs and new sequence header manually...

$ samtools view -H <DNA-BAM> | grep "^@RG" > read_ids_header.sam
$ samtools view -H <RNA-BAM> | grep "^@SQ" > seq_ids_header.sam
...

@YiweiNiu
Copy link
Author

YiweiNiu commented Sep 6, 2022

Hi, thank you for your quick reply.

I did as you said, and JACUSA2 now ran successfully without any errors.

However, the output was empty. I wonder whether it is because I modified the header of the RNA-seq bam file, but I failed to find any error messages in the log file.

Here is the command I used.

input_DNA=/home2/niuyw/project/MEI2/gatkout/${sample}/${sample}.final.bam
input_RNA=$WORKDIR/STAR/$dataset/${sample}.sortedByCoord.q255.rmDup.bam
out_dir=$WORKDIR/JACUSA2/$dataset/$sample

if [ ! -d $out_dir ]; then mkdir -p $out_dir; fi

# reheader input RNA
$path2samtools reheader $WORKDIR/JACUSA2/$dataset/header_new.sam $input_RNA > $out_dir/tmp.bam
$path2samtools index -@ $PPN $out_dir/tmp.bam

# JACUSA2
$path2java -jar $path2JACUSA2 call-2 -p $PPN -a B,D,H:condition=1,I,S,Y -s -f B -q 35 -r $out_dir/${sample}.txt -F1 1024 $input_DNA $out_dir/tmp.bam

gzip $out_dir/${sample}.txt.filtered

Both HG00315.txt and HG00315.txt.filtered.gz were empty.

I also tried to call RNA editing sites using RNA-seq data alone (call-1 mode), and the output seems normal.

Any clues on this? Thanks again for your help.

@piechottam
Copy link
Contributor

  1. Check Chrs.
    Please check if all your BAMs files are referring to the same chromosomes, e.g.:
    1, 2, 3, ... etc.
    OR
    chr1, chr2, ... etc.

  2. Check DNA sample
    Run call-1 on DNA sample - are there any SNPs called? If not, that is strange!

  3. Try less stringent
    Run: $ $path2java -jar $path2JACUSA2 call-2 -p $PPN -a B,D -r $out_dir/${sample}.txt $input_DNA $out_dir/tmp.bam

@YiweiNiu
Copy link
Author

YiweiNiu commented Sep 7, 2022

Hi, thanks a lot for your reply.

  1. Please check if all your BAMs files are referring to the same chromosomes, e.g.:

Yes, both RNA-seq BAM and DNA-seq BAM used the "chr1, chr2..." names. As I mentioned above, they just differ for some alternative scaffolds and decoy sequences.

  1. Run call-1 on DNA sample - are there any SNPs called? If not, that is strange!

I tried this with the command line:

# input
input=/home2/niuyw/project/MEI2/gatkout/${sample}/${sample}.final.bam
out_dir=$WORKDIR/JACUSA2/$dataset/$sample/dna

if [ ! -d $out_dir ]; then mkdir -p $out_dir; fi
$path2java -jar $path2JACUSA2 call-1 -p $PPN -a B,D,I,S,Y -s -f B -q 35 -r $out_dir/${sample}.txt $input

The output was empty, both for the HG00315.txt and HG00315.txt.filtered. That is strange.

  1. Try less stringent

I separated the feature-filtered results in another file using the -s parameter. Since both the output HG00315.txt and feature-filtered file HG00315.txt.filtered were empty, I don't think it was because of stringent filtering.

@piechottam
Copy link
Contributor

Please run: $path2java -jar $path2JACUSA2 pileup -p $PPN -a B,D,I,S,Y -q 35 -r $out_dir/${sample}.txt $input_DNA
This should generate a "samtools like pileup" - it the result file is empty, something is wrong with your DNA.bam.

What mapper did you use for mapping DNA reads? JACUSA2 filter reads with mapping mapping quality < 20. "You might need to lower the mapping quality filter "-m1".

@YiweiNiu
Copy link
Author

YiweiNiu commented Sep 7, 2022

Hi, this data was from the 1KGP project, and they mapped the reads using BWA-MEM.

How do you suggest setting the m1 parameter?

@piechottam
Copy link
Contributor

Do you see reads if you load them up in the IGV?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants