Home
babessell1 edited this page Mar 11, 2022
This is a Snakemake workflow that aims to do the following, using as much parallelization as possible:
- Read a CSV file containing SRR codes, target output names, library layouts, and library preparation methods
- Generate genome files using STAR
- Download each SRR code in parallel using prefetch
- Unpack the `.sra` files using parallel-fastq-dump, generating `.fastq.gz` files
- Optionally trim the resulting `.fastq.gz` files (using Trim Galore)
- Perform FastQC on the parallel-fastq-dump files, and optionally on the resulting trimmed files
- Align the files from parallel-fastq-dump (or the trimmed files) to the generated genome files using STAR
- Optionally get RNAseqMetrics using Picard
- Optionally get insert sizes using Picard
- Optionally get fragment sizes using RSeQC
- Perform MultiQC, using the files from parallel-fastq-dump, FastQC, and STAR aligner
- Organize a MADRID_inputs file that can be directly interfaced with https://github.com/HelikarLab/MADRID to aid with metabolic drug discovery and repurposing.
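The CSV input described in the list above could be read along these lines. This is only a minimal sketch: the column order (SRR code, output name, library layout, preparation method), the absence of a header row, the `Sample` and `read_samples` names, and the example values are all assumptions made for illustration, not the workflow's actual format.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Sample:
    srr: str      # SRA run accession, e.g. "SRR0000001" (made-up value)
    name: str     # target output name
    layout: str   # library layout (assumed labels, e.g. "PE" or "SE")
    prep: str     # library preparation method (assumed labels)


def read_samples(handle):
    """Parse rows assumed to hold: SRR code, output name, layout, prep method."""
    samples = []
    for row in csv.reader(handle):
        # Skip blank lines so a trailing newline does not break parsing.
        if not row or not any(cell.strip() for cell in row):
            continue
        srr, name, layout, prep = (cell.strip() for cell in row[:4])
        samples.append(Sample(srr, name, layout, prep))
    return samples


# Hypothetical control file contents, used here instead of a real file.
example = io.StringIO(
    "SRR0000001,tissueA_S1R1,PE,total\n"
    "SRR0000002,tissueA_S1R2,SE,mrna\n"
)
samples = read_samples(example)
for s in samples:
    print(s.srr, "->", s.name, s.layout, s.prep)
```

A list like `samples` is the kind of structure a Snakemake workflow can expand into per-sample targets, so that each SRR code is downloaded and processed as an independent job.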
This pipeline is primarily designed to interface the GEO database with MADRID. It should be run on a high-performance computing cluster, because STAR has a high memory requirement (about 40 GB for the human genome). Even if you do not plan to use MADRID, this pipeline could be of service if your goal is to align FASTQ files from bulk RNA-seq, perform essential quality control, and output gene counts files from STAR for downstream uses such as transcription-based model construction or differential gene expression analysis.
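As a rough illustration of consuming the gene counts output downstream, the sketch below parses a table in the layout STAR writes in its GeneCounts quantification mode: four tab-separated columns (gene ID, unstranded counts, first-strand counts, second-strand counts), preceded by four `N_` summary rows. Whether this workflow's output files follow exactly that layout is an assumption, and the function name and example values are hypothetical.

```python
import io


def read_star_gene_counts(handle, column=1):
    """Return (summary, counts) from a STAR GeneCounts-style table.

    column: 1 = unstranded, 2 = first-strand, 3 = second-strand counts.
    """
    summary, counts = {}, {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # ignore blank or malformed lines
        key, value = fields[0], int(fields[column])
        if key.startswith("N_"):
            summary[key] = value   # e.g. N_unmapped, N_noFeature
        else:
            counts[key] = value    # per-gene read count
    return summary, counts


# Made-up example data standing in for a real counts file.
example = io.StringIO(
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t50\t50\t50\n"
    "N_noFeature\t30\t400\t35\n"
    "N_ambiguous\t5\t1\t2\n"
    "GENE_A\t120\t2\t118\n"
    "GENE_B\t0\t0\t0\n"
)
summary, counts = read_star_gene_counts(example)
expressed = {gene: n for gene, n in counts.items() if n > 0}
print(summary["N_unmapped"], expressed)
```

Which count column is appropriate depends on the strandedness of the library preparation, which is one reason the preparation method is worth tracking per sample.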
Created by Josh Loecker and Brandt Bessell