babessell1 edited this page Mar 11, 2022 · 15 revisions

Welcome to the FastqToGeneCounts wiki!

This is a Snakemake workflow that performs the following steps, using as much parallelization as possible:

  1. Read a CSV file that lists, for each sample: SRR code, target output name, library layout, and library preparation method
  2. Generate genome files using STAR
  3. Download each SRR code in parallel using prefetch
  4. Unpack the .sra files using parallel-fastq-dump, generating .fastq.gz files
  5. Optionally trim the resulting .fastq.gz files (using Trim Galore)
  6. Run FastQC on the .fastq.gz files from parallel-fastq-dump, and optionally on the trimmed files
  7. Align the .fastq.gz files from parallel-fastq-dump (or the trimmed files) to the generated genome files using STAR
  8. Optionally get RNAseqMetrics using Picard
  9. Optionally get insert sizes using Picard
  10. Optionally get fragment sizes using RSeQC
  11. Run MultiQC on the output of parallel-fastq-dump, FastQC, and the STAR aligner
  12. Organize a MADRID_inputs file that can be directly interfaced with https://github.com/HelikarLab/MADRID to aid with metabolic drug discovery and repurposing.
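
As a rough illustration of step 1, the sketch below reads a four-column CSV into one record per sample. The column order, the absence of a header row, the field names (`srr`, `name`, `layout`, `prep`), and the example values are all assumptions made for illustration; see the Setup section for the format the workflow actually expects.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # Hypothetical field names; the real control file may differ.
    srr: str     # SRR code to download
    name: str    # target output name
    layout: str  # library layout
    prep: str    # library preparation method


def read_samples(handle):
    """Parse rows of 'srr,name,layout,prep' into Sample records."""
    samples = []
    for line_no, row in enumerate(csv.reader(handle), start=1):
        # Skip blank lines so trailing newlines do not cause errors.
        if not row or all(not cell.strip() for cell in row):
            continue
        if len(row) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(row)}")
        srr, name, layout, prep = (cell.strip() for cell in row)
        if not srr.upper().startswith("SRR"):
            raise ValueError(f"line {line_no}: {srr!r} is not an SRR code")
        samples.append(Sample(srr, name, layout, prep))
    return samples


# Made-up example rows; the accession numbers and values are placeholders.
example = io.StringIO(
    "SRR0000001,sampleA,PE,total\n"
    "SRR0000002,sampleB,SE,mrna\n"
)
samples = read_samples(example)
for sample in samples:
    print(sample.srr, sample.name, sample.layout, sample.prep)
```

Each record then corresponds to one independent download, which is what allows the later steps to run per sample in parallel.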

This pipeline is primarily designed to interface the GEO database with MADRID. It should be run on a high-performance computing cluster, because STAR has a high memory requirement (about 40 GB for the human genome). Even if you do not plan to use MADRID, the pipeline can be useful if your goal is to align FASTQ files from bulk RNA-seq, perform essential quality control, and output STAR gene count files for transcription-based model construction or analyses such as differential gene expression analysis.

Sections

  1. Download
  2. Setup
  3. Running