Skip to content

Snakemake workflow for the processing of RNAseq data

License

GPL-3.0, Unknown licenses found

Licenses found

GPL-3.0
LICENSE
Unknown
LICENSE.md
Notifications You must be signed in to change notification settings

wegnerce/smk_rnaseq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GitHub tag License Made with Python DOI SnakeMake

smk_rnaseq - A Snakemake-based workflow for RNAseq data processing

This workflow was used for RNAseq data processing in the following studies:

❗ Needed/used software

The workflow is based on the following tools:

The separate installation of the tools is not necessary, they are installed 'on the fly' (see Usage below).

Snakemake should be installed as outlined in its documentation for instance using conda/mamba. It is recommended to create a dedicated conda environment for Snakemake.

📘 Description of the workflow

Paired-end sequencing data (stored in data/) is first subjected to quality-control and adapter-trimming using bbduk. Quality reports are written using fastQC before and after trimming. mRNA-derived read pairs are identified by querying QCed reads against SSU and LSU databases using SortMeRNA. Next, mRNA-derived sequenced are mapped onto a reference genome with bbmap. Readcounts per feature (=CDS) are determined through featureCounts. The below DAG graph outlines the different processes of the workflow.

DAG of smk_rnaseq.

🔨 Usage

Start by cloning the repository and move into respective directory.

git clone https://github.com/wegnerce/smk_rnaseq.git
cd smk_rnaseq

SortMeRNA relies on sequence databases, the used SSU and LSU databases and the corresponding indices can be downloaded from zenodo.

cd resources
wget https://zenodo.org/record/8073631/files/smk_rnaseq_sortmerna_DBs.zip
unzip smk_rnaseq_sortmerba_DBs.zip
cd ..

Paired sequence data (R{1,2}.fastq.gz) should be stored under data/. The repository contains two pairs of exemplary files (La.1_R1.fastq.gz + La.1_R2.fastq.gz & Nd.1_R1.fastq.gz + Nd.1_R2.fastq.gz).

config/ contains, besides from the configuration of the workflow (config/config.yaml), a tab-separated table samples.tsv, which contains a list of all datasets, one per line. The workflow expects *.fastq.gzfiles and R{1,2} as prefixes for forward and reverse read files.

From the root directory of the workflow, processing the data can then be started.

# --use-conda makes sure that needed tools are installed based
# on the requirements specified in the respective *.yaml in /envs
snakemake  --use-conda

The directory structure of the workflow is shown below:

├── config
│   ├── config.yaml 
│   └── samples.tsv
├── data
│   ├── La.1_R1.fastq.gz
│   ├── La.1_R2.fastq.gz
│   ├── Nd.1_R1.fastq.gz
│   └── Nd.1_R2.fastq.gz
├── environment.yaml
├── LICENSE
├── README.md
├── resources
│   ├── adapters.fa
│   ├── RHAL1_chromosome_plasmid.fa
│   └── RHAL1_chromosome_plasmid.saf
├── smk_rnaseq.svg
└── workflow
    ├── envs
    │   ├── bbmap.yaml
    │   ├── fastqc.yaml
    │   ├── featurecounts.yaml
    │   └── sortmerna.yaml
    ├── rules
    │   ├── featurecounts.smk
    │   ├── mapping.smk
    │   ├── qc.smk
    │   └── sortmerna.smk
    └── Snakefile

Output from the different steps of the workflow are stored in /results and /logs.

The readcount table generated by featureCounts (results/04_COUNTS/readcounts_featureCounts.txt) can be used for downstream differential gene expression analysis.

©️ Carl-Eric Wegner, 2023

About

Snakemake workflow for the processing of RNAseq data

Topics

Resources

License

GPL-3.0, Unknown licenses found

Licenses found

GPL-3.0
LICENSE
Unknown
LICENSE.md

Stars

Watchers

Forks

Packages

No packages published

Languages