This workflow was used for RNAseq data processing in the following studies:
- Gorniak et al., (2023) Different lanthanide elements induce strong gene expression changes in a lanthanide-accumulating methylotroph. Microbiol Spectr Nov 1:e0086723
- Wegner et al., (2021) Extracellular and Intracellular Lanthanide Accumulation in the Methylotroph Beijerinckiaceae Bacterium RH AL1. Appl Environ Microbiol 87:e03144-20
- Wegner et al., (2020) Lanthanide-Dependent Methylotrophs of the Family Beijerinckiaceae: Physiological and Genomic Insights. Appl Environ Microbiol 86:e01830-19
The workflow is based on the following tools:
fastQC
bbmap
part of the BBtools suiteSortMeRNA
samtools
featureCounts
part of the Subread package
The separate installation of the tools is not necessary, they are installed 'on the fly' (see Usage below).
Snakemake
should be installed as outlined in its documentation for instance using conda
/mamba
. It is recommended to create a dedicated conda
environment for Snakemake.
Paired-end sequencing data (stored in data/
) is first subjected to quality-control and adapter-trimming using bbduk
. Quality reports are written using fastQC
before and after trimming. mRNA-derived read pairs are identified by querying QCed reads against SSU and LSU databases using SortMeRNA
. Next, mRNA-derived sequenced are mapped onto a reference genome with bbmap
. Readcounts per feature (=CDS) are determined through featureCounts
. The below DAG graph outlines the different processes of the workflow.
Start by cloning the repository and move into respective directory.
git clone https://github.com/wegnerce/smk_rnaseq.git
cd smk_rnaseq
SortMeRNA
relies on sequence databases, the used SSU and LSU databases and the corresponding indices can be downloaded from zenodo
.
cd resources
wget https://zenodo.org/record/8073631/files/smk_rnaseq_sortmerna_DBs.zip
unzip smk_rnaseq_sortmerba_DBs.zip
cd ..
Paired sequence data (R{1,2}.fastq.gz) should be stored under data/
. The repository contains two pairs of exemplary files (La.1_R1.fastq.gz + La.1_R2.fastq.gz & Nd.1_R1.fastq.gz + Nd.1_R2.fastq.gz).
config/
contains, besides from the configuration of the workflow (config/config.yaml
), a tab-separated table samples.tsv
, which contains a list of all datasets, one per line. The workflow expects *.fastq.gz
files and R{1,2}
as prefixes for forward and reverse read files.
From the root directory of the workflow, processing the data can then be started.
# --use-conda makes sure that needed tools are installed based
# on the requirements specified in the respective *.yaml in /envs
snakemake --use-conda
The directory structure of the workflow is shown below:
├── config
│ ├── config.yaml
│ └── samples.tsv
├── data
│ ├── La.1_R1.fastq.gz
│ ├── La.1_R2.fastq.gz
│ ├── Nd.1_R1.fastq.gz
│ └── Nd.1_R2.fastq.gz
├── environment.yaml
├── LICENSE
├── README.md
├── resources
│ ├── adapters.fa
│ ├── RHAL1_chromosome_plasmid.fa
│ └── RHAL1_chromosome_plasmid.saf
├── smk_rnaseq.svg
└── workflow
├── envs
│ ├── bbmap.yaml
│ ├── fastqc.yaml
│ ├── featurecounts.yaml
│ └── sortmerna.yaml
├── rules
│ ├── featurecounts.smk
│ ├── mapping.smk
│ ├── qc.smk
│ └── sortmerna.smk
└── Snakefile
Output from the different steps of the workflow are stored in /results
and /logs
.
The readcount table generated by featureCounts
(results/04_COUNTS/readcounts_featureCounts.txt
) can be used for downstream differential gene expression analysis.
©️ Carl-Eric Wegner, 2023