This pipeline processes bulk RNA-seq data starting from fastq files, bam files or raw count matrices. It implements the most common aligners and read-counting programs as well as programs for differential gene expression. In addition, tools for generating QC reports, trimming reads and downloading SRA archives are included. Simple custom workflows can also be defined.
You will need to copy or link fastq files into the `fastq` directory and provide a `metadata.tsv` file with several required columns (see the Input section). Next, edit the `config.yaml` file to reflect your desired output and analysis parameters; see the parameter section of this readme for a detailed description. The `config_extra.yaml` file contains extra parameters that are passed directly to individual programs (aligners, read counters etc.).
Run the pipeline by navigating to the folder containing the `Snakefile` and executing the `snakemake` command. You can optionally perform a dry-run to check that the pipeline was correctly configured by running `snakemake -n`. Please see the Snakemake documentation for a detailed introduction on how to run the program.
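For example, from the directory containing the `Snakefile` (the `--cores` value below is only an illustration, and the flag is required only on newer Snakemake versions):

```bash
# optional dry-run: check that inputs and configuration are picked up correctly
snakemake -n

# run the pipeline; adjust the number of cores to your machine
snakemake --cores 8
```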
- Fastq files with sequenced reads
- `metadata.tsv` with several required columns (an example sketch follows this list):
  - `sample` - sample name that represents one or more fastq files
  - `fq` - name of the fastq file, can be compressed
  - `lane` - lane on which the sample was run; multiple lanes are allowed per sample
  - `read` - '_R1' or '_R2' string at the end of the file basename in case the reads are paired
  - `stranded` - indicates whether a stranded or unstranded protocol was used when generating the libraries. Values 'yes', 'no' and 'reverse' are accepted
  - `group` - experimental group of the sample

The pipeline will merge files that are spread across multiple lanes and pair them correctly if necessary.
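A minimal sketch of `metadata.tsv` for one paired-end sample sequenced on two lanes might look like this (file names and the group label are invented for illustration; columns are tab-separated in the real file):

```
sample   fq                         lane  read  stranded  group
sampleA  sampleA_L001_R1.fastq.gz   1     _R1   reverse   ctrl
sampleA  sampleA_L001_R2.fastq.gz   1     _R2   reverse   ctrl
sampleA  sampleA_L002_R1.fastq.gz   2     _R1   reverse   ctrl
sampleA  sampleA_L002_R2.fastq.gz   2     _R2   reverse   ctrl
```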
When the run starts from fastq files, several output directories are created:

- `align` - contains aligned bam files
- `count` - contains read counts
- `diffexp` - contains data with differential expression divided into sub-analyses
- `logs` - contains logs for all run processes as well as quality control reports
Note: Because differential expression analysis usually needs more tailoring based on the experiment data, it is possible to add additional analyses without re-running alignment and read counting. It is sufficient to change the `diffexp: outdir` parameter in the `config.yaml` file and adjust the DEG analysis parameters such as FDR or groups; the results will be saved to their own subfolder under `diffexp`. The parameters of the analysis are stored alongside the results.
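As an illustration, a second differential-expression analysis could be configured roughly like this (the values and the exact key nesting are assumptions; check your own `config.yaml` for the real layout):

```yaml
diffexp:
  outdir: "deseq_fdr01"   # new subfolder under diffexp/ for this analysis
  fdr: 0.1                # relaxed FDR cutoff, purely illustrative
  design: "~ group"       # design based on a metadata column
```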
Pipelines are defined in the `snakemake/rules/pipelines.smk` file. New pipelines can be constructed from existing rules as long as the output:input files of subsequent rules match. Differential gene expression rules have to be defined last. Skipping a step can be achieved by specifying `skip_<process>.smk` in the pipeline definition (see the sketch after the list of pipelines below).
Currently, the following pipeline names are accepted, each followed by the rules it applies:

- `download_only` - skip_align, skip_count, skip_diffexp
- `star_only` - star, skip_count, skip_diffexp
- `hisat_only` - hisat, skip_count, skip_diffexp
- `tophat_only` - tophat, skip_count, skip_diffexp
- `bowtie2_only` - bowtie2, skip_count, skip_diffexp
- `salmon_only` - salmon, skip_count, skip_diffexp
- `kallisto_only` - kallisto, skip_count, skip_diffexp
- `featurecounts_only` - skip_align, featurecounts, skip_diffexp
- `htseq_only` - skip_align, htseq, skip_diffexp
- `stringtie_only` - skip_align, stringtie, skip_diffexp
- `stringtie_expression_only` - skip_align, stringtie_expression, skip_diffexp
- `cufflinks_only` - skip_align, cufflinks, skip_diffexp
- `deseq_only` - skip_align, featurecounts, deseq
- `deseq` - star, featurecounts, deseq
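The sketch below shows one way such a mapping could look inside `pipelines.smk`. It is purely illustrative, assuming pipelines are stored as ordered lists of rule files; the real file may use a different structure.

```python
# Hypothetical sketch only - not the actual contents of pipelines.smk.
# Each pipeline name maps to an ordered list of rule files.
pipelines = {
    "deseq":       ["star.smk", "featurecounts.smk", "deseq.smk"],
    "salmon_only": ["salmon.smk", "skip_count.smk", "skip_diffexp.smk"],
    # A custom pipeline can reuse existing rules, as long as each rule's outputs
    # match the next rule's inputs and the differential expression rule comes last.
    "hisat_deseq": ["hisat.smk", "featurecounts.smk", "deseq.smk"],
}
```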
- `experiment_name` - name of the experiment (no whitespace)
- `metadata` - name of the file where the experiment metadata is specified
- `threads` - maximum number of threads the pipeline is allowed to use
- `species` - source species of the experiment. This is mainly used during conversion of gene IDs; human, mouse and rat are currently supported
- `pipeline` - name of the pipeline to use for the analysis. Can be any pipeline specified in the `snakemake/rules/pipelines.smk` file
- `index` - location of the genome index for the selected aligner
- `gtf` - location of the gtf file with features to be counted; chromosome naming and locations have to be compatible with the index used
- `fasta` - fasta file used to generate the index. This is required for some aligners such as bowtie2
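For orientation, these general parameters might look roughly like the sketch below in `config.yaml`; the flat key layout, paths and values are assumptions, so follow the shipped `config.yaml` for the real structure.

```yaml
experiment_name: "my_experiment"    # no whitespace
metadata: "metadata.tsv"
threads: 8
species: "human"                    # human, mouse or rat
pipeline: "deseq"                   # any name defined in snakemake/rules/pipelines.smk
index: "/path/to/star_index"        # index matching the chosen aligner
gtf: "/path/to/annotation.gtf"
fasta: "/path/to/genome.fa"         # needed by some aligners, e.g. bowtie2
```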
- `adapters_single` - see cutadapt adapter types for details
- `adapters_paired` - see cutadapt adapter types for details
- `quality` - see cutadapt quality trimming for details
- `extra` - extra parameters passed to cutadapt
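A hedged example of how these trimming settings could be filled in; the key nesting, the use of cutadapt flag syntax in the values and the quality cutoff are all assumptions, and the adapter sequence shown is just the common Illumina TruSeq prefix.

```yaml
trim:
  adapters_single: "-a AGATCGGAAGAGC"                    # cutadapt adapter option for single-end reads (illustrative)
  adapters_paired: "-a AGATCGGAAGAGC -A AGATCGGAAGAGC"   # adapters for both mates (illustrative)
  quality: 20                                            # quality cutoff, see cutadapt quality trimming
  extra: "--minimum-length 25"                           # any extra cutadapt parameters
```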
- `calulate` - whether coverage should be calculated; accepts the string "yes" or "no". This solution is a bit cumbersome, but works for now
- `split_strands` -
- `bin_size` -
- `normalize_using` - method for normalizing reads
- `extra` - extra parameters for the coverage calculations
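A possible coverage block with placeholder values; the keys simply mirror the list above, and the specific values (bin width, normalization method, strand setting) are assumptions rather than documented defaults.

```yaml
coverage:
  calulate: "yes"          # "yes" or "no" string, as described above
  split_strands: "no"      # placeholder value
  bin_size: 10             # illustrative bin width in bp
  normalize_using: "CPM"   # illustrative normalization method
  extra: ""
```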
- `multimap` - whether multimapping reads should be counted
- `overlap` - whether reads belonging to overlapping genes should be counted. If more granular control over how overlaps are counted is needed, it has to be specified in `config_extra.yaml`
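A minimal sketch of the counting options; the nesting and the yes/no string values are assumptions.

```yaml
count:
  multimap: "no"    # count multimapping reads?
  overlap: "yes"    # count reads that overlap more than one gene?
```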
- `outdir` - name of the analysis and final output directory. If multiple differential expression analyses need to be conducted without repeating the alignment and read counting, it is sufficient to change the name of this directory along with the analysis parameters and the results will be added to the `diffexp` directory
- `design` - specify the experimental design. The default value for one design variable is 'group'. If only one grouping variable is present, specify just the name of the column in the `metadata.tsv` file. If you wish to use a formula combining multiple groups, mark the beginning with ~ and use a combination of columns present in the `metadata.tsv` file
- `reference_levels` - ordering of the reference levels in the `design` column from which contrasts will be constructed. If only one level is specified, it is taken as the baseline. If empty, alphabetical order is used
- `contrast_type` - specify which type of comparisons to include; can be one of:
  - type 'A' - list of columns in the metadata file; all combinations of contrasts will be generated, e.g. ["condition", "gender"]
  - type 'B' - named lists of three items: the name of the grouping column and two reference levels, with the baseline specified second. Names have to be unique, e.g. treatment: ["condition", "treat", "ctrl"]
  - type 'C' - lists of contrasts to include; have to correspond to the resultsNames of the dds object. Names have to be unique (DESeq2 only), e.g. comparison1: ['condition_Trt_vs_Ctrl', 'genotypeMU.conditionTrt']
  - type 'C' - numeric lists of contrasts to include and their weights; have to correspond to the resultsNames of the dds object. Names have to be unique, e.g. comparison1: [-1, 1, 0, 0, 1]
  - type 'D' - full and reduced model formula in a list (sleuth only)
  - type 'E' - link to a file with a design matrix
- `contrasts` - list of the actual contrasts, using the naming conventions specified above (see the sketch after this list)
- `fdr` - FDR cutoff to consider genes differentially expressed
- `gene_min_readcount` - filter out genes with a smaller total number of reads across all samples
- `input_gene_ids` - type of gene IDs present in the input file (generally the read counts). Can be any type supported by org.db, usually "ENSEMBL", "ENTREZ", "SYMBOL"
- `output_gene_ids` - type of gene IDs to use in the output files
- `drop_no_sybmol` - drop genes from the output that don't map to any gene symbol; otherwise they will be replaced by the gene ID
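Putting a few of these together, a differential expression block might be sketched as follows. The nesting under `diffexp:` and the type 'B' contrast shown are assumptions based on the examples above, not a verified configuration.

```yaml
diffexp:
  outdir: "deseq_group"        # results land in their own subfolder of diffexp/
  design: "group"              # single grouping column from metadata.tsv
  reference_levels: ["ctrl"]   # a single level is taken as the baseline
  contrast_type: "B"
  contrasts:
    treatment: ["group", "treat", "ctrl"]   # grouping column, tested level, baseline
  fdr: 0.05
  gene_min_readcount: 10
  input_gene_ids: "ENSEMBL"
  output_gene_ids: "SYMBOL"
  drop_no_sybmol: "yes"        # key name as listed above
```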
Contains specific extra configuration parameters, such as additional command-line options for the aligners, read-counting tools etc.
A rule that downloads files from the SRA archive is available. In order to download SRR files, the `metadata.tsv` file has to be constructed as follows. An `sra` column containing the SRR accession id has to be added. In case the SRR file contains paired reads, both resulting fastq files have to be specified with the same SRR id and 'R1' and 'R2' in the `read` column. SRA archive files are usually not split across lanes, so it is sufficient to specify '1' in the `lane` column. The column containing the fastq file names has to match the names of the files produced by the `fasterq-dump` tool. I haven't found a way to specify the read suffixes, so the resulting file name has to have the format `{srr_id}_{read}.fastq.gz` for paired reads and `{srr_id}.fastq.gz` for unpaired reads. Fastq files are automatically gzipped, using either `gzip` or the multithreaded `pigz` if available on the system.
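For illustration, a paired-end SRA sample could be described like this (the accession and group label are invented, columns are tab-separated in the real file, and the file names follow the `{srr_id}_{read}.fastq.gz` pattern described above):

```
sample   fq                       lane  read  stranded  group  sra
sampleA  SRR0000001_R1.fastq.gz   1     R1    no        ctrl   SRR0000001
sampleA  SRR0000001_R2.fastq.gz   1     R2    no        ctrl   SRR0000001
```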