The excretion of pathogens by humans during active infection allows the contemporaneous monitoring of outbreaks in wastewater. Wastewater includes liquids drained from toilets, showers, and dishwashers. Sewage is treated across 5 wastewater treatment plants in Metro Vancouver before becoming effluent. Samples are collected prior to treatment from the non-solid fraction, RNA is extracted, and genomic material is sequenced. This pipeline, WasteFlow, written in Nextflow aims to provide a standardized approach to further resolve presence or absence of SARS-CoV-2 in wastewater, into lineages infecting the population to inform precautionary measures and therapeutic strategies.
- Introduction
- Quick-Start Guide
- Dependencies
- Installation
- Input
- Output
- Parameters
- Workflow
- References
nextflow run BCCDC-PHL/WasteFlow -profile conda --data_dir /path/to/pe/fastq/files/ --ref ./resources/cov2_ref.fasta --primers ./resources/articV5.3.bed
Conda is required to build an environment with required workflow dependencies.
This bioinformatic pipeline requires Nextflow:
conda install -c bioconda nextflow
or download and add the nextflow executable to a location in your user $PATH variable:
curl -fsSL get.nextflow.io | bash
mv nextflow ~/bin/
Nextflow requires Java v8.0+, so check that it is installed:
java -version
The OS-independent conda environments activated upon running WasteFlow are specified in the
envs
folder of the project directory and is built when
-profile conda
is included in the command line. Nextflow will save
the environment to the ./work/conda directory by default. Alternatively, the
necessary conda environment can be saved to a different shared location
accesible to compute nodes by adding --conda_cache /path/to/new/location/
.
To copy the program into a directory of your choice, from desired directory run:
git clone https://github.com/j3551ca/WasteFlow.git
cd WasteFlow
nextflow run main.nf -profile conda --data_dir /path/to/input/data
alternatively:
nextflow run BCCDC-PHL/WasteFlow -profile conda --data_dir /path/to/pe/fastq/files/ --ref ./resources/cov2_ref.fasta --primers ./resources/articV5.3.bed
Run all modules from the WasteFlow directory after cloning repository:
nextflow run main.nf -profile conda --data_dir /path/to/input/data
Run all modules outside of WasteFlow
nextflow run j3551ca/WasteFLow -profile conda --data_dir /path/to/input/data
Run all modules + collate all *mutation_tables.csv, construct mutation measurements, and write cumulative file
nextflow run main.nf -profile conda --data_dir /path/to/input/data --rerun_mut "/path/to/dir/holding/*mutation_table.csv" --mut_dir /path/to/write/cumlative_file
The pipeline requires the following files:
Input | Parameter | Description | Notes |
---|---|---|---|
Reference genome | ref | Reference genome of pathogen of interest for guided read alignment | ./resources/cov2_ref.fasta |
Paired-end sequencing reads | dir | Paired-end sequencing reads. WasteFlow will accept *.fastq.gz, *.fq.gz, *.fastq, .fq [./data_dir/.fq]. | Ensure the absolute path of the directory containing data to be analyzed is used, otherwise MultiQC will throw an error. To merge replicates of the same sample (--combine_reps) the replicate number must be present in the filename as -#- |
Bed file | primers | Primer scheme bed file | ${projectDir}/resources/articV5.3.bed |
Results directory | out_dir | Path of directory to output results into |
Output | Description |
---|---|
freyja_lineage_summary.tsv | File containing Freyja output of relative abundances of pathogen variants in each sample |
*vcf.tsv | Variant Call Format file per sample produced during freyja variants command (iVar) |
*_depths.txt | Depth per position files per sample produced during freyja variants command (samtools) |
multiqc_report.html | QC of reads and alignments for each sample. |
_WasteFlow.log | Log of runtime information like Freyja barcode used, analysis start-time, directories, command executed. |
Parameter | Description | Required | Default |
---|---|---|---|
dir | User's directory that contains input paired-end sequence reads (fastq files). WasteFlow accepts gzip compressed or uncompressed files (*.fastq.gz, *.fq.gz, *.fastq, *.fq). Ensure the absolute path of the directory containing data to be analyzed is used, otherwise MultiQC will throw an error. | yes | none |
ref | Reference genome used to align reads to during guided assembly | yes | resources/cov2.fa |
out_dir | User-specified directory to output WasteFlow results to. | no | data_dir/results |
combine_reps | Include this option to combine multiple sequencing replicates into one file containing all forward reads and one file containing all reverse reads. To merge replicates of the same sample, the replicate number must be present in the filename as -#- (ex. prefix-2-suf_fix.fq). | no | off |
trim_galore | Include this switch to use trim-galore to trim adapters/ low qual bases | no | fastp |
bwa | Include this switch to use bwa-mem to align reads to reference | no | minimap2 |
conda_cache | User-defined location to save conda env | no | ./work/conda |
adapters | Additional adapters to include during trimming with fastp | no | ${projectDir}/resources/cov2_adapter.fasta |
primers | Bed file containing primer scheme for trimming with iVar | yes | ${projectDir}/resources/articV5.3.bed |
primerless_reads | iVar trim include reads that are outside of primer regions/ no primers (ie. Nextera) | no | off |
primer_pairs | iVar trim specify path to primer pairs *.tsv file when amplicons are fragmented (ie. Nextera) | no | none |
skip_trim | Skip primer trimming (ie. probe-enrichment) | no | off |
ivar_flags | Additional options to pass to iVar during primer/ quality trimming | no | -m 30 -q 20 -s 4 |
freebayes | Include this switch to use freebayes to call variants | no | Freyja::iVar |
freeb_flags | Additional options to pass freebayes during variant calling | no | -p 1 --pooled-continuous --min-coverage 5 |
demixdepth | The minimum read depth for a site to be considered in Freyja demix. Collapses indistinguishable lineages. | no | 10 |
boot | Activate Freyja bootstrap estimates of each lineage (*_lineages.csv) & WHO VOI/VOC (*_summarized.csv) | no | off |
bootnum | Number of bootstrap replicates to perform for lineage abundance estimations | no | 100 |
rerun_lins | Search string providing path to previously generated vcf and depth files (ex. "/path/to/*{.txt,.tsv}"). Reruns Freyja demix command which is the lineage classificaion step. Useful for re-analyzing past samples after Freyja barcode has been updated. | no | off |
rerun_mut | User-defined pathway/search string used to collect past mutation tables. Must be quoted. | no | none |
mut_dir | User-defined path to save cumulative mutation table | no | none |
annotate_snps | This will produce a CSV table of annotated mutations for each sample. Currently variants are called by Freebayes and annotated with SnpEff. NOTE: this behaviour (ie. vcf2table process) is automatically active if you are using --rerun_mut. | no | off |
dt_threads | Set number of threads used by data.table. This is relevant to the mutation-monitoring processes in R and prevents a segfault error #1 | no | 60 |
version | Current WasteFlow version number | no | none |
help | Help on usage of WasteFlow | no | none |
WasteFlow was designed to allow multiple workflows for various pathogens. For example, reads can be processed with trim-galore or fastp, alignments can be generated with bwa or minimap2, variants can be called with iVar or Freebayes, and lineage calls made by Freyja can be performed on one directory or multiple. The modularity of Nextflow facilitates the addition of alternate packages. WasteFlow has currently been tested on SARS-CoV-2. Common workflows are outlined below:
The default workflow:
nextflow run BCCDC-PHL/WasteFlow -r v2.0.1 --data_dir /mandatory/path/to/ww/fastqs
flowchart TB
subgraph "INPUT"
v0["*_{R{1,2}_001,R{1,2},{1,2}}*
*.{fastq,fq,fastq.gz,fq.gz}"]
v2["ref.fasta"]
end
subgraph TREATMENT
v3([clean_fastp])
v5([qc_reads])
v9([align_minimap2])
v10([primer_trim])
v12([qc_align])
v15([var_call_freebayes])
v16([annotate_snpeff])
v1(( ))
v4["*{R1,R2}.trim.fastq.gz,*failed_reads.txt,
*fastp.html,*fastp.json"]
v7["*{R1,R2}.trim.fastq.gz"]
v11["*{R1,R2}.trim.fastq.gz"]
v14(( ))
v19(( ))
end
subgraph "OUTPUT"
v6["multiqc_report.html"]
v13["*.{depths,covstats}"]
v17["*.ann.vcf"]
v25["freyja_lineage_summary.tsv"]
v27["*_WasteFlow.log"]
end
subgraph EFFLUENT
v20([var_call_freyja])
v22([lineage_freyja])
v24([summarize_freyja])
v26([barcode_version])
v21["*.{txt,tsv}"]
v23["*.{tsv,yml}"]
end
v0 --> v1
v2 --> v7
v2 --> v11
v2 --> v14
v2 --> v19
v1 --> v3
v3 --> v4
v3 --> v7
v4 --> v5
v5 --> v6
v7 --> v9
v9 --> v10
v10 --> v11
v10 --> v14
v10 --> v19
v11 --> v12
v12 --> v13
v14 --> v15
v15 --> v16
v16 --> v17
v19 --> v20
v20 --> v21
v21 --> v22
v22 --> v23
v23 --> v24
v24 --> v25
v26 --> v27
The alternative workflow:
nextflow run BCCDC-PHL/WasteFlow -r v2.0.1 --data_dir /mandatory/path/to/ww/fastqs/ --out_dir /alternative/results/folder/ --bwa --trim_galore --freebayes --combine_reps --primerless_reads --primer_pairs /path/to/tsv/of/primerPairs.tsv --boot --bootnum 10 --demixdepth 100 --rerun_mut "/path/to/past/mut_tables/*{.csv}" --mut_dir /mandatory/path/to/cumulative/mut_table/output --rerun_lins "/path/to/past/freyja_var/outputs/*{.txt,.tsv}"
Note the quotation marks around --rerun_mut and --rerun_lins paths.
flowchart TB
subgraph "INPUT "
v0["_R1_*.{fastq,fq,fastq.gz,fq.gz}"]
v4["_R2_*.{fastq,fq,fastq.gz,fq.gz}"]
v10["ref.fasta"]
v26["previous Freyja
lineage *.tsv
+
read depth *.txt files
(per sample)"]
v43["previous annotated
mutation *.csv files
(per sample)"]
end
subgraph INFLUENT
v9([merge_reps])
v1(( ))
end
subgraph TREATMENT
v11([clean_trim_galore])
v13([qc_reads])
v17([align_bwa])
v18([primer_trim])
v20([qc_align])
v23([var_call_freebayes])
v24([annotate_snpeff])
v12["*_val_{1,2}.fq.gz,*fastqc.zip"]
v15["*_val_{1,2}.fq.gz"]
v19["*.sort.{bam, bam.bai}"]
v22(( ))
v31(( ))
end
subgraph "OUTPUT"
v14["multiqc_report.html"]
v21["*.{depths,covstats}"]
v40["freyja_lineage_summary.tsv"]
v42["*_WasteFlow.log"]
end
subgraph EFFLUENT
v32([read_depths])
v37([lineage_freyja])
v39([summarize_freyja])
v41([barcode_version])
v44([vcf2table])
v47([collectTables])
v27["*_split.vcf, *_depths.txt
+
previous"]
v38(( ))
v45["new *.csv files
+
previous"]
end
v0 --> v1
v4 --> v1
v1 --> v9
v9 --> v11
v10 --> v15
v10 --> v19
v10 --> v22
v10 --> v31
v11 --> v12
v11 --> v15
v12 --> v13
v13 --> v14
v15 --> v17
v17 --> v18
v18 --> v19
v18 --> v22
v18 --> v31
v19 --> v20
v20 --> v21
v22 --> v23
v23 --> v24
v23 --> v27
v24 --> v44
v26 --> v27
v31 --> v32
v32 --> v27
v27 --> v37
v37 --> v38
v38 --> v39
v39 --> v40
v41 --> v42
v43 --> v45
v44 --> v45
v45 --> v47
BCCDC workflow:
nextflow run BCCDC-PHL/WasteFlow -r v2.0.1 --data_dir /mandatory/path/to/ww/fastqs/ --combine_reps --primerless_reads --primer_pairs /path/to/tsv/of/primerPairs.tsv --rerun_lins "/path/to/past/freyja_var/outputs/*{.txt,.tsv}" --rerun_mut "/path/to/past/annotated_mutation/outputs/*.csv"
--mut_dir /path/to/cumulative/annotated/mutation/table/output
flowchart TB
subgraph "INPUT"
v0["_R1_*.{fastq,fq,fastq.gz,fq.gz}"]
v4["_R2_*.{fastq,fq,fastq.gz,fq.gz}"]
v10["ref.fasta"]
v26["previous Freyja
lineage *.tsv
+
read depth *.txt files
(per sample)"]
v41["previous annotated
mutation *.csv files
(per sample)"]
end
subgraph INFLUENT
v9([merge_reps])
v1(( ))
end
subgraph TREATMENT
v11([clean_fastp])
v13([qc_reads])
v17([align_minimap2])
v18([primer_trim])
v20([qc_align])
v23([var_call_freebayes])
v24([annotate_snpeff])
v12["*{R1,R2}.trim.fastq.gz,*failed_reads.txt,
*fastp.html,*fastp.json"]
v15["*{R1,R2}.trim.fastq.gz"]
v19["*.sort.{bam, bam.bai}"]
v22(( ))
v31(( ))
end
subgraph "OUTPUT"
v14["multiqc_report.html"]
v21["*.{depths,covstats}"]
v38["freyja_lineage_summary.tsv"]
v40["*_WasteFlow.log"]
end
subgraph EFFLUENT
v32([var_call_freyja])
v35([lineage_freyja])
v37([summarize_freyja])
v39([barcode_version])
v42([vcf2table])
v45([collectTables])
v27["*.tsv, *_depths.txt
+
previous"]
v36(( ))
v43["new *.csv files
+
previous"]
end
v0 --> v1
v4 --> v1
v1 --> v9
v9 --> v11
v10 --> v15
v10 --> v19
v10 --> v22
v10 --> v31
v11 --> v12
v11 --> v15
v12 --> v13
v13 --> v14
v15 --> v17
v17 --> v18
v18 --> v19
v18 --> v22
v18 --> v31
v19 --> v20
v20 --> v21
v22 --> v23
v23 --> v24
v24 --> v42
v26 --> v27
v31 --> v32
v32 --> v27
v27 --> v35
v35 --> v36
v36 --> v37
v37 --> v38
v39 --> v40
v41 --> v43
v42 --> v43
v43 --> v45
- Karthikeyan, S., Levy, J.I., De Hoff, P. et al. Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission. Nature 609, 101–108 (2022). https://doi.org/10.1038/s41586-022-05049-6
- Heng Li, Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34, Issue 18, September 2018, Pages 3094–3100, https://doi.org/10.1093/bioinformatics/bty191
- Grubaugh, N.D., Gangavarapu, K., Quick, J. et al. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol 20, 8 (2019). https://doi.org/10.1186/s13059-018-1618-7
- Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu, fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560
- Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, 1000 Genome Project Data Processing Subgroup, The Sequence Alignment/Map format and SAMtools, Bioinformatics, Volume 25, Issue 16, August 2009, Pages 2078–2079, https://doi.org/10.1093/bioinformatics/btp352
- Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012
- A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.", Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672
- Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift", Cingolani, P., et. al., Frontiers in Genetics, 3, 2012.
- Philip Ewels, Måns Magnusson, Sverker Lundin, Max Käller, MultiQC: summarize analysis results for multiple tools and samples in a single report, Bioinformatics, Volume 32, Issue 19, October 2016, Pages 3047–3048, https://doi.org/10.1093/bioinformatics/btw354
- Andrews, S. (2010). FASTQC. A quality control tool for high throughput sequence data
- Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp. 10-12. doi:https://doi.org/10.14806/ej.17.1.200