Merge pull request #26 from fhdsl/add-ch3-input-json
Add a test JSON to chapter 3
caalo authored Feb 29, 2024
2 parents 8dc6a3d + cce1707 commit 6197482
Showing 1 changed file with 62 additions and 41 deletions.
103 changes: 62 additions & 41 deletions 03-first-task.Rmd
@@ -232,6 +232,10 @@ task BwaMem {
# basename() is a built-in WDL function that acts like bash's basename
String base_file_name = basename(input_fastq, ".fastq")
String ref_fasta_local = basename(ref_fasta)
String read_group_id = "ID:" + base_file_name
String sample_name = "SM:" + base_file_name
String platform = "illumina"
String platform_info = "PL:" + platform # Create the platform information
command <<<
set -eo pipefail
@@ -247,10 +251,10 @@ task BwaMem {
bwa mem \
-p -v 3 -t ~{threads} -M -R '@RG\tID:foo\tSM:foo2' \
"~{ref_fasta_local}" "~{input_fastq}" > "~{base_file_name}.sam"
samtools view -1bS -@ 15 -o "~{base_file_name}.aligned.bam" "~{base_file_name}.sam"
samtools sort -n -@ 15 -o "~{base_file_name}.sorted_query_aligned.bam" "~{base_file_name}.aligned.bam"
-p -v 3 -t ~{threads} -M -R '@RG\t~{read_group_id}\t~{sample_name}\t~{platform_info}' \
~{ref_fasta_local} ~{input_fastq} > ~{base_file_name}.sam
samtools view -1bS -@ 15 -o ~{base_file_name}.aligned.bam ~{base_file_name}.sam
samtools sort -@ 15 -o ~{base_file_name}.sorted_query_aligned.bam ~{base_file_name}.aligned.bam
>>>
}
@@ -263,7 +267,7 @@ The runtime attributes of a task tell the WDL executor important information abo
runtime {
memory: "48 GB"
cpu: 16
docker: "fredhutch/bwa:0.7.17"
docker: "ghcr.io/getwilds/bwa:0.7.17"
disks: "local-disk 100 SSD"
}
```
@@ -322,7 +326,6 @@ The outputs of a task are defined in the `output` section of your task. Typicall

```
output {
File analysisReadyBam = "~{base_file_name}.aligned.bam"
File analysisReadySorted = "~{base_file_name}.sorted_query_aligned.bam"
}
```
@@ -331,12 +334,11 @@ Another way of writing this is with string concatenation. This is equivalent to

```
output {
File analysisReadyBam = base_file_name + ".aligned.bam"
File analysisReadySorted = base_file_name + ".sorted_query_aligned.bam"
}
```

If the output was not in the working directory, we would need to change the output to point to the file's path relative to the working directory, such as `File analysisReadyBam = "some_folder/~{base_file_name}.aligned.bam"`.
If the output was not in the working directory, we would need to change the output to point to the file's path relative to the working directory, such as `File analysisReadySorted = "some_folder/~{base_file_name}.sorted_query_aligned.bam"`.

Below are some additional ways you can handle task outputs.

@@ -407,48 +409,47 @@ We've now designed a bwa mem task that can run on essentially any backend that s
task BwaMem {
input {
File input_fastq
File ref_fasta
File ref_fasta_index
File ref_dict
File ref_amb
File ref_ann
File ref_bwt
File ref_pac
File ref_sa
referenceGenome refGenome
Int threads = 16
}
String base_file_name = basename(input_fastq, ".fastq")
String ref_fasta_local = basename(ref_fasta)
String ref_fasta_local = basename(refGenome.ref_fasta)
String read_group_id = "ID:" + base_file_name
String sample_name = "SM:" + base_file_name
String platform = "illumina"
String platform_info = "PL:" + platform # Create the platform information
command <<<
set -eo pipefail
mv "~{ref_fasta}" .
mv "~{ref_fasta_index}" .
mv "~{ref_dict}" .
mv "~{ref_amb}" .
mv "~{ref_ann}" .
mv "~{ref_bwt}" .
mv "~{ref_pac}" .
mv "~{ref_sa}" .
#can we iterate through a struct??
mv ~{refGenome.ref_fasta} .
mv ~{refGenome.ref_fasta_index} .
mv ~{refGenome.ref_dict} .
mv ~{refGenome.ref_amb} .
mv ~{refGenome.ref_ann} .
mv ~{refGenome.ref_bwt} .
mv ~{refGenome.ref_pac} .
mv ~{refGenome.ref_sa} .
bwa mem \
-p -v 3 -t ~{threads} -M -R '@RG\tID:foo\tSM:foo2' \
"~{ref_fasta_local}" "~{input_fastq}" > "~{base_file_name}.sam"
samtools view -1bS -@ 15 -o "~{base_file_name}.aligned.bam" "~{base_file_name}.sam"
samtools sort -n -@ 15 -o "~{base_file_name}.sorted_query_aligned.bam" "~{base_file_name}.aligned.bam"
-p -v 3 -t ~{threads} -M -R '@RG\t~{read_group_id}\t~{sample_name}\t~{platform_info}' \
~{ref_fasta_local} ~{input_fastq} > ~{base_file_name}.sam
samtools view -1bS -@ 15 -o ~{base_file_name}.aligned.bam ~{base_file_name}.sam
samtools sort -@ 15 -o ~{base_file_name}.sorted_query_aligned.bam ~{base_file_name}.aligned.bam
>>>
output {
File analysisReadyBam = "~{base_file_name}.aligned.bam"
File analysisReadySorted = "~{base_file_name}.sorted_query_aligned.bam"
}
runtime {
memory: "48 GB"
cpu: 16
docker: "fredhutch/bwa:0.7.17"
disks: "local-disk 100 SSD"
docker: "ghcr.io/getwilds/bwa:0.7.17"
}
}
```
@@ -461,10 +462,10 @@ For the workflow to actually "see" the task, the task will either need to be imp
```
version 1.0
workflow minidata_test_alignment {
workflow mutation_calling {
input {
# Sample info
File sampleFastq
File tumorFastq
# Reference Genome information
File ref_fasta
File ref_fasta_index
@@ -481,7 +482,7 @@ workflow minidata_test_alignment {
# Map reads to reference
call BwaMem {
input:
input_fastq = sampleFastq,
input_fastq = tumorFastq,
ref_fasta = ref_fasta,
ref_fasta_index = ref_fasta_index,
ref_dict = ref_dict,
@@ -527,8 +528,10 @@ task BwaMem {
Int threads = 16
}
String base_file_name = basename(input_fastq, ".fastq")
String ref_fasta_local = basename(ref_fasta)
String read_group_id = "ID:" + base_file_name
String sample_name = "SM:" + base_file_name
String platform = "illumina"
String platform_info = "PL:" + platform # Create the platform information
command <<<
set -eo pipefail
@@ -543,7 +546,7 @@ task BwaMem {
mv "~{ref_sa}" .
bwa mem \
-p -v 3 -t ~{threads} -M -R '@RG\tID:foo\tSM:foo2' \
-p -v 3 -t ~{threads} -M -R '@RG\t~{read_group_id}\t~{sample_name}\t~{platform_info}' \
"~{ref_fasta_local}" "~{input_fastq}" > "~{base_file_name}.sam"
samtools view -1bS -@ 15 -o "~{base_file_name}.aligned.bam" "~{base_file_name}.sam"
samtools sort -n -@ 15 -o "~{base_file_name}.sorted_query_aligned.bam" "~{base_file_name}.aligned.bam"
@@ -556,12 +559,30 @@ task BwaMem {
runtime {
memory: "48 GB"
cpu: 16
docker: "fredhutch/bwa:0.7.17"
docker: "ghcr.io/getwilds/bwa:0.7.17"
disks: "local-disk 100 SSD"
}
}
```

## Testing your first task

To test your first task and your workflow, you should have an expectation of what the output is. For this first `BwaMem` task, we just care that the BAM file is created with aligned reads. You can use `samtools view output.sorted_query_aligned.bam` to examine the reads and pipe the result to `wc -l` to count the alignment records. This number should be almost identical to the number of reads in your input FASTQ file, which is the line count from `wc -l input.fastq` divided by four, since each FASTQ record spans four lines. In other tasks, we might have a more precise expectation of what the output file should be, such as containing the specific somatic mutation call that we have curated.
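
The read-count comparison described above can be sketched in shell. This is a minimal sketch, not part of the original chapter: the helper name `count_fastq_reads` is hypothetical, and it assumes an uncompressed FASTQ in which every record occupies exactly four lines (no wrapped sequence lines).

```shell
# Count reads in an uncompressed FASTQ file.
# Assumption: each record spans exactly 4 lines (header, sequence, "+", qualities).
count_fastq_reads() {
  local fastq="$1"
  local lines
  lines=$(wc -l < "$fastq")
  echo $(( lines / 4 ))
}

# Example usage (file names are placeholders):
#   count_fastq_reads input.fastq
# Count alignment records in the BAM for comparison; -c prints a count
# instead of the records themselves:
#   samtools view -c output.sorted_query_aligned.bam
# To exclude secondary and supplementary alignments, which can make the
# BAM count larger than the number of input reads:
#   samtools view -c -F 0x900 output.sorted_query_aligned.bam
```

The two numbers need not match exactly, because an aligner can emit more than one record for a single read; that is why the check is "almost identical" rather than strict equality.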

Here is an example JSON with the test data needed to run this single-task workflow:
```
{
"mutation_calling.tumorFastq": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/wdl_101/HCC4006_final.fastq",
"mutation_calling.ref_fasta": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta",
"mutation_calling.ref_fasta_index": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.fai",
"mutation_calling.ref_dict": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.dict",
"mutation_calling.ref_pac": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.pac",
"mutation_calling.ref_sa": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.sa",
"mutation_calling.ref_amb": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.amb",
"mutation_calling.ref_ann": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.ann",
"mutation_calling.ref_bwt": "/fh/fast/paguirigan_a/pub/ReferenceDataSets/genome_data/human/hg19/Homo_sapiens_assembly19.fasta.bwt",
"mutation_calling.ref_name": "hg19"
}
```

If you are not running on the Fred Hutch HPC, you'll need to modify your JSON file to point to wherever you have the data files stored. You can download the same FASTQ we're using from [our sandbox repo](https://github.com/fhdsl/WDL-sandbox/tree/main/data), and the reference files can be [downloaded from the Broad Institute's mirror](https://data.broadinstitute.org/snowman/hg19/) or generated from the reference FASTA: `samtools faidx` creates the `.fai` index, `bwa index` creates the `.amb`, `.ann`, `.bwt`, `.pac`, and `.sa` files, and `samtools dict` creates the `.dict` file.
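
Before pointing the JSON at your own copies of the data, it can help to confirm that every reference companion file is present next to the FASTA. The following is a rough sketch rather than part of the original chapter: the function name `missing_ref_files` is hypothetical, and it assumes the naming pattern seen in the example JSON above (index files named `<fasta>.<suffix>`, and the `.dict` file named after the FASTA with its extension replaced).

```shell
# Print each reference companion file that is missing next to a FASTA.
# Assumption: file naming follows the example JSON, i.e. <fasta>.fai,
# <fasta>.amb, <fasta>.ann, <fasta>.bwt, <fasta>.pac, <fasta>.sa, and a
# .dict file named after the FASTA without its final extension.
missing_ref_files() {
  local fasta="$1"
  local f
  for f in "${fasta}.fai" "${fasta}.amb" "${fasta}.ann" "${fasta}.bwt" \
           "${fasta}.pac" "${fasta}.sa" "${fasta%.*}.dict"; do
    [ -e "$f" ] || echo "$f"
  done
}

# Example usage (path is a placeholder):
#   missing_ref_files /path/to/Homo_sapiens_assembly19.fasta
```

If the function prints nothing, all of the files the workflow inputs refer to were found; any path it prints is one that still needs to be downloaded or generated.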
