Merge pull request #53 from fhdsl/formatting
formatting ch 2
caalo authored Apr 8, 2024
2 parents 86b5e9f + 01b3001 commit 01119b3
Showing 1 changed file with 21 additions and 16 deletions.
37 changes: 21 additions & 16 deletions 02-workflow-plan.Rmd
@@ -8,23 +8,21 @@ Our WDL guide will center around building a workflow from scratch. As we build o

## Somatic mutation calling workflow

The example workflow detects somatic mutations in two tumor samples. It takes as input FASTQ-formatted sequencing data from two tumor specimens and one normal sample (a single normal sample is used here, but typically each tumor might have its own associated normal). It then aligns the FASTQ files of each sample to the human reference genome (hg19), marks PCR duplicates, and performs base quality recalibration. Next, it calls somatic mutations in paired mode to identify mutations that are present in the tumor samples but absent from the normal one. Finally, it annotates the called mutations.

The workflow diagram:

[![Workflow Diagram](assets/WDL_101_Course.png)](assets/WDL_101_Course.png)


The tasks involved:

| Task | Function | Inputs | Outputs |
|------|----------|--------|---------|
| `BwaMem` | aligns the samples to the reference genome (hg19) | FASTQ (`.fastq`) file | BAM (`.bam`) file |
| `MarkDuplicates` | marks PCR duplicates | BAM (`.bam`) file | marked BAM (`.bam`) file |
| `ApplyBaseRecalibrator` | performs base quality recalibration | marked BAM (`.bam`) file | BAM (`.bam`) file |
| `Mutect2` | performs paired somatic mutation calling | BAM (`.bam`) file | VCF (`.vcf`) file |
| `annovar` | annotates the called somatic mutations | VCF (`.vcf`) file with somatic mutations | annotated VCF (`.vcf`) file |
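
To make the shape of the workflow concrete, here is a minimal Python sketch (not WDL, and not the guide's implementation) that lists every task invocation for two tumors and one shared normal. The task names come from the table; the `Task` class, the simplified format labels, the sample names, and the helper functions are assumptions made for illustration. The FASTQ input for `BwaMem` follows the prose description of the workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    input_format: str
    output_format: str


# Tasks that run once per sample (both tumors and the normal).
PER_SAMPLE_TASKS = [
    Task("BwaMem", "fastq", "bam"),
    Task("MarkDuplicates", "bam", "bam"),
    Task("ApplyBaseRecalibrator", "bam", "bam"),
]
# Tasks that run once per tumor; Mutect2 also reads the normal BAM.
PER_TUMOR_TASKS = [
    Task("Mutect2", "bam", "vcf"),
    Task("annovar", "vcf", "vcf"),
]


def check_chain(tasks):
    """Raise ValueError if a task's output format differs from the next task's input."""
    for upstream, downstream in zip(tasks, tasks[1:]):
        if upstream.output_format != downstream.input_format:
            raise ValueError(f"{upstream.name} -> {downstream.name}: format mismatch")


def plan_calls(tumors, normal):
    """List every task invocation as (task name, samples involved)."""
    check_chain(PER_SAMPLE_TASKS + PER_TUMOR_TASKS)
    calls = []
    for sample in [*tumors, normal]:
        for task in PER_SAMPLE_TASKS:
            calls.append((task.name, (sample,)))
    for tumor in tumors:
        calls.append(("Mutect2", (tumor, normal)))  # paired mode: tumor vs. normal
        calls.append(("annovar", (tumor,)))
    return calls


calls = plan_calls(["tumor_1", "tumor_2"], "normal")
print(len(calls))  # 3 samples x 3 tasks + 2 tumors x 2 tasks = 13
```

Separating per-sample tasks from per-tumor tasks mirrors the description above: alignment and cleanup happen for every sample, while calling and annotation happen once for each tumor.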

## Workflow testing strategy

@@ -44,21 +42,28 @@ Here are some guidelines for any form of testing:

- Ideally, the data you use for testing is open access so that others can also verify your workflow.

## Test samples

As an example, we use whole exome sequencing data from three cell lines from the [Cancer Cell Line Encyclopedia](https://pubmed.ncbi.nlm.nih.gov/31068700/).

### Tumor 1 : HCC4006

HCC4006 is a lung cancer cell line with a mutation in the proto-oncogene *EGFR* (epidermal growth factor receptor). Oncogenic mutations in *EGFR* cause abnormal constitutive activation of the EGFR signaling pathway and drive cancer. In this cell line, the *EGFR* mutation is an in-frame deletion in exon 19, which constitutively activates the EGFR protein.

### Tumor 2 : CALU1

CALU1 is a lung cancer cell line with a mutation in the gene *KRAS* (Kirsten rat sarcoma viral oncogene homolog). *KRAS* is also a proto-oncogene, and the most common cancer-causing mutations lock the protein in an active conformation. In this cell line, *KRAS* has a missense point mutation that substitutes cysteine (C) for glycine (G) at position 12 of the KRAS protein (commonly known as the KRAS G12C mutation). This mutation constitutively activates KRAS and drives carcinogenesis.

### Normal : MOLM13

MOLM13 is a human leukemia cell line commonly used in research. Although it is also a cancer cell line, for the purposes of this workflow example we treat it as a "normal". This cell line has no mutations in *EGFR* or *KRAS* and is therefore a practical surrogate for a conventional normal sample.
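
Because each tumor cell line carries a known driver mutation, the samples give the workflow an expected result to test against. The sketch below shows one way such a check might look in Python. The input format (a dictionary mapping each tumor to the set of genes with a called somatic mutation) and the function name are hypothetical; the guide does not prescribe how the annotated output is summarized.

```python
# Expected mutated gene per tumor, taken from the sample descriptions above.
EXPECTED_DRIVER = {"HCC4006": "EGFR", "CALU1": "KRAS"}


def check_calls(mutated_genes_by_tumor):
    """Return a list of problems; an empty list means every expected gene was called.

    `mutated_genes_by_tumor` is a hypothetical summary of the workflow output:
    a dict mapping each tumor name to the set of genes with a somatic call.
    """
    problems = []
    for tumor, gene in EXPECTED_DRIVER.items():
        called = mutated_genes_by_tumor.get(tumor, set())
        if gene not in called:
            problems.append(f"{tumor}: expected a somatic mutation in {gene}")
    return problems


# Usage: a passing summary and one where the KRAS call is missing.
passing = check_calls({"HCC4006": {"EGFR"}, "CALU1": {"KRAS"}})
failing = check_calls({"HCC4006": {"EGFR"}, "CALU1": set()})
print(passing)
print(failing)
```

The check only asserts that the expected gene appears for each tumor; it makes no claim about other genes, since the descriptions above do not say which additional mutations each cell line carries.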

### Test data details

FASTQ files for all three samples were derived from their respective whole exome sequencing data. However, for the purpose of this guide we have limited the sequencing reads to those spanning +/- 200 bp around the mutation sites in both genes. This shrinks the data files for quick testing.
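
As an illustration of this kind of subsetting, the following Python sketch keeps only reads that overlap a +/- 200 bp window around a site. It is one plausible reading of the description, not the procedure actually used to produce the test files. The `Read` class, the function names, and the coordinates are placeholders invented for the example; the coordinates are not the real mutation positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def in_window(read, chrom, site, flank=200):
    """True if the read overlaps [site - flank, site + flank] on the same chromosome."""
    return read.chrom == chrom and read.start <= site + flank and read.end >= site - flank


def subset_reads(reads, sites, flank=200):
    """Keep reads that overlap the window around at least one site."""
    return [r for r in reads if any(in_window(r, c, s, flank) for c, s in sites)]


# Placeholder coordinates for illustration only.
sites = [("chrA", 10_000), ("chrB", 50_000)]
reads = [
    Read("chrA", 9_750, 9_850),    # overlaps the chrA window (9,800 to 10,200)
    Read("chrA", 10_300, 10_400),  # starts beyond the chrA window
    Read("chrB", 50_150, 50_250),  # overlaps the chrB window (49,800 to 50,200)
    Read("chrC", 10_000, 10_100),  # on a chromosome without a site
]
kept = subset_reads(reads, sites)
print(len(kept))  # 2
```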

<iframe src="https://docs.google.com/forms/d/e/1FAIpQLSeEKGWTJOowBhFlWftPUjFU8Rfj-d9iXIHENyd8_HGS8PM7kw/viewform?embedded=true" width="640" height="886" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe>
