Skip to content

Commit

Permalink
Merge branch 'dev' of https://github.com/ARUP-NGS/BMFtools into dev
Browse files Browse the repository at this point in the history
  • Loading branch information
dnephi committed May 20, 2016
2 parents 4431426 + 61053a3 commit 9a2038e
Show file tree
Hide file tree
Showing 5 changed files with 84 additions and 44 deletions.
110 changes: 70 additions & 40 deletions MANUAL.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,36 +4,53 @@ BMFtools (**B**arcoded **M**olecular **F**amilies tools) is a suite of tools for
Reads which are PCR duplicates of original template molecules are **molecularly** demultiplexed into single unique observations for each sequenced founding template molecule.
Accessory tools provide postprocessing, filtering, quality control, and summary statistics.

### Workflow & Typical Use
##Index

[Workflow](#workflow)

[Tools](#tools)

[Usage](#usage)

See also [Vignettes](https://github.com/ARUP-NGS/BMFtools/blob/dev/Vignettes.md) in a separate document for use cases.

###Workflow

Reads are originally collapsed into single observations per barcode using exact matching. Following this step, to correct for errors in barcode reading, positional information can be used to rescue reads into unique single observations.

The only difference between inline and secondary index chemistry workflows is the initial `bmftools dmp` call.

####Exact-Matching Fastq Consolidation
A typical paired-end exact fastq-stage molecular demultiplexing call:

> Inline
`bmftools dmp -s <homing_sequence> -l <barcode_length> -o <temporary_file_prefix> -p <threads> -f <final_output_prefix> <r1.fq.gz> <r2.fq.gz>`
```bmftools dmp -s <homing_sequence> -l <barcode_length> -o <temporary_file_prefix> -p <threads> -f <final_output_prefix> <r1.fq.gz> <r2.fq.gz>```
> Secondary Index
`bmftools sdmp -o <temporary_file_prefix> -p <threads> -f <final_output_prefix> -i <index.fq.gz> <r1.fq.gz> <r2.fq.gz>`
```bmftools sdmp -o <temporary_file_prefix> -p <threads> -f <final_output_prefix> -i <index.fq.gz> <r1.fq.gz> <r2.fq.gz>```

Each of these produces `final_output_prefix.R1.fq` and `final_output_prefix.R2.fq`. (A .gz suffix is appended if the output is gzip compressed.)

Barcode metadata is written into fastq comments in SAM auxiliary tag format. These reads are then aligned with
bwa mem with the -C option, which appends the fastq comment to the end of the sam record. This trivially adds
tags to all alignments for each read.

####Alignment

`bwa mem -CYT0 -t<threads> <idx.base> final_output_prefix.R1.fq final_output_prefix.R2.fq | samtools view -bho final_output.bam`


####Rescue
Because errors occur in reading barcodes, this initial exact-matching step is not completely successful in grouping
reads from the same original template molecule. To account for this, an optional rescue protocol has been implemented.

This rescue takes place in two steps -- first, a sort which groups together based on alignment signature, and second,
collapsing reads sharing these signatures with similar barcodes into single observations.

Alignment signatures consist of a read and its mate's alignment information. These can be grouped by start position
or by unclipped start position. Unclipped start position is likely less sensitive to errors in the reads, whereas
a bam sorted by signature using position can still be indexed for traditional use. Unclipped start position therefore
requires an additional sorting step.
"Alignment signatures" consist of a read and its mate's alignment information, if paired. These can be grouped by start position
or by unclipped start position. Unclipped start position is less sensitive to errors in the reads, whereas
a bam sorted by signature using position can still be indexed for traditional use. Unclipped start position comes at the computational cost of an additional sort but with potentially increased success in rescue.

Because reads need both their and their mates' alignment information, including read length, the preprocessing
`bmftools mark` is required prior to bmftools sort.
Expand All @@ -49,23 +66,29 @@ If secondary and supplementary alignments are needed for other reads,these shoul
`bmftools mark -l0 final_output.bam | sort -k <ucs/bmf> -o <final_output_prefix.bmfsort.bam> -`

For position:

`bmftools rsq -f<tmp.fq> <final_output_prefix.bmfsort.bam> <final_output_prefix.tmprsq.bam>`

For unclipped start:

`bmftools rsq [-u <unclipped start only>] -f<tmp.fq> <final_output_prefix.bmfsort.bam> - | samtools sort -O bam -T<tmp_prefix> -ofinal_output_prefix.tmprsq.bam`


Realigned reads are then sorted and merged in with the other reads in the dataset.

~bwa mem -pCYT0 -t<threasd> <reference> -f<tmp.fq> | bmftools mark | samtools sort -l 0 -Obam -T <tmp_prefix> | samtools merge -cpfh final_output_prefix.tmprsq.bam final_output_prefix.rsqmerged.bam final_output_prefix.tmprsq.bam -`
`bwa mem -pCYT0 -t<threasd> <reference> -f<tmp.fq> | bmftools mark | samtools sort -l 0 -Obam -T <tmp_prefix> | samtools merge -cpfh final_output_prefix.tmprsq.bam final_output_prefix.rsqmerged.bam final_output_prefix.tmprsq.bam -`

For efficiency, this can be heavily piped to reduce I/O and unnecessary compression/decompression.
At this point, final_output_prefix.tmprsq.bam contains supplementary and secondary alignments for all template molecules, and reads have been rescued using alignment information.

####Post-rescue: Now what?

If you're not interested in taking advantage of the barcode metadata (new p values, family sizes, &c.), this bam is ready for downstream analysis.

For more specific use cases, postprocessing steps (such as cap and filter) can be used to prepare a bam for use by BMF-agnostic tools, and variant calling can be performed or vetted with `bmftools stack` or `bmftools vet`.

For translocation detection, soft-clippings are often used as markers for potential events. To improve performance of such tools, (e.g., WHAM), if adapters sequences have been masked, successive masked bases at the ends of reads can be removed by [maskripper](https://github.com/noseatbelts/maskripper). For rescue to work properly, however, this should be performed only after rescue has been completed.

##Tools


Expand Down Expand Up @@ -146,15 +169,15 @@ bmftools <subcommand> <-h>

`bmftools vet -o output.vcf -b capture.bed --min-family-size 3 input.bcf input.bam`

##Commands and options
##Usage

### Core Functionality

####<b>dmp</b>
Description:
> Performs molecular demutiplexing of inline barcoded fastq data.
> The homing sequence is a sequence of bases marking the end of the random nucleotides
> which make up the barcode. This is required to ensure that chemistr has worked as expected
> which make up the barcode. This is required to ensure that chemistry has worked as expected
> and to identify the barcode length, which is used for additional entropy.
> To achieve linear performance with arbitrarily large datasets, an initial marking step subsets the reads by the first
Expand Down Expand Up @@ -267,28 +290,29 @@ bmftools <subcommand> <-h>

TODO: Fill in details on these tags.
VCF Header Fields:
1. BMF_PASS
2. ADP
3. ADPO
3. ADPD
4. ADPR
5. RVF
6. QSS
7. AMBIG
8. SOMATIC_PV
9. SOMATIC_CALL
10. SOMATIC
11. FR_FAILED
12. FM_FAILED
13. FP_FAILED
14. AF_FAILED
15. MQ_FAILED
16. IMPROPER
17. OVERLAP
* BMF_PASS: 1 if variant passes, 0 otherwise.
* ADP: Number of unique observations for allele.
* ADPO: Number of overlapping read pair observations for allele.
* ADPD: Number of duplex observations for allele.
* ADPR: Number of original reversed observations for allele.
* RVF: Fraction of RV observations supporting allele.
* QSS: Q Score Sum supporting allele.
* AMBIG: Number of ambiguous base calls at position.
* SOMATIC_CALL: Boolean value for a somatic call for allele.
* FR_FAILED: Number of observations failed by fraction of family members agreed on a base call per sample.
* FM_FAILED: Number of observations failed for insufficient family size per sample.
* FP_FAILED: Number of observations failed for failing barcode QC per sample.
* AF_FAILED: Number of observations failed for insufficient aligned fraction.
* MQ_FAILED: Number of observations failed for insufficient mapping quality.
* IMPROPER: Number of observations failed for being in an improper pair.
* OVERLAP: Number of overlapping read pairs at position.
* AFR: Allele Fractions per allele, including reference.

####<b>vet</b>
Description:
> Curates variant calls from a bcf file and an associated, indexed bam file.
> Overlapping reads in a pileup are treated as two observations and are merged, if agreed, using Fisher's method.
> Discordant unambiguous base calls are masked. Ambiguous base calls are overridden by an unambiguous call from its mate.

Options:
> -b, --bed-path: Path to bed file for anaylsis. REQUIRED.
Expand All @@ -307,17 +331,15 @@ bmftools <subcommand> <-h>
> -q, --skip-qc-fail: Skip reads marked as QC fail.
> -F, --skip-recommended: Skip secondary, supplementary, and PCR duplicates.
> -B, --emit-bcf-format: Emit bcf-formatted output instead of vcf.
> -w, --write-outside-bed: Write variants outside of bed region unmodified rather than removing.

TODO: Fill in details on these tags.
VCF Header Fields:
1. BMF_VET
2. BMF_UNIOBS
3. BMF_DUPLEX
4. BMF_FAIL
5. DUPLEX_DEPTH
6. DISC_OVERLAP
7. OVERLAP
* BMF_VET: 1 if a variant passes, 0 otherwise.
* BMF_UNIOBS: Number of unique observations supporting a variant at that position.
* BMF_DUPLEX: Number of duplex observations supporting a variant at that position.
* BMF_FAIL: NUmber of reads at position failing filters.
* DUPLEX_DEPTH: Number of duplex reads at position passing filters.
* DISC_OVERLAP: Number of read pairs at position with discordant base calls.
* OVERLAP: Number of overlapping read pairs combined into single observations at position.


### Manipulation
Expand Down Expand Up @@ -383,6 +405,7 @@ bmftools <subcommand> <-h>
Usage: bmftools target <opts> <in.bam>

Options:

> -b: Path to bed. REQUIRED.
> -m: Set minimum mapping quality for inclusion.
> -p: Set padding - number of bases around target region to consider as on-target. Default: 0.
Expand All @@ -404,6 +427,7 @@ bmftools <subcommand> <-h>
Usage: bmftools err main <opts> <reference.fasta> <in.csrt.bam>

Options:

> -o: Path to output file. Set to '-' or 'stdout' to emit to stdout.
> -a: Set minimum mapping quality for inclusion.
> -S: Set minimum calculated PV tag value for inclusion.
Expand All @@ -426,6 +450,7 @@ bmftools <subcommand> <-h>
Usage: bmftools err fm <opts> <reference.fasta> <in.csrt.bam>

Options:

> -o: Path to output file. Set to '-' or 'stdout' to emit to stdout.
> -h/-?: Print usage.
> -S: Set minimum calculated PV tag value for inclusion.
Expand All @@ -441,6 +466,7 @@ bmftools <subcommand> <-h>
Usage: bmftools err region <opts> <reference.fasta> <in.csrt.bam>

Options:

> -b: Path to bed file. REQUIRED.
> -o: Path to output file. Leave unset or set to '-' or 'stdout' to emit to stdout.
> -a: Set minimum mapping quality for inclusion.
Expand All @@ -461,6 +487,7 @@ bmftools <subcommand> <-h>
Usage: bmftools famstats fm <opts> <in.bam>

Options:

> -m: Set minimum mapping quality. Default: 0.
> -f: Set minimum family size. Default: 0.

Expand All @@ -479,6 +506,7 @@ bmftools <subcommand> <-h>
> Essentially a modification of samtools sort.
Options:

> -l INT Set compression level, from 0 (uncompressed) to 9 (best)
> -m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
> -k Sort key - pos for positional (samtools default), qname for query name, bmf for extended positional, ucs for using unclipped mate start/stop positions. Default: bmf comparison.
Expand All @@ -500,6 +528,7 @@ bmftools <subcommand> <-h>
Usage: bmftools mark <opts> <input.namesrt.bam> <output.bam>

Options:

> -l: Sets bam compression level. (Valid: 1-9). Default: 0.
> -q: Skip read pairs which fail.
> -d: Set bam compression level to default (6).
Expand All @@ -519,7 +548,6 @@ bmftools <subcommand> <-h>
> In addition, use of rescaled quality scores is not yet supported.
> RAM-hungry but fast. Useful for relatively small datasets.

Usage: `bmftools inmem <options> input_R1.fastq.gz input_R2.fastq.gz`

Options:
Expand All @@ -537,14 +565,16 @@ bmftools <subcommand> <-h>

####<b>hashdmp</b>
Description:

> Contains the hashmap-powered consolidation only. Its input is the preprocessed marked bams produced
> by bmftools dmp and sdmp.
> Molecularly demultiplexes marked temporary fastqs into final unique observation records.
> bmftools hashdmp does so in one large hashmap. This may require huge amounts of memory.
Usage: `bmftools hashdmp <opts> <input.marked.fq>`

Options:
Usage: `bmftools hashdmp <opts> <input.marked.fq>`

Options:

> -s: Perform secondary index consolidation rather than Loeb-like inline consolidation.
> -o: Write to <parameter> rather than stdout.
1 change: 1 addition & 0 deletions Vignettes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
TODO: *this;
2 changes: 1 addition & 1 deletion src/bmf_depth.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -345,7 +345,7 @@ namespace BMF {
for(p = str.s, i = 0; i < 2 && p < str.s + str.l;*p++ == '\t' ? ++i: 0);
if(p != str.s + str.l) --p;
str.l = p - str.s;
LOG_INFO("Processing region #%i \"%s\". %0.4f%% Complete: %i of %i lines.\n", lineno, str.s, (lineno * 100.) / n_lines, lineno, n_lines);
LOG_DEBUG("Processing region #%i \"%s\". %0.4f%% Complete: %i of %i lines.\n", lineno, str.s, (lineno * 100.) / n_lines, lineno, n_lines);

for(i = 0; i < n; ++i) {
kputc('\t', &str);
Expand Down
10 changes: 8 additions & 2 deletions test/dmp/hashdmp_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,14 +28,20 @@ def main():
cstr = "../../%s hashdmp -o hashdmp_test.out hashdmp_test.fq" % ex
subprocess.check_call(shlex.split(cstr))
fqh = pysam.FastqFile("hashdmp_test.out")
r1 = fqh.next()
try:
r1 = fqh.next()
except:
r1 = next(fqh) # Python 3
tags = get_tags(r1)
assert tags["FM"] == 7
assert round(tags["NF"], 2) == 0.14
assert tags["RV"] == 2
assert tags["DR"]
assert len(r1.name) == 16
r1 = fqh.next()
try:
r1 = fqh.next()
except:
r1 = next(fqh) # Python 3
tags = get_tags(r1)
assert tags["FM"] == 1
assert tags["FP"] == 0
Expand Down
5 changes: 4 additions & 1 deletion test/rsq/rsq_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,10 @@
correct_string = "@CCATAATAACGCCAGTAT PV:B:I,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,104,98,78,93,104,98,79,104,79,78,91,93,102,91,78,79,93,93,98,79,104,104,104,93,93,79,79,79,93,93,93,104,104,79,79,98,104,104,104,104,98,102,78,79,79,93,79,93,96,79,91,102,98,93,79,93,93,78,91,91,93,98,78,79,91,91,91,78,79,79,104,98,102,93,93,96,91,93,93,98,79,93,79,91,104,76,76,78,104,79,93,93,79,78,91,78,79,91,78,79,93,102,104,104,102,79,91,91,104,61,65,65,67,67,67,67,67,67,67,67,26\tFA:B:I,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3\tFM:i:3\tFP:i:1\tRV:i:1\tNC:i:0\tNP:i:2\tDR:i:1\nNNNNNNNNNNNNNNNNAGCCTTGTGTTTCTGACAATATATTCTTCAACAGCAGCTAGAAAGTTGGTTCAAACCAACTTTTAATATACAGTAGTTCTTTTCATTTACATTTCAAAATATTTAACAAAGTCAAACTTTC\n+\n################IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGIIIIIIIIIIIIIIIIIIIIIIIGGIIIIIIII;"
def main():
subprocess.check_call("../../bmftools_db rsq -ftmp.fq rsq_test.bam rsq_test.out.bam 2> rsq_test.log", shell=True)
assert(subprocess.check_output("samtools view -c rsq_test.out.bam", shell=True).strip() == "0")
try:
assert(subprocess.check_output("samtools view -c rsq_test.out.bam", shell=True).strip() == "0")
except AssertionError:
assert(subprocess.check_output("samtools view -c rsq_test.out.bam", shell=True).strip().decode() == "0")
recs = list(pysam.FastqFile("tmp.fq"))
assert len(recs) == 2
try:
Expand Down

0 comments on commit 9a2038e

Please sign in to comment.