diff --git a/.travis.yml b/.travis.yml index e4327741..5f2121db 100644 --- a/.travis.yml +++ b/.travis.yml @@ -1,7 +1,7 @@ # # Using sudo-false/container-based tests for greater (linux) test responsiveness. This doesn't seem -# to effect the queing time for OSX tests. +# to affect the queueing time for OSX tests. # dist: trusty diff --git a/CHANGELOG.md b/CHANGELOG.md index 53fa35a1..5ba7a2f8 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,27 @@ - Add a configurable option to allow overlapping pairs to be used as evidence (MANTA-1398) - The option is available in the configure file configureManta.py.ini +### Changed +- Change SV candidate contig aligners to improve precision (MANTA-1396) + - Change contig aligners such that variant occurrences are more heavily penalized. +- Fix multi-junction nomination (MANTA-1430) + - Complex events with more than two junctions are no longer nominated as a group + - Fix the problem of duplicate detection of the same SV candidate +- Add index to ensure uniqueness of evidence bam filenames (MANTA-1431) + - This solves the potential problem of name conflicts for evidence bams when input bam files share the same name but are located in different directories. +- Change filters for easy interpretation of multi-sample germline variant vcf (MANTA-1343) + - Add record-level filter 'SampleFT' when no sample passes all sample-level filters + - Add sample-level filter 'HomRef' for homozygous reference calls + - Sample-level filters are no longer applied at the record level, even if they apply to all samples +- Change representation of inversions in the VCF output (MANTA-1385) + - Intrachromosomal translocations with inverted breakpoints are now reported as two breakend (BND) records. + - Previously they were reported in the VCF using the inversion (INV) allele type. + +### Fixed +- Fix a bug in stats generation with short reference sequences (MANTA-1459/[#143]) +- Fix the evidence significance test in the multi-sample calling mode (MANTA-1294) + - This issue previously caused spurious false negatives in multi-sample calling mode. The incidence rate of the problem tended to increase with sample count. + ## v1.4.0 - 2018-04-25 This is a major bugfix update from v1.3.2, featuring improved precision and vcf representation, in addition to minor user friendly improvements. diff --git a/README.md b/README.md index 87837006..b82566ae 100644 --- a/README.md +++ b/README.md @@ -31,8 +31,8 @@ indels for germline and cancer sequencing applications. *Bioinformatics*, ...and the corresponding [open-access pre-print][preprint]. -[bpaper]:https://dx.doi.org/10.1093/bioinformatics/btv710 -[preprint]:http://dx.doi.org/10.1101/024232 +[bpaper]:https://doi.org/10.1093/bioinformatics/btv710 +[preprint]:https://doi.org/10.1101/024232 License diff --git a/docs/developerGuide/README.md b/docs/developerGuide/README.md index 3802deca..d90c2c42 100644 --- a/docs/developerGuide/README.md +++ b/docs/developerGuide/README.md @@ -24,7 +24,7 @@ Manta Developer Guide * [Commit messages](#commit-messages) * [Commit consolidation](#commit-consolidation) * [Changelog conventions](#changelog-conventions) -* [Branching and release tagging guidelines](#branching-and-release-tagging-guidelines) + * [Branching and release tagging guidelines](#branching-and-release-tagging-guidelines) * [Error handling](#error-handling) * [General Policies](#general-policies) * [Exception Details](#exception-details) @@ -240,7 +240,7 @@ prior to merging the branch. 
longer, for instance by starting all major bullet points with an imperitive verb. -## Branching and release tagging guidelines +### Branching and release tagging guidelines All features and bugfixes are developed on separate branches. Branch names should contain the corresponding JIRA ticket id or contain the key "github${issueNumber}' to refer to the corresponding issue on github.com. After code diff --git a/docs/methods/primary/methods.tex b/docs/methods/primary/methods.tex index 0f998ef8..9cec6048 100644 --- a/docs/methods/primary/methods.tex +++ b/docs/methods/primary/methods.tex @@ -350,7 +350,7 @@ \subsubsection{Contig assembly} Finally, a greedy procedure is applied to select the constructed contigs in the order of the number of effective supporting reads and contig length. An effective supporting read cannot be a psuedo read, nor support any contigs that have been selected previously. The selection process is repeated until there is no more contig available with the minimum number of effective supporting reads (defaults to 2), or the maximum number of assembled contigs (defaults to 10) is met. -\subsubsection{Contig alignment for large SVs} For large SV candidates spanning two distinct regions of the genome, the reference sequences are extracted from the two expected breakend regions, and the order and/or orientation of the references is adjusted such that if the candidate SV exists, the left-most segment of the SV contig should align to the first transformed reference region and the right-most contig segment should align to the second reference region. The contig is aligned across the two reference regions using a variant of Smith-Waterman-Gotoh alignment (\cite{smith1981,gotoh1982}) where a `jump' state is included which can only be entered from the match state for the first reference segment and only exits to the match or insert states of the second reference segment. The state transitions of this alignment scheme are shown in Figure \ref{fig:jumpstate} +\subsubsection{Contig alignment for large SVs} For large SV candidates spanning two distinct regions of the genome, the reference sequences are extracted from the two expected breakend regions, and the order and/or orientation of the references is adjusted such that if the candidate SV exists, the left-most segment of the SV contig should align to the first transformed reference region and the right-most contig segment should align to the second reference region. The contig is aligned across the two reference regions using a variant of Smith-Waterman-Gotoh alignment (\cite{smith1981,gotoh1982}) where a `jump' state is included which can only be entered from the match state for the first reference segment and only exits to the match or insert states of the second reference segment. The state transitions of this alignment scheme are shown in Figure \ref{fig:jumpstate}. \begin{figure}[!tpb] \centerline{ @@ -362,15 +362,15 @@ \subsubsection{Contig alignment for large SVs} For large SV candidates spanning \label{fig:jumpstate} \end{figure} -The alignment scores used for each reference segment are (2,-8,-12,-1) for match, mismatch, gap open and gap extend. Switching between insertion and deletion states is allowed at no cost. Scores to transition into and extend the 'jump' state are -24 and 0, respectively. The jump state is entered from any point in reference segment 1 and exits to any point in reference segment 2. The alignments resulting from this method are only used when a transition through the jump state occurs. 
In addition, each of the two alignment segments flanking the jump state are required to extend at least 30 bases with an alignment score no less than 75\% of the perfect match score for the flanking alignment segment. If more than one contig meets all quality criteria the contig with the highest alignment score is selected. When a contig and alignment meet all quality criteria, the reference orientation and ordering transformations applied before alignment are reversed to express the refined basepair-resolution structural variant candidate in standard reference genome coordinates. +The alignment scores used for each reference segment are (2,-8,-12,-1) for match, mismatch, gap open and gap extend. Switching between insertion and deletion states is allowed at no cost. Scores to transition into and extend the 'jump' state are -100 and 0, respectively. The jump state is entered from any point in reference segment 1 and exits to any point in reference segment 2. The alignments resulting from this method are only used when a transition through the jump state occurs. In addition, each of the two alignment segments flanking the jump state is required to extend at least 30 bases with an alignment score no less than 75\% of the perfect match score for the flanking alignment segment. If more than one contig meets all quality criteria, the contig with the highest alignment score is selected. When a contig and alignment meet all quality criteria, the reference orientation and ordering transformations applied before alignment are reversed to express the refined basepair-resolution structural variant candidate in standard reference genome coordinates. \subsubsection{Contig alignment for complex region candidates} -Complex regions are segments of the genome targeted for assembly without a specific variant hypothesis. For this reason the problem of aligning contigs for these regions is somewhat more difficult than for specific large SV candidates, because a wide range of variant sizes are possible. This is reflected in the alignment procedure for complex region contigs, which are checked against two aligners optimized for large and small indels respectively. +Complex regions are segments of the genome targeted for assembly without a specific variant hypothesis. For this reason the problem of aligning contigs for these regions is somewhat more difficult than for specific large SV candidates, because a wide range of variant sizes are possible. This is reflected in the use of a single indel aligner that handles both small and large indels. -A contig is first aligned with the large indel aligner and only checked for small indels if no large indels are found. The structure of the large indel aligner is a variant on a standard affine-gap scheme, in which a second pair of delete and insert states are added for large indels. Alignment scores for standard alignment states are (2, -8, -18, -1) for match, mismatch, gap open, and gap extend. Open and extend scores for 'large' gaps are -24 and 0. Transitions are allowed between standard insertions and deletions but disallowed between the large indel states. Variants are only reported from the large indel aligner if an insertion of at least 80 bases or a deletion of at least 200 bases is found. The flanking alignment quality criteria described above for large SVs is also applied to filter out noisy alignments. 
To reduce false positive calls in repetitive regions an additional filter is applied to complex region candidates: the left and right segments of the contig flanking a candidate indel are checked for uniqueness in the local reference context. Contig alignments are filtered out if either of the two flanking contig segments can be aligned equally well to multiple locations within 500bp of the target reference region. +The indel aligner is a variant on a standard affine-gap scheme, in which a second pair of delete and insert states are added for large indels. Alignment scores for standard alignment states are (2, -8, -24, -1) for match, mismatch, gap open, and gap extend. Open and extend scores for 'large' gaps are -100 and 0. Transitions are allowed between standard insertions and deletions but disallowed between the large indel states. -If the large indel aligner fails to identify a candidate meeting the size and quality criteria above, the contig is used to search for smaller indels, this time using a conventional affine gap aligner with parameters: (2,-8,-12,0) for match, mismatch, gap open, gap extend. All indels larger than the minimum indel size are identified. For each indel, the flanking contig alignment quality and uniqueness checks described above are applied to filter likely false positives, and any remaining cases become small indel candidates. +All indels larger than the minimum indel size are identified by the indel aligner. For each indel, the flanking alignment quality criteria described above for large SVs are also applied to filter out noisy alignments. To further reduce false positive calls in repetitive regions, an additional filter is applied to complex region candidates: the left and right segments of the contig flanking a candidate indel are checked for uniqueness in the local reference context. Contig alignments are filtered out if either of the two flanking contig segments can be aligned equally well to multiple locations within 500bp of the target reference region. Among contigs meeting all quality criteria, the ones with 'large' gaps are prioritized during contig selection. If more than one contig has a 'large' gap, or if no contig has a 'large' gap, the contig with the highest alignment score is selected. 
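The revised aligner description above implies a two-level preference when picking among assembled contigs: contigs whose best alignment path passes through a 'large' gap (jump) state outrank those without one, and ties within either group are broken by alignment score. The short C++ sketch below only illustrates that rule under assumed, hypothetical names (`ContigAlignmentSummary`, `selectContig`); it is not part of the Manta patch itself.

```
// Illustrative sketch of the contig selection preference described above.
// Hypothetical types and names -- not Manta source code.
#include <vector>

struct ContigAlignmentSummary
{
    int  score = 0;           // alignment score of the contig's best alignment
    bool hasLargeGap = false; // true if the alignment path uses a 'large' gap (jump) state
};

// Returns the index of the preferred contig, or -1 for empty input.
inline int
selectContig(const std::vector<ContigAlignmentSummary>& contigs)
{
    int selected(-1);
    for (int index(0); index < static_cast<int>(contigs.size()); ++index)
    {
        if (selected < 0)
        {
            selected = index;
            continue;
        }
        const ContigAlignmentSummary& best(contigs[selected]);
        const ContigAlignmentSummary& cand(contigs[index]);
        if (cand.hasLargeGap != best.hasLargeGap)
        {
            // a contig containing a 'large' gap always outranks one without
            if (cand.hasLargeGap) selected = index;
        }
        else if (cand.score > best.score)
        {
            // within the same group, keep the higher-scoring contig
            selected = index;
        }
    }
    return selected;
}
```

In the accompanying C++ changes the same preference is expressed by recording whether the backtraced alignment passed through a JUMP/JUMPINS state (the `isJumped` flag) and only comparing scores between contigs that agree on that flag.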
\subsubsection{Large Insertions} diff --git a/docs/userGuide/README.md b/docs/userGuide/README.md index ac6526bd..ccce7e46 100644 --- a/docs/userGuide/README.md +++ b/docs/userGuide/README.md @@ -14,7 +14,7 @@ Manta User Guide * [Input requirements](#input-requirements) * [Outputs](#outputs) * [Structural Variant predictions](#structural-variant-predictions) - * [Manta VCF reporting format](#manta-vcf-reporting-format) + * [Manta VCF interpretation](#manta-vcf-interpretation) * [VCF Sample Names](#vcf-sample-names) * [Small indels](#small-indels) * [Insertions with incomplete insert sequence assembly](#insertions-with-incomplete-insert-sequence-assembly) @@ -22,8 +22,9 @@ Manta User Guide * [VCF INFO Fields](#vcf-info-fields) * [VCF FORMAT Fields](#vcf-format-fields) * [VCF FILTER Fields](#vcf-filter-fields) - * [How to interpret VCF filters?](#how-to-interpret-vcf-filters) - * [What do the values in Manta's VCF ID field mean?](#what-do-the-values-in-mantas-vcf-id-field-mean) + * [Interpretation of VCF filters](#interpretation-of-vcf-filters) + * [Interpretation of Manta's INFO/EVENT field](#interpretation-of-mantas-infoevent-field) + * [Details of Manta's VCF ID field](#details-of-mantas-vcf-id-field) * [Converting Manta VCF to BEDPE format](#converting-manta-vcf-to-bedpe-format) * [Statistics](#statistics) * [Runtime hardware requirements](#runtime-hardware-requirements) @@ -289,7 +290,7 @@ For tumor-only analysis, Manta will produce an additional VCF: counts for each allele (2) a subset of the filters from the scored tumor-normal model are applied to the single tumor case to improve precision. -### Manta VCF reporting format +### Manta VCF interpretation Manta VCF output follows the VCF 4.1 spec for describing structural variants. It uses standard field names wherever possible. All custom @@ -336,13 +337,17 @@ chr1 11830208 MantaINS:1577:0:0:0:3:0 T 999 PASS #### Inversions -Inversions are reported as a single inverted sequence junction. As described in the [VCF INFO Fields](#vcf-info-fields) below, the INV3 tag indicates inversion breakends open at the 3' of reported location, whereas the INV5 tag indicates inversion breakends open at the 5' of reported location. More specifically, in the inversion exmaples illustrated at https://software.broadinstitute.org/software/igv/interpreting_pair_orientations, the INV5 tag corresponds to the IGV "RR"/dark blue reads, and the INV3 tag corresponds to the IGV "LL"/ light blue reads. - -This format is used because single inverted junctions are often identified as part of a complex SV in real data, whereas simple reciprocal inversions are uncommon outside of simulated data. For a simple reciprocal inversion, both INV3 and INV5 junctions are expected to be reported, and they shall share the same `EVENT` INFO tag. The following is an example of a simple reciptocal inversion: - +Inversions are reported as breakends by default. For a simple reciprocal inversion, four breakends will be reported, and they shall share the same `EVENT` INFO tag. 
The following is an example of a simple reciprocal inversion: +``` +chr1 17124941 MantaBND:1445:0:1:1:3:0:0 T [chr1:234919886[T 999 PASS SVTYPE=BND;MATEID=MantaBND:1445:0:1:1:3:0:1;CIPOS=0,1;HOMLEN=1;HOMSEQ=T;INV5;EVENT=MantaBND:1445:0:1:0:0:0:0;JUNCTION_QUAL=254;BND_DEPTH=107;MATE_BND_DEPTH=100 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:65,8:15,51 +chr1 17124948 MantaBND:1445:0:1:0:0:0:0 T T]chr1:234919824] 999 PASS SVTYPE=BND;MATEID=MantaBND:1445:0:1:0:0:0:1;INV3;EVENT=MantaBND:1445:0:1:0:0:0:0;JUNCTION_QUAL=999;BND_DEPTH=109;MATE_BND_DEPTH=83 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:60,2:0,46 +chr1 234919824 MantaBND:1445:0:1:0:0:0:1 G G]chr1:17124948] 999 PASS SVTYPE=BND;MATEID=MantaBND:1445:0:1:0:0:0:0;INV3;EVENT=MantaBND:1445:0:1:0:0:0:0;JUNCTION_QUAL=999;BND_DEPTH=83;MATE_BND_DEPTH=109 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:60,2:0,46 +chr1 234919885 MantaBND:1445:0:1:1:3:0:1 A [chr1:17124942[A 999 PASS SVTYPE=BND;MATEID=MantaBND:1445:0:1:1:3:0:0;CIPOS=0,1;HOMLEN=1;HOMSEQ=A;INV5;EVENT=MantaBND:1445:0:1:0:0:0:0;JUNCTION_QUAL=254;BND_DEPTH=100;MATE_BND_DEPTH=107 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:65,8:15,51 +``` +A supplementary script, provided as `$MANTA_INSTALL_FOLDER/libexec/convertInversion.py`, can be applied to Manta's output vcf files to reformat inversions into single inverted sequence junctions, which was the format used in Manta versions <= 1.4.0. Two INFO tags are introduced for this format: the INV3 tag indicates inversion breakends open at the 3' of reported location, whereas the INV5 tag indicates inversion breakends open at the 5' of reported location. More specifically, in the inversion examples illustrated at https://software.broadinstitute.org/software/igv/interpreting_pair_orientations, the INV5 tag corresponds to the IGV "RR"/dark blue reads, and the INV3 tag corresponds to the IGV "LL"/light blue reads. This format was informative because single inverted junctions are often identified as part of a complex SV in real data, whereas simple reciprocal inversions are uncommon outside of simulated data. For a simple reciprocal inversion, both INV3 and INV5 junctions are expected to be reported, and they shall share the same `EVENT` INFO tag. 
The following is the converted format of the above example of a simple reciprocal inversion: ``` -chr1 17124940 MantaINV:3630:0:1:1:4:0 C 999 PASS END=234919885;SVTYPE=INV;SVLEN=217794945;INV5;EVENT=MantaINV:3630:0:1:0:0:0;JUNCTION_QUAL=999; GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:61,4:24,43 -chr1 17124943 MantaINV:3630:0:1:0:0:0 T 999 PASS END=234919824;SVTYPE=INV;SVLEN=217794881;INV3;EVENT=MantaINV:3630:0:1:0:0:0;JUNCTION_QUAL=999; GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:52,3:8,29 +chr1 17124940 MantaINV:1445:0:1:1:3:0 C 999 PASS END=234919885;SVTYPE=INV;SVLEN=217794945;CIPOS=0,1;CIEND=-1,0;HOMLEN=1;HOMSEQ=T;EVENT=MantaINV:1445:0:1:0:0:0;JUNCTION_QUAL=254;INV5 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:65,8:15,51 +chr1 17124948 MantaINV:1445:0:1:0:0:0 T 999 PASS END=234919824;SVTYPE=INV;SVLEN=217794876;EVENT=MantaINV:1445:0:1:0:0:0;JUNCTION_QUAL=999;INV3 GT:FT:GQ:PL:PR:SR 0/1:PASS:999:999,0,999:60,2:0,46 ``` #### VCF INFO Fields @@ -364,8 +369,6 @@ SVINSLEN | Length of insertion SVINSSEQ | Sequence of insertion LEFT_SVINSSEQ | Known left side of insertion for an insertion of unknown length RIGHT_SVINSSEQ | Known right side of insertion for an insertion of unknown length -INV3 | Flag indicating that inversion breakends open 3' of reported location -INV5 | Flag indicating that inversion breakends open 5' of reported location BND_DEPTH | Read depth at local translocation breakend MATE_BND_DEPTH | Read depth at remote translocation mate breakend JUNCTION_QUAL | If the SV junction is part of an EVENT (ie. a multi-adjacency variant), this field provides the QUAL value for the adjacency in question only @@ -387,25 +390,41 @@ SR | Number of split-reads which strongly (Q30) support the REF or ALT alleles #### VCF FILTER Fields -ID | Description --- | --- -MinQUAL | QUAL score is less than 20 -MinGQ | GQ score is less than 15 (filter applied at sample level and record level if all samples are filtered) -MinSomaticScore | SOMATICSCORE is less than 30 -Ploidy | For DEL & DUP variants, the genotypes of overlapping variants (with similar size) are inconsistent with diploid expectation -MaxDepth | Depth is greater than 3x the median chromosome depth near one or both variant breakends -MaxMQ0Frac | For a small variant (<1000 bases), the fraction of reads in all samples with MAPQ0 around either breakend exceeds 0.4 -NoPairSupport | For variants significantly larger than the paired read fragment size, no paired reads support the alternate allele in any sample +ID | Level | Description +--- | --- | --- +MinQUAL | Record | QUAL score is less than 20 +MinGQ | Sample | GQ score is less than 15 +MinSomaticScore | Record | SOMATICSCORE is less than 30 +Ploidy | Record | For DEL & DUP variants, the genotypes of overlapping variants (with similar size) are inconsistent with diploid expectation +MaxDepth | Record | Depth is greater than 3x the median chromosome depth near one or both variant breakends +MaxMQ0Frac | Record | For a small variant (<1000 bases), the fraction of reads in all samples with MAPQ0 around either breakend exceeds 0.4 +NoPairSupport | Record | For variants significantly larger than the paired read fragment size, no paired reads support the alternate allele in any sample +SampleFT | Record | No sample passes all the sample-level filters +HomRef | Sample | Homozygous reference call + +#### Interpretation of VCF filters + +As described above, there are two levels of filters: record level (FILTER) and sample level (FORMAT/FT). 
Record-level filters are generally independant to sample-level filters. However, if none of the samples passes all sample-level filters, the 'SampleFT' filter will be applied at the record level. -#### How to interpret VCF filters? +#### Interpretation of Manta's INFO/EVENT field -As described above, there are two levels of filters: record level (FILTER) and sample level (FORMAT/FT). Record-level filters are generally independant to sample-level filters. However, if none of the samples passes one record-level filter, that filter will be copied to the record level (e.g. MinGQ). +Some structural variants reported in the VCF, such as translocations, represent a single novel sequence junction in the +sample. Manta uses the `INFO/EVENT` field to indicate that two or more such junctions are hypothesized to occur +together as part of a single variant event. All individual variant records belonging to the same event will share +the same `INFO/EVENT` string. Note that although such an inference could be applied after SV calling by analyzing +the relative distance and orientation of the called variant breakpoints, +Manta incorporates this event mechanism into the calling process to increase sensitivity towards such larger-scale +events. Given that at least one junction in the event has already passed standard variant candidacy thresholds, +sensitivity is improved by lowering the evidence thresholds for additional junctions which occur in a pattern +consistent with a multi-junction event (such as a reciprocal translocation pair). -A sample-specific passing variant needs to have the record level FILTER passed, the sample level FORMAT/FT passed, and the sample level FORMAT/GT is not "0/0"(hom-reference). +Note that although this mechanism could generalize to events including an arbitrary number of junctions, +it is currently limited to 2. Thus, at present it is most useful for identifying and improving sensitivity +towards reciprocal translocation pairs. -#### What do the values in Manta's VCF ID field mean? +#### Details of Manta's VCF ID field -The VCF ID or 'identifer' field can be used for annotation, or in the case of BND ('breakend') records for translocations, the ID value is used to link breakend mates or partners. +The VCF ID or 'identifier' field can be used for annotation, or in the case of BND ('breakend') records for translocations, the ID value is used to link breakend mates or partners. An example Manta VCF ID is "MantaINS:1577:0:0:0:3:0". The value provided in this field reflects the SV association graph edge(s) from which the SV or indel was discovered. The ID value provided by Manta is primarily intended for internal use by manta developers. The value is guaranteed to be unique within any VCF file produced by Manta, and these ID values are used to link associated breakend records using the standard VCF `MATEID` key. The structure of this ID may change in the future, it is safe to use the entire value as a unique key, but parsing this value may lead to incompatibilities with future updates. @@ -588,7 +607,7 @@ Using the `--generateEvidenceBam` option, Manta can be configured to generate ba It is recommended to use this option together with the `--region` option, so that the analysis is limited to relatively small genomic regions for debugging purposes. -The evidence bam files are provided in `${MANTA_ANALYSIS_PATH}/results/evidence`, with a naming format `evidence.*.bam`. 
+The evidence bam files are provided in `${MANTA_ANALYSIS_PATH}/results/evidence`, with a naming format `evidence_*.*.bam`. There is one such file for each input bam of the analysis, containing evidence reads of the candidate SVs identified from that input bam. Each read in an evidence bam keeps all information from the original bam, and it contains also a customized tag in the format: `ZM:Z:${MANTA_SV_ID_1}|${EVIDENCE_TYPE},${MANTA_SV_ID_2}|${EVIDENCE_TYPE}`. For example, ZM:Z:MantaINV:5:0:1:0:0:0|PR|SRM,MantaDEL:5:1:2:0:0:0|SR * One read can have more than one of the three evidence types: PR for paired reads, SR for split reads, and SRM for split read mates. @@ -719,13 +738,13 @@ together if a more accurate filter is required. The status of a call's `IMPRECIS of its reliability. For example, in the unpaired tumor analysis output below, the records could be filtered to only include those with -`SAMPLE/PR[1] >= 15 || SAMPLE/SR[1] >= 15`. This would remove the inversion record, because the paired-read count -for the inversion allele is 13 and the split-read count is not known. The two translocation breakends would not be +`SAMPLE/PR[1] >= 15 || SAMPLE/SR[1] >= 15`. This would remove the deletion record, because the paired-read count +for the deletion allele is 13 and the split-read count is not known. The two translocation breakends would not be filtered because they have 15 and 19 split-read counts, respectively, supporting the breakend allele: ``` 11 94975747 MantaBND:0:2:3:0:0:0:1 G G]8:107653520] . PASS SVTYPE=BND;MATEID=MantaBND:0:2:3:0:0:0:0;CIPOS=0,2;HOMLEN=2;HOMSEQ=TT;BND_DEPTH=216;MATE_BND_DEPTH=735 PR:SR 722,9:463,15 -11 94975753 MantaINV:0:1:2:0:0:0 T . PASS END=94987865;SVTYPE=INV;SVLEN=12112;IMPRECISE;CIPOS=-156,156;CIEND=-150,150;INV3 PR 161,13 +11 94975753 MantaDEL:0:1:2:0:0:0 T . PASS END=94987865;SVTYPE=DEL;SVLEN=12112;IMPRECISE;CIPOS=-156,156;CIEND=-150,150 PR 161,13 11 94987872 MantaBND:0:0:1:0:0:0:0 T T[8:107653411[ . 
PASS SVTYPE=BND;MATEID=MantaBND:0:0:1:0:0:0:1;BND_DEPTH=171;MATE_BND_DEPTH=830 PR:SR 489,4:520,19 ``` diff --git a/src/c++/lib/alignment/AlignmentScoringUtilImpl.hh b/src/c++/lib/alignment/AlignmentScoringUtilImpl.hh index d1fcbbe0..9813ab44 100644 --- a/src/c++/lib/alignment/AlignmentScoringUtilImpl.hh +++ b/src/c++/lib/alignment/AlignmentScoringUtilImpl.hh @@ -88,10 +88,10 @@ getPathScore( } #ifdef DEBUG_PATHSCORE - log_os << __FUNCTION__ - << " path.type=" << ps.type - << " path.length=" << ps.length - << " val=" << val << "\n"; + log_os << __FUNCTION__ + << " path.type=" << ps.type + << " path.length=" << ps.length + << " val=" << val << "\n"; #endif } return val; diff --git a/src/c++/lib/alignment/GlobalLargeIndelAlignerImpl.hh b/src/c++/lib/alignment/GlobalLargeIndelAlignerImpl.hh index 69feb443..18804dd9 100644 --- a/src/c++/lib/alignment/GlobalLargeIndelAlignerImpl.hh +++ b/src/c++/lib/alignment/GlobalLargeIndelAlignerImpl.hh @@ -23,6 +23,8 @@ #include +//#define DEBUG_ALN + #ifdef DEBUG_ALN #include "blt_util/log.hh" #include diff --git a/src/c++/lib/alignment/SingleRefAlignerShared.hh b/src/c++/lib/alignment/SingleRefAlignerShared.hh index b63f3cf2..ede1aab3 100644 --- a/src/c++/lib/alignment/SingleRefAlignerShared.hh +++ b/src/c++/lib/alignment/SingleRefAlignerShared.hh @@ -42,11 +42,13 @@ struct AlignmentResult void clear() { - score=0; + score = 0; + isJumped = false; align.clear(); } ScoreType score; + bool isJumped; ///< whether alignment path includes jump state(s) while backtracking Alignment align; }; diff --git a/src/c++/lib/alignment/SingleRefAlignerSharedImpl.hh b/src/c++/lib/alignment/SingleRefAlignerSharedImpl.hh index 319d370e..34c0d5df 100644 --- a/src/c++/lib/alignment/SingleRefAlignerSharedImpl.hh +++ b/src/c++/lib/alignment/SingleRefAlignerSharedImpl.hh @@ -21,6 +21,8 @@ #include +//#define DEBUG_ALN + #if defined(DEBUG_ALN) || defined(DEBUG_ALN_MATRIX) #include "blt_util/log.hh" #endif @@ -145,6 +147,16 @@ backTraceAlignment( { assert(false && "Unknown align state"); } + + // check if the alignment path includes JUMP or JUMPINS states + if ((btrace.state==AlignState::JUMP) || (btrace.state==AlignState::JUMPINS)) + { + result.isJumped = true; +#ifdef DEBUG_ALN + log_os << "isJumped is set true" << "\n"; +#endif + } + btrace.state=nextState; ps.length++; } diff --git a/src/c++/lib/applications/EstimateSVLoci/ESLOptions.cpp b/src/c++/lib/applications/EstimateSVLoci/ESLOptions.cpp index 2e5d202a..9e6ec0fd 100644 --- a/src/c++/lib/applications/EstimateSVLoci/ESLOptions.cpp +++ b/src/c++/lib/applications/EstimateSVLoci/ESLOptions.cpp @@ -61,7 +61,7 @@ checkStandardizeUsageFile( const char* fileLabel) { std::string errorMsg; - if ( checkStandardizeInputFile(filename, fileLabel, errorMsg)) + if (checkAndStandardizeRequiredInputFilePath(filename, fileLabel, errorMsg)) { usage(os,prog,visible,errorMsg.c_str()); } diff --git a/src/c++/lib/applications/GenerateSVCandidates/EdgeRetrieverJumpBin.hh b/src/c++/lib/applications/GenerateSVCandidates/EdgeRetrieverJumpBin.hh index c371fea6..40fcff3c 100644 --- a/src/c++/lib/applications/GenerateSVCandidates/EdgeRetrieverJumpBin.hh +++ b/src/c++/lib/applications/GenerateSVCandidates/EdgeRetrieverJumpBin.hh @@ -50,7 +50,7 @@ struct EdgeRetrieverJumpBin : public EdgeRetriever const unsigned binIndex); bool - next(); + next() override; private: void diff --git a/src/c++/lib/applications/GenerateSVCandidates/FatSVCandidate.cpp b/src/c++/lib/applications/GenerateSVCandidates/FatSVCandidate.cpp index aff27abd..1ad74851 100644 --- 
a/src/c++/lib/applications/GenerateSVCandidates/FatSVCandidate.cpp +++ b/src/c++/lib/applications/GenerateSVCandidates/FatSVCandidate.cpp @@ -19,6 +19,7 @@ /// \file /// \author Chris Saunders +/// \author Naoki Nariai /// #include "FatSVCandidate.hh" @@ -35,10 +36,14 @@ operator<<( os << static_cast(svc); for (unsigned eIndex(0); eIndex,SVEvidenceType::SIZE> evidenceIndex_t; + /// a 3d array type to track breakpoint evidence. + /// The first dimension is evidence type, + /// the second dimension is bam index with size equal to the number of input bams, and + /// the third dimension is evidence read with size equal to the number of confident-mapping observations. + /// For each observation the value provides the index of the read used as an observation, + /// which can be used to estimate signal density vs. all reads. + typedef std::array >,SVEvidenceType::SIZE> evidenceIndex_t; evidenceIndex_t bp1EvidenceIndex; evidenceIndex_t bp2EvidenceIndex; diff --git a/src/c++/lib/applications/GenerateSVCandidates/GSCOptions.cpp b/src/c++/lib/applications/GenerateSVCandidates/GSCOptions.cpp index a363a86c..7f80e787 100644 --- a/src/c++/lib/applications/GenerateSVCandidates/GSCOptions.cpp +++ b/src/c++/lib/applications/GenerateSVCandidates/GSCOptions.cpp @@ -56,7 +56,7 @@ checkStandardizeUsageFile( const char* fileLabel) { std::string errorMsg; - if ( checkStandardizeInputFile(filename, fileLabel, errorMsg)) + if (checkAndStandardizeRequiredInputFilePath(filename, fileLabel, errorMsg)) { usage(os,prog,visible,errorMsg.c_str()); } diff --git a/src/c++/lib/applications/GenerateSVCandidates/GSCOptions.hh b/src/c++/lib/applications/GenerateSVCandidates/GSCOptions.hh index e0af4bdd..a7b2ec08 100644 --- a/src/c++/lib/applications/GenerateSVCandidates/GSCOptions.hh +++ b/src/c++/lib/applications/GenerateSVCandidates/GSCOptions.hh @@ -76,7 +76,7 @@ struct GSCOptions unsigned minCandidateSpanningCount = 3; ///< how many spanning evidence observations are required to become a candidate? - unsigned minScoredVariantSize = 51; ///< min size for scoring and scored output following candidate generation + unsigned minScoredVariantSize = 50; ///< min size for scoring and scored output following candidate generation bool isOutputContig = false; ///< if true, an assembled contig is written in VCF }; diff --git a/src/c++/lib/applications/GenerateSVCandidates/SVCandidateAssemblyRefiner.cpp b/src/c++/lib/applications/GenerateSVCandidates/SVCandidateAssemblyRefiner.cpp index 18461d22..d9d38ec2 100644 --- a/src/c++/lib/applications/GenerateSVCandidates/SVCandidateAssemblyRefiner.cpp +++ b/src/c++/lib/applications/GenerateSVCandidates/SVCandidateAssemblyRefiner.cpp @@ -21,6 +21,7 @@ /// \author Ole Schulz-Trieglaff /// \author Chris Saunders /// \author Xiaoyu Chen +/// \author Naoki Nariai /// #include "SVCandidateAssemblyRefiner.hh" @@ -402,7 +403,7 @@ isLowQualitySmallSVAlignment( if (clippedPathSize < minAlignReadLength) { #ifdef DEBUG_REFINER -// log_os << "Rejecting highest scoring contig sub-alignment. isFirst?: " << isFirstRead << ". Sub-alignmnet read length after clipping is: " << clippedPathSize << " min size is: " << minAlignReadLength << "\n"; + log_os << "Rejecting highest scoring contig sub-alignment. isFirst?: " << isFirstRead << ". 
Sub-alignmnet read length after clipping is: " << clippedPathSize << " min size is: " << minAlignReadLength << "\n"; #endif return true; } @@ -416,7 +417,7 @@ isLowQualitySmallSVAlignment( #ifdef DEBUG_REFINER if (isLowQualityAlignmentScore) { -// log_os << "Rejecting highest scoring contig sub-alignment. isFirst?: " << isFirstRead << ". Fraction of optimal alignment score is: " << scoreFrac << " minScoreFrac: " << minScoreFrac << "\n"; + log_os << "Rejecting highest scoring contig sub-alignment. isFirst?: " << isFirstRead << ". Fraction of optimal alignment score is: " << scoreFrac << " minScoreFrac: " << minScoreFrac << "\n"; } #endif @@ -470,7 +471,6 @@ getQuerySeqMatchCount( /// /// \param[in] maxQCRefSpan Longest flanking sequence length considered for the high quality qc requirement /// \param[out] candidateSegments Segment indices of all qualifying SV candidates are returned in this structure -/// \param[in] onlyAcceptLargeCandidates If true, don't accept smaller indel/SV candidates (intended for denoising) /// /// \return True if a qualifying candidate variant is found in the input alignment /// @@ -483,8 +483,7 @@ findCandidateVariantsFromComplexSVContigAlignment( const std::string& contigSeq, const std::string& refSeq, const unsigned minCandidateIndelSize, - std::vector >& candidateSegments, - const bool onlyAcceptLargeCandidates) + std::vector >& candidateSegments) { using namespace ALIGNPATH; @@ -588,22 +587,19 @@ findCandidateVariantsFromComplexSVContigAlignment( if (occurrences > 1) return false; } - if (onlyAcceptLargeCandidates) + // check if an indel of at least minimum size is identified + typedef std::pair segment_t; + std::vector tmpseg(candidateSegments); + candidateSegments.clear(); + for (const segment_t& segment : tmpseg) { - // only accept large indels in this case - typedef std::pair segment_t; - std::vector tmpseg(candidateSegments); - candidateSegments.clear(); - for (const segment_t& segment : tmpseg) + for (unsigned i(segment.first); i<=segment.second; ++i) { - for (unsigned i(segment.first); i<=segment.second; ++i) + if (((apath[i].type == INSERT) && (apath[i].length >= minCandidateIndelSize)) || + ((apath[i].type == DELETE) && (apath[i].length >= minCandidateIndelSize))) { - if (((apath[i].type == INSERT) && (apath[i].length >= 80)) || - ((apath[i].type == DELETE) && (apath[i].length >= 200))) - { - candidateSegments.push_back(segment); - break; - } + candidateSegments.push_back(segment); + break; } } } @@ -1112,7 +1108,6 @@ SVCandidateAssemblyRefiner( (opt.isRNA ? 
opt.refineOpt.RNAspanningAssembleOpt : opt.refineOpt.spanningAssembleOpt), opt.alignFileOpt, opt.referenceFilename, opt.statsFilename, opt.chromDepthFilename, header, counts, opt.isRNA, edgeTracker.remoteReadRetrievalTime), - _smallSVAligner(opt.refineOpt.smallSVAlignScores), _largeSVAligner(opt.refineOpt.largeSVAlignScores,opt.refineOpt.largeGapOpenScore), _largeInsertEdgeAligner(opt.refineOpt.largeInsertEdgeAlignScores), _largeInsertCompleteAligner(opt.refineOpt.largeInsertCompleteAlignScores), @@ -1639,10 +1634,10 @@ assembleJumpContigs( // const pos_t extraRefSize(extraRefEdgeSize+extraRefSplitSize); getSVReferenceSegments( - opt.referenceFilename, header, extraRefSize, sv, - assemblyData.bp1ref, assemblyData.bp2ref, - alignData.bp1LeadingTrim, alignData.bp1TrailingTrim, - alignData.bp2LeadingTrim, alignData.bp2TrailingTrim); + opt.referenceFilename, header, extraRefSize, sv, + assemblyData.bp1ref, assemblyData.bp2ref, + alignData.bp1LeadingTrim, alignData.bp1TrailingTrim, + alignData.bp2LeadingTrim, alignData.bp2TrailingTrim); // The *Cut values below represent sequence which will be removed from the edges of the reference region for each // breakend. In most cases this will equal extraRefSplitSize. Sometimes these values are forced to be shorter @@ -1654,10 +1649,10 @@ assembleJumpContigs( // assemble contig(s) spanning the breakend: spanningAssembler.assembleSpanningSVCandidate( - sv.bp1, sv.bp2, - bporient.isBp1Reversed, bporient.isBp2Reversed, - assemblyData.bp1ref, assemblyData.bp2ref, - assemblyData.contigs); + sv.bp1, sv.bp2, + bporient.isBp1Reversed, bporient.isBp2Reversed, + assemblyData.bp1ref, assemblyData.bp2ref, + assemblyData.contigs); return true; } @@ -1960,7 +1955,7 @@ getJumpAssembly( AlignData alignData; bool isAssemblySuccess(false); isAssemblySuccess = assembleJumpContigs(_opt, sv, extraRefEdgeSize, extraRefSplitSize, - _header, _spanningAssembler, alignData, assemblyData); + _header, _spanningAssembler, alignData, assemblyData); if (! isAssemblySuccess) return; // Align candidate contigs back to reference @@ -2006,6 +2001,7 @@ struct ContigScoringInfo int score = 0; unsigned index = 0; unsigned variantSize = 0; + bool isJumped = false; }; @@ -2118,9 +2114,8 @@ getSmallSVAssembly( log_os << __FUNCTION__ << ": start aligning contigIndex: " << contigIndex << '\n'; #endif - // Run two different aligners, one optimized to find large - // events and one optimized to find smaller events, only look - // for small events if the large event search fails: + // Run a universal aligner that can find both large events (with free gap-extension after jump) + // and smaller events (without jump) SVCandidateAssemblyData::SmallAlignmentResultType& alignment(assemblyData.smallSVAlignments[contigIndex]); std::string& extendedContig(assemblyData.extendedContigs[contigIndex]); std::vector >& candidateSegments(assemblyData.smallSVSegments[contigIndex]); @@ -2182,13 +2177,11 @@ getSmallSVAssembly( BOOST_THROW_EXCEPTION(GeneralException(oss.str())); } - /// Set isSmallSVCandidate true if an indel candidate can be nominated from alignment - /// - /// \param[in] onlyAcceptLargeCandidates If true, don't accept smaller indel/SV candidates (intended for denoising) - /// - auto testIfAlignmentNominatesIndelCandidate = [&]( - const bool onlyAcceptLargeCandidates) + // Use a universal aligner to discover any small/large indels in the contig/reference alignment. 
{ + _largeSVAligner.align(contig.seq.begin(), contig.seq.end(), + align1RefStr.begin() + adjustedLeadingCut, align1RefStr.end() - adjustedTrailingCut, + alignment); alignment.align.beginPos += adjustedLeadingCut; getExtendedContig(alignment, contig.seq, align1RefStr, extendedContig); @@ -2208,8 +2201,7 @@ getSmallSVAssembly( contig.seq, align1RefStr, _opt.scanOpt.minCandidateVariantSize, - segments, - onlyAcceptLargeCandidates) ); + segments) ); if (isCandidate) { @@ -2218,42 +2210,18 @@ getSmallSVAssembly( { candidateSegments = segments; } + // Set isSmallSVCandidate true if an indel candidate can be nominated from alignment isSmallSVCandidate=true; } } #ifdef DEBUG_REFINER - const std::string alignerSize(onlyAcceptLargeCandidates ? "large" : "small"); - log_os << __FUNCTION__ << ": finished " << alignerSize << " aligner. contigIndex: " << contigIndex + log_os << __FUNCTION__ << ": finished aligner. contigIndex: " << contigIndex << " isSmallSVCandidate: " << isSmallSVCandidate << " alignment: " << alignment; #endif - }; - - // Start with largeSV aligner to discover any large indels in the contig/reference alignment. - { - _largeSVAligner.align( - contig.seq.begin(), contig.seq.end(), - align1RefStr.begin() + adjustedLeadingCut, align1RefStr.end() - adjustedTrailingCut, - alignment); - - testIfAlignmentNominatesIndelCandidate(true); - } - - - // If no candidate is found by the large indel aligner, try again with the small indel aligner. - /// TODO: get rid of this step - if (! isSmallSVCandidate) - { - _smallSVAligner.align( - contig.seq.begin(), contig.seq.end(), - align1RefStr.begin() + adjustedLeadingCut, align1RefStr.end() - adjustedTrailingCut, - alignment); - - testIfAlignmentNominatesIndelCandidate(false); } - // Test each alignment for suitability to be the left or right side of a large insertion. // // All practical combinations of left and right candidates will be enumerated below to see @@ -2318,12 +2286,26 @@ getSmallSVAssembly( contigInfo.index = contigIndex; contigInfo.score = alignment.score; contigInfo.variantSize = getLargestIndelSize(alignment.align.apath, candidateSegments); + contigInfo.isJumped = alignment.isJumped; }; // keep the top two highest scoring QC'd candidate: // TODO: we should keep all QC'd candidates for the small event case // FIXME : prevents us from finding overlapping events, keep vector of high-scoring contigs? - if ((! rank1Contig.isDefined) || (alignment.score > rank1Contig.score)) + + // The current contig is selected if any of the following criteria is met: + // (1) no contig is selected yet + // (2) the current contig contains JUMP/JUMPINS state, but rank1Contig does not + // (3) both (a) and (b) are true: + // (a) if both contigs contain JUMP/JUMPINS state (bothJumped), or + // if both contigs do not contain JUMP/JUMPINS state (bothNotJumped) + // (b) score of the current contig is higher than that of the rank1Contig + const bool bothJumped = alignment.isJumped && rank1Contig.isJumped; + const bool bothNotJumped = (! alignment.isJumped) && (! rank1Contig.isJumped); + + if ((! rank1Contig.isDefined) || + (alignment.isJumped && (! 
rank1Contig.isJumped)) || + (((bothJumped || bothNotJumped) && (alignment.score > rank1Contig.score)))) { if (rank1Contig.isDefined) { @@ -2361,27 +2343,46 @@ getSmallSVAssembly( #ifdef DEBUG_REFINER log_os << __FUNCTION__ << ": contig #" << rank1Contig.index << " has " << rank1ContigSupportReadCount - <<" support reads, with max variant size " << rank1Contig.variantSize << "\n"; + << " support reads, with max variant size " << rank1Contig.variantSize + << ", contains JUMP/JUMPINS state " << rank1Contig.isJumped << "\n"; log_os << __FUNCTION__ << ": contig #" << rank2Contig.index << " has " << rank2ContigSupportReadCount - <<" support reads, with max variant size " << rank2Contig.variantSize << "\n"; + << " support reads, with max variant size " << rank2Contig.variantSize + << ", contains JUMP/JUMPINS state " << rank2Contig.isJumped << "\n"; #endif - // The second best contig is selected if: - // (1) score2/score1 is higher than minScoreRatio - // (2) either (2a) or (2b) is true: - // (2a) supportReadCount2/supportReadCount1 is higher than minSupportReadCountRatio - // (2b) variantSize2/variantSize1 is higher than minVariantSizeRatio - // static const float minScoreRatio(0.9f); static const float minSupportReadCountRatio(1.2f); static const float minVariantSizeRatio(1.1f); - const bool rank2IsBest((rank2Contig.score > (rank1Contig.score*minScoreRatio)) && - ((rank2ContigSupportReadCount > (rank1ContigSupportReadCount*minSupportReadCountRatio)) || - (rank2Contig.variantSize > (rank1Contig.variantSize*minVariantSizeRatio)))); - if (rank2IsBest) + + // rank1Contig will be selected, if rank1Contig contains JUMP/JUMPINS state + // and rank2Contig does not + const bool rank1IsSelected((rank1Contig.isJumped && (! rank2Contig.isJumped))); + if (! rank1IsSelected) { - rank1Contig = rank2Contig; + // At this point, only possibilities are: + // both contigs contain JUMP/JUMPINS state OR + // both contigs do not contain JUMP/JUMPINS state + // + // The reason is that the previous contig selection process already + // eliminates the possibility in which + // rank1Contig does NOT contain a jump state while rank2Contig does + // + // The second best contig is selected if both (1) and (2) are true: + // (1) score2/score1 is higher than minScoreRatio + // (2) either (a) or (b) is true: + // (a) supportReadCount2/supportReadCount1 is higher than minSupportReadCountRatio + // (b) variantSize2/variantSize1 is higher than minVariantSizeRatio + // + const bool rank2IsBest((rank2Contig.score > (rank1Contig.score*minScoreRatio)) && + ((rank2ContigSupportReadCount > (rank1ContigSupportReadCount*minSupportReadCountRatio)) || + (rank2Contig.variantSize > (rank1Contig.variantSize*minVariantSizeRatio)))); + + if (rank2IsBest) + { + rank1Contig = rank2Contig; + } } + #ifdef DEBUG_REFINER log_os << __FUNCTION__ << ": contigIndex: " << rank1Contig.index << " is finally selected.\n"; #endif diff --git a/src/c++/lib/applications/GenerateSVCandidates/SVCandidateAssemblyRefiner.hh b/src/c++/lib/applications/GenerateSVCandidates/SVCandidateAssemblyRefiner.hh index 2743e162..7ab4518c 100644 --- a/src/c++/lib/applications/GenerateSVCandidates/SVCandidateAssemblyRefiner.hh +++ b/src/c++/lib/applications/GenerateSVCandidates/SVCandidateAssemblyRefiner.hh @@ -98,10 +98,7 @@ private: /// Assembler parameterized for assembly of a 'spanning' SV candidate spanning two regions const SVCandidateAssembler _spanningAssembler; - /// Aligner optimized to discover a small indel from a single SV candidate region - const GlobalAligner 
_smallSVAligner; - - /// Aligner optimized to discover a single large indel from a single SV candidate region + /// Aligner optimized to discover a single small/large indel from a single SV candidate region const GlobalLargeIndelAligner _largeSVAligner; const GlobalAligner _largeInsertEdgeAligner; diff --git a/src/c++/lib/applications/GenerateSVCandidates/SVFinder.cpp b/src/c++/lib/applications/GenerateSVCandidates/SVFinder.cpp index 03b3b2e3..1376ff20 100644 --- a/src/c++/lib/applications/GenerateSVCandidates/SVFinder.cpp +++ b/src/c++/lib/applications/GenerateSVCandidates/SVFinder.cpp @@ -19,6 +19,7 @@ /// \file /// \author Chris Saunders +/// \author Naoki Nariai /// #include "SVFinder.hh" @@ -123,16 +124,8 @@ SVFinder( const AllSampleReadCounts& counts(getSet().getAllSampleReadCounts()); for (unsigned bamIndex(0); bamIndex& readCandidates, const bool isExpandSVCandidateSet, SVCandidateSetSequenceFragment& fragment, - std::vector& svs) + std::vector& svs, + const unsigned bamIndex) { + const unsigned bamCount(_bamStreams.size()); + // we anticipate so few svs from the POC method, that there's no indexing on them for (const SVObservation& readCand : readCandidates) { @@ -773,10 +761,10 @@ assignFragmentObservationsToSVCandidates( fragment.svLink.emplace_back(svIndex,readCand.svEvidenceType); } - updateEvidenceIndex(fragment,readCand,sv); + updateEvidenceIndex(fragment, readCand, sv, bamIndex); // check evidence distance: - sv.merge(FatSVCandidate(readCand), isExpandSVCandidateSet); + sv.merge(FatSVCandidate(readCand, bamCount), isExpandSVCandidateSet); #ifdef DEBUG_SVDATA log_os << __FUNCTION__ << ": Added to svIndex: " << svIndex << " match_sv: " << sv << "\n"; @@ -798,7 +786,7 @@ assignFragmentObservationsToSVCandidates( log_os << __FUNCTION__ << ": New svIndex: " << newSVIndex << "\n"; #endif - svs.push_back(FatSVCandidate(readCand)); + svs.push_back(FatSVCandidate(readCand, bamCount)); svs.back().candidateIndex = newSVIndex; if (isSpanningCand) @@ -806,7 +794,7 @@ assignFragmentObservationsToSVCandidates( // ditto note above, store fragment association only when there's an SV hypothesis: fragment.svLink.emplace_back(newSVIndex,readCand.svEvidenceType); } - updateEvidenceIndex(fragment,readCand,svs.back()); + updateEvidenceIndex(fragment, readCand, svs.back(), bamIndex); } } } @@ -819,8 +807,8 @@ processSequenceFragment( const SVLocusNode& node1, const SVLocusNode& node2, const bam_header_info& bamHeader, - const reference_contig_segment& refSeq1, - const reference_contig_segment& refSeq2, + const reference_contig_segment& node1RefSeq, + const reference_contig_segment& node2RefSeq, const unsigned bamIndex, const bool isExpandSVCandidateSet, std::vector& svs, @@ -852,17 +840,19 @@ processSequenceFragment( const bam_record* remoteBamRecPtr( remoteReadPtr->isSet() ? &(remoteReadPtr->bamrec) : nullptr); - const reference_contig_segment& localRef( localReadPtr->isNode1 ? refSeq1 : refSeq2 ); + const reference_contig_segment& localRef( localReadPtr->isSourcedFromGraphEdgeNode1 ? node1RefSeq : node2RefSeq ); const reference_contig_segment* remoteRefPtr(nullptr); if (remoteReadPtr->isSet()) { - remoteRefPtr = (remoteReadPtr->isNode1 ? &refSeq1 : &refSeq2 ); + remoteRefPtr = (remoteReadPtr->isSourcedFromGraphEdgeNode1 ? 
&node1RefSeq : &node2RefSeq ); } _readScanner.getBreakendPair(localReadPtr->bamrec, remoteBamRecPtr, bamIndex, bamHeader, localRef, remoteRefPtr, _readCandidates); - // collapse close spanning sv candidates into complex candidates -- this reflects the fact that the + // Label all SVObservations with the sample index: + + // Collapse close spanning sv candidates into complex candidates -- this reflects the fact that the // assembler will collapse them anyway, so reduces duplicated work in the assembler; for (SVObservation& cand : _readCandidates) { @@ -899,7 +889,7 @@ processSequenceFragment( log_os << __FUNCTION__ << ": cand: " << cand << "\n"; } #endif - assignFragmentObservationsToSVCandidates(node1, node2, _readCandidates, isExpandSVCandidateSet, fragment, svs); + assignFragmentObservationsToSVCandidates(node1, node2, _readCandidates, isExpandSVCandidateSet, fragment, svs, bamIndex); } @@ -1020,7 +1010,7 @@ isBreakPointSignificant( -/// Test a spanning candidate for minimum supporting evidence level prior +/// \brief Test a spanning candidate for minimum supporting evidence level prior /// to assembly and scoring stages /// /// Note this test is applied early, and as such it is intended to only filter @@ -1032,14 +1022,15 @@ static bool isSpanningCandidateSignalSignificant( const double noiseRate, - const FatSVCandidate& sv) + const FatSVCandidate& sv, + const unsigned bamIndex) { std::vector evidence_bp1; std::vector evidence_bp2; for (unsigned evidenceTypeIndex(0); evidenceTypeIndex evidence; for (unsigned i(0); i& spanningNoiseRate) +{ + for (unsigned bamIndex(0); bamIndex& assemblyNoiseRate) +{ + for (unsigned bamIndex(0); bamIndex < bamCount; ++bamIndex) + { + if (isComplexCandidateSignalSignificant(assemblyNoiseRate[bamIndex], sv, bamIndex)) return true; + } + return false; +} + + + +/// Return enumerator value describing the candidate's filtration state, based on +/// information available in a single junction only (as opposed to /// requiring multi-junction analysis) /// static -SINGLE_FILTER::index_t +SINGLE_JUNCTION_FILTER::index_t isFilterSingleJunctionCandidate( const bool isRNA, - const double spanningNoiseRate, - const double assemblyNoiseRate, - const FatSVCandidate& sv) + const std::vector& spanningNoiseRate, + const std::vector& assemblyNoiseRate, + const FatSVCandidate& sv, + const unsigned bamCount) { - using namespace SINGLE_FILTER; + using namespace SINGLE_JUNCTION_FILTER; // don't consider candidates created from only // semi-mapped read pairs (ie. one read of the pair is MAPQ0 or MAPQsmall) - if (sv.bp1.isLocalOnly() && sv.bp2.isLocalOnly()) return SEMIMAPPED; + if (sv.bp1.isLocalOnly() && sv.bp2.isLocalOnly()) return SEMI_MAPPED; // candidates must have a minimum amount of evidence: if (isSpanningSV(sv)) { - /// TODO make sensitivity adjustments for RNA here: + // TODO make sensitivity adjustments for RNA here: if (! isRNA) { - if (! isSpanningCandidateSignalSignificant(spanningNoiseRate, sv)) return SPANNINGLOWSIGNAL; + if (! isAnySpanningCandidateSignalSignificant(bamCount, sv, spanningNoiseRate)) return SPANNING_LOW_SIGNAL; } } else if (isComplexSV(sv)) { - if (! isCandidateCountSufficient(sv)) return COMPLEXLOWCOUNT; - if (! isComplexCandidateSignalSignificant(assemblyNoiseRate, sv)) return COMPLEXLOWSIGNAL; + if (! isCandidateCountSufficient(sv)) return COMPLEX_LOW_COUNT; + if (! 
isAnyComplexCandidateSignalSignificant(bamCount, sv, assemblyNoiseRate)) return COMPLEX_LOW_SIGNAL; } else { @@ -1134,34 +1160,35 @@ static void filterCandidates( const bool isRNA, - const double spanningNoiseRate, - const double assemblyNoiseRate, + const std::vector& spanningNoiseRate, + const std::vector& assemblyNoiseRate, std::vector& svs, - SVFinderStats& stats) + SVFinderStats& stats, + const unsigned bamCount) { unsigned svCount(svs.size()); unsigned index(0); while (index& output_svs, SVFinderStats& stats) @@ -1211,7 +1238,7 @@ getCandidatesFromData( static const bool isAnchored(true); processSequenceFragment( - node1, node2, bamHeader, refSeq1, refSeq2, bamIndex, isAnchored, + node1, node2, bamHeader, node1RefSeq, node2RefSeq, bamIndex, isAnchored, svs, fragment, stats); } } @@ -1231,7 +1258,7 @@ getCandidatesFromData( static const bool isAnchored(false); processSequenceFragment( - node1, node2, bamHeader, refSeq1, refSeq2, bamIndex, isAnchored, + node1, node2, bamHeader, node1RefSeq, node2RefSeq, bamIndex, isAnchored, svs, pair, stats); } } @@ -1265,10 +1292,13 @@ getCandidatesFromData( } #endif - assert((_spanningNoiseRate >= 0.) && (_spanningNoiseRate <= 1.)); - assert((_assemblyNoiseRate >= 0.) && (_assemblyNoiseRate <= 1.)); + for (unsigned bamIndex(0); bamIndex= 0.) && (_spanningNoiseRate[bamIndex] <= 1.)); + assert((_assemblyNoiseRate[bamIndex] >= 0.) && (_assemblyNoiseRate[bamIndex] <= 1.)); + } - filterCandidates(_isRNA, _spanningNoiseRate, _assemblyNoiseRate, svs, stats); + filterCandidates(_isRNA, _spanningNoiseRate, _assemblyNoiseRate, svs, stats, bamCount); std::copy(svs.begin(),svs.end(),std::back_inserter(output_svs)); } @@ -1304,7 +1334,6 @@ findCandidateSVImpl( return; } - // // 1) scan through each region to identify all reads supporting // some sort of breakend in the target region, then match up read // pairs so that they can easily be accessed from each other @@ -1316,26 +1345,26 @@ findCandidateSVImpl( const SVLocus& locus(cset.getLocus(edge.locusIndex)); - reference_contig_segment refSeq1; - reference_contig_segment refSeq2; + reference_contig_segment node1RefSeq; + reference_contig_segment node2RefSeq; { GenomeInterval searchInterval; - getNodeRefSeq(bamHeader, locus, edge.nodeIndex1, _referenceFilename, searchInterval, refSeq1); + getNodeRefSeq(bamHeader, locus, edge.nodeIndex1, _referenceFilename, searchInterval, node1RefSeq); addSVNodeData(bamHeader, locus, edge.nodeIndex1, edge.nodeIndex2, - searchInterval, refSeq1, true, svData); + searchInterval, node1RefSeq, true, svData); } if (edge.nodeIndex1 != edge.nodeIndex2) { GenomeInterval searchInterval; - getNodeRefSeq(bamHeader, locus, edge.nodeIndex2, _referenceFilename, searchInterval, refSeq2); + getNodeRefSeq(bamHeader, locus, edge.nodeIndex2, _referenceFilename, searchInterval, node2RefSeq); addSVNodeData(bamHeader, locus, edge.nodeIndex2, edge.nodeIndex1, - searchInterval, refSeq2, false, svData); + searchInterval, node2RefSeq, false, svData); } const SVLocusNode& node1(locus.getNode(edge.nodeIndex1)); const SVLocusNode& node2(locus.getNode(edge.nodeIndex2)); - getCandidatesFromData(node1, node2, bamHeader, refSeq1, refSeq2, + getCandidatesFromData(node1, node2, bamHeader, node1RefSeq, node2RefSeq, svData, svs, stats); //checkResult(svData,svs); diff --git a/src/c++/lib/applications/GenerateSVCandidates/SVFinder.hh b/src/c++/lib/applications/GenerateSVCandidates/SVFinder.hh index 550c3f40..e0d93f3e 100644 --- a/src/c++/lib/applications/GenerateSVCandidates/SVFinder.hh +++ 
b/src/c++/lib/applications/GenerateSVCandidates/SVFinder.hh @@ -19,6 +19,7 @@ /// \file /// \author Chris Saunders +/// \author Naoki Nariai /// #pragma once @@ -67,17 +68,35 @@ struct SVFinder private: + /// Given a directed edge from the SV locus graph, process and cache data required for SV discovery + /// from the source (local) node of that edge. + /// + /// \param[in] locus Graph locus containing the input edge + /// \param[in] localNodeIndex Source (local) node of the input directed graph edge + /// \param[in] remoteNodeIndex Sink (remote) node of the input directed graph edge + /// \param[in] isLocalNodeGraphEdgeNode1 True if localNodeIndex is nodeIndex1 in the graph void addSVNodeData( const bam_header_info& bamHeader, const SVLocus& locus, - const NodeIndexType node1, - const NodeIndexType node2, + const NodeIndexType localNodeIndex, + const NodeIndexType remoteNodeIndex, const GenomeInterval& searchInterval, - const reference_contig_segment& refSeq, - const bool isNode1, + const reference_contig_segment& localNodeRefSeq, + const bool isLocalNodeGraphEdgeNode1, SVCandidateSetData& svData); + /// readCandidates are the set of hypotheses generated by individual read pair -- + /// this is the read pair which we seek to assign to one of the identified SVs (in svs) + /// or we push the candidate into svs to start a new candidate associated with this edge + /// + /// this is meant as only a temporary form of hypothesis generation, in the current system + /// we do at least delineate alternative candidates by strand and region overlap, but over + /// the longer term we should be able to delineate cluster by a clustering of possible + /// breakend locations. + /// + /// \param[in] isExpandSVCandidateSet if false, don't add new SVs or expand existing SVs + /// void assignFragmentObservationsToSVCandidates( const SVLocusNode& node1, @@ -85,18 +104,19 @@ private: const std::vector& readCandidates, const bool isExpandSVCandidateSet, SVCandidateSetSequenceFragment& fragment, - std::vector& svs); + std::vector& svs, + const unsigned bamIndex); - /// we either process the fragment to discover new SVs and expand existing SVs, - /// or we go through and add pairs to existing SVs without expansion + /// \brief Either process the fragment to discover new SVs and expand existing SVs,or + /// go through and add pairs to existing SVs without expansion /// void processSequenceFragment( const SVLocusNode& node1, const SVLocusNode& node2, const bam_header_info& bamHeader, - const reference_contig_segment& refSeq1, - const reference_contig_segment& refSeq2, + const reference_contig_segment& node1RefSeq, + const reference_contig_segment& node2RefSeq, const unsigned bamIndex, const bool isExpandSVCandidateSet, std::vector& svs, @@ -108,8 +128,8 @@ private: const SVLocusNode& node1, const SVLocusNode& node2, const bam_header_info& bamHeader, - const reference_contig_segment& refSeq1, - const reference_contig_segment& refSeq2, + const reference_contig_segment& node1RefSeq, + const reference_contig_segment& node2RefSeq, SVCandidateSetData& svData, std::vector& svs, SVFinderStats& stats); @@ -127,6 +147,8 @@ private: return *(_dFilterPtr); } + + const ReadScannerOptions _scanOpt; const std::vector _isAlignmentTumor; SVLocusSet _set; @@ -148,13 +170,13 @@ private: /// throwaway stats tracker... 
SampleEvidenceCounts _eCounts; - /// rate of spanning read noise estimated from the current dataset + /// rate of spanning read noise estimated from the current dataset for each sample /// - estimate is roughly (anom + split)/all - double _spanningNoiseRate; + std::vector _spanningNoiseRate; - /// rate of assembly read noise estimated from the current dataset + /// rate of assembly read noise estimated from the current dataset for each sample /// - assembly read noise means reads with edges which have high mismatch density or soft-clipping - double _assemblyNoiseRate; + std::vector _assemblyNoiseRate; EdgeRuntimeTracker& _edgeTracker; GSCEdgeStatsManager& _edgeStatMan; diff --git a/src/c++/lib/applications/GenerateSVCandidates/SVScorePairAltProcessor.hh b/src/c++/lib/applications/GenerateSVCandidates/SVScorePairAltProcessor.hh index 22f2f3b6..d8e2e264 100644 --- a/src/c++/lib/applications/GenerateSVCandidates/SVScorePairAltProcessor.hh +++ b/src/c++/lib/applications/GenerateSVCandidates/SVScorePairAltProcessor.hh @@ -77,12 +77,11 @@ struct SVScorePairAltProcessor : public SVScorePairProcessor /// what to skip in addition to the core skip test? /// - /// override to allow for shadow and chimera re-maps for large insertions: + /// This override accounts for shadow and chimera re-maps for large insertions: /// - virtual bool isSkipRecord( - const bam_record& bamRead) + const bam_record& bamRead) override { if (! isLargeInsertSV(sv)) return SVScorePairProcessor::isSkipRecord(bamRead); @@ -95,7 +94,7 @@ struct SVScorePairAltProcessor : public SVScorePairProcessor processClearedRecord( const SVId& svId, const bam_record& bamRead, - SupportFragments& svSupportFrags); + SupportFragments& svSupportFrags) override; private: static diff --git a/src/c++/lib/applications/GenerateSVCandidates/SVScorePairRefProcessor.hh b/src/c++/lib/applications/GenerateSVCandidates/SVScorePairRefProcessor.hh index 55ceb820..1f762196 100644 --- a/src/c++/lib/applications/GenerateSVCandidates/SVScorePairRefProcessor.hh +++ b/src/c++/lib/applications/GenerateSVCandidates/SVScorePairRefProcessor.hh @@ -42,5 +42,5 @@ struct SVScorePairRefProcessor : public SVScorePairProcessor processClearedRecord( const SVId& svId, const bam_record& bamRead, - SupportFragments& svSupportFrags); + SupportFragments& svSupportFrags) override; }; diff --git a/src/c++/lib/applications/GenerateSVCandidates/SVScorer.cpp b/src/c++/lib/applications/GenerateSVCandidates/SVScorer.cpp index de19303c..b43e6c62 100644 --- a/src/c++/lib/applications/GenerateSVCandidates/SVScorer.cpp +++ b/src/c++/lib/applications/GenerateSVCandidates/SVScorer.cpp @@ -1150,26 +1150,24 @@ scoreDiploidSV( } // add sample specific filters - bool isAllMinGTFiltered(true); + bool isAllSampleFiltered(true); for (unsigned sampleIndex(0); sampleIndex0) && isAllMinGTFiltered) - { - diploidInfo.filters.insert(diploidOpt.minGTFilterLabel); + if (diploidSampleInfo.filters.empty()) + isAllSampleFiltered = false; } + // If no sample passes all sample-specific filters, apply sample FT filter at the record level + if (isAllSampleFiltered) + diploidInfo.filters.insert(diploidOpt.failedSampleFTLabel); + const unsigned junctionCount(junctionData.size()); // apply high depth filter: diff --git a/src/c++/lib/applications/GetAlignmentStats/AlignmentStatsOptions.cpp b/src/c++/lib/applications/GetAlignmentStats/AlignmentStatsOptions.cpp index 52c000ab..16906970 100644 --- a/src/c++/lib/applications/GetAlignmentStats/AlignmentStatsOptions.cpp +++ 
b/src/c++/lib/applications/GetAlignmentStats/AlignmentStatsOptions.cpp @@ -58,7 +58,7 @@ parseOptions( std::string& errorMsg) { if (parseOptions(vm, opt.alignFileOpt, errorMsg)) return true; - if (checkStandardizeInputFile(opt.referenceFilename, "reference fasta", errorMsg)) return true; + if (checkAndStandardizeRequiredInputFilePath(opt.referenceFilename, "reference fasta", errorMsg)) return true; return false; } diff --git a/src/c++/lib/applications/GetChromDepth/ChromDepthOptions.cpp b/src/c++/lib/applications/GetChromDepth/ChromDepthOptions.cpp index c56352eb..94b8ab17 100644 --- a/src/c++/lib/applications/GetChromDepth/ChromDepthOptions.cpp +++ b/src/c++/lib/applications/GetChromDepth/ChromDepthOptions.cpp @@ -17,8 +17,6 @@ // // -/// \file - #include "ChromDepthOptions.hh" #include "blt_util/log.hh" @@ -58,8 +56,8 @@ parseOptions( ChromDepthOptions& opt, std::string& errorMsg) { - if (checkStandardizeInputFile(opt.alignmentFilename, "alignment", errorMsg)) return true; - if (checkStandardizeInputFile(opt.referenceFilename, "reference fasta", errorMsg)) return true; + if (checkAndStandardizeRequiredInputFilePath(opt.alignmentFilename, "alignment", errorMsg)) return true; + if (checkAndStandardizeRequiredInputFilePath(opt.referenceFilename, "reference fasta", errorMsg)) return true; if (vm.count("chrom")) { diff --git a/src/c++/lib/applications/MergeAlignmentStats/MergeAlignmentStats.hh b/src/c++/lib/applications/MergeAlignmentStats/MergeAlignmentStats.hh index c434f2c5..3737aa5d 100644 --- a/src/c++/lib/applications/MergeAlignmentStats/MergeAlignmentStats.hh +++ b/src/c++/lib/applications/MergeAlignmentStats/MergeAlignmentStats.hh @@ -27,11 +27,11 @@ struct MergeAlignmentStats : public illumina::Program { const char* - name() const + name() const override { return "MergeAlignmentStats"; } void - runInternal(int argc, char* argv[]) const; + runInternal(int argc, char* argv[]) const override; }; diff --git a/src/c++/lib/applications/MergeAlignmentStats/MergeAlignmentStatsOptions.cpp b/src/c++/lib/applications/MergeAlignmentStats/MergeAlignmentStatsOptions.cpp index f343e73f..64f2f978 100644 --- a/src/c++/lib/applications/MergeAlignmentStats/MergeAlignmentStatsOptions.cpp +++ b/src/c++/lib/applications/MergeAlignmentStats/MergeAlignmentStatsOptions.cpp @@ -114,7 +114,7 @@ parseMergeAlignmentStatsOptions( std::set nameCheck; for (std::string& afile : opt.statsFiles) { - if (checkStandardizeInputFile(afile,"alignment stats file",errorMsg)) break; + if (checkAndStandardizeRequiredInputFilePath(afile, "alignment stats file", errorMsg)) break; if (nameCheck.count(afile)) { std::ostringstream oss; diff --git a/src/c++/lib/applications/SummarizeAlignmentStats/SASOptions.cpp b/src/c++/lib/applications/SummarizeAlignmentStats/SASOptions.cpp index 47d7e56b..404a48c9 100644 --- a/src/c++/lib/applications/SummarizeAlignmentStats/SASOptions.cpp +++ b/src/c++/lib/applications/SummarizeAlignmentStats/SASOptions.cpp @@ -56,7 +56,7 @@ checkStandardizeUsageFile( const char* fileLabel) { std::string errorMsg; - if ( checkStandardizeInputFile(filename, fileLabel, errorMsg)) + if (checkAndStandardizeRequiredInputFilePath(filename, fileLabel, errorMsg)) { usage(os,prog,visible,errorMsg.c_str()); } diff --git a/src/c++/lib/applications/SummarizeAlignmentStats/SummarizeAlignmentStats.hh b/src/c++/lib/applications/SummarizeAlignmentStats/SummarizeAlignmentStats.hh index b8c38ed1..c51fd99d 100644 --- a/src/c++/lib/applications/SummarizeAlignmentStats/SummarizeAlignmentStats.hh +++ 
b/src/c++/lib/applications/SummarizeAlignmentStats/SummarizeAlignmentStats.hh @@ -27,11 +27,11 @@ struct SummarizeAlignmentStats : public illumina::Program { const char* - name() const + name() const override { return "SummarizeAlignmentStats"; } void - runInternal(int argc, char* argv[]) const; + runInternal(int argc, char* argv[]) const override; }; diff --git a/src/c++/lib/applications/SummarizeSVLoci/SummarizeSVLoci.hh b/src/c++/lib/applications/SummarizeSVLoci/SummarizeSVLoci.hh index 7eec2f7e..c0859d95 100644 --- a/src/c++/lib/applications/SummarizeSVLoci/SummarizeSVLoci.hh +++ b/src/c++/lib/applications/SummarizeSVLoci/SummarizeSVLoci.hh @@ -27,11 +27,11 @@ struct SummarizeSVLoci : public illumina::Program { const char* - name() const + name() const override { return "SummarizeSVLoci"; } void - runInternal(int argc, char* argv[]) const; + runInternal(int argc, char* argv[]) const override; }; diff --git a/src/c++/lib/applications/TestAssembler/TestAssemblerOptions.cpp b/src/c++/lib/applications/TestAssembler/TestAssemblerOptions.cpp index 16c9650e..4686af8f 100644 --- a/src/c++/lib/applications/TestAssembler/TestAssemblerOptions.cpp +++ b/src/c++/lib/applications/TestAssembler/TestAssemblerOptions.cpp @@ -58,7 +58,7 @@ parseOptions( std::string& errorMsg) { if (parseOptions(vm, opt.alignFileOpt, errorMsg)) return true; - if (checkStandardizeInputFile(opt.referenceFilename, "reference fasta", errorMsg)) return true; + if (checkAndStandardizeRequiredInputFilePath(opt.referenceFilename, "reference fasta", errorMsg)) return true; return false; } diff --git a/src/c++/lib/blt_util/compat_util.cpp b/src/c++/lib/blt_util/compat_util.cpp index 032cdc00..c3472932 100644 --- a/src/c++/lib/blt_util/compat_util.cpp +++ b/src/c++/lib/blt_util/compat_util.cpp @@ -27,28 +27,6 @@ #include -#ifdef _MSC_VER -#include "compat_util_win32_realpath.c" -#endif - - - -bool -compat_realpath(std::string& path) -{ - errno=0; - const char* newpath(realpath(path.c_str(),nullptr)); - if ((nullptr==newpath) || (errno!=0)) - { - if (nullptr!=newpath) free((void*)newpath); - return false; - } - path = newpath; - free((void*)newpath); - return true; -} - - double compat_round(const double x) diff --git a/src/c++/lib/blt_util/compat_util.hh b/src/c++/lib/blt_util/compat_util.hh index 71e51623..29e471c3 100644 --- a/src/c++/lib/blt_util/compat_util.hh +++ b/src/c++/lib/blt_util/compat_util.hh @@ -42,9 +42,3 @@ compat_round(const double x); const char* compat_basename(const char* s); - - -// gets canonical name of paths, but only when these refer to existing items -// returns false on error. -bool -compat_realpath(std::string& path); diff --git a/src/c++/lib/blt_util/compat_util_win32_realpath.c b/src/c++/lib/blt_util/compat_util_win32_realpath.c deleted file mode 100644 index 2d25f25c..00000000 --- a/src/c++/lib/blt_util/compat_util_win32_realpath.c +++ /dev/null @@ -1,95 +0,0 @@ -/* realpath.c - * $Id$ - * - * Provides an implementation of the "realpath" function, conforming - * approximately to SUSv3, and adapted for use on native Microsoft(R) - * Win32 platforms. - * - * Written by Keith Marshall - * - * This is free software. You may redistribute and/or modify it as you - * see fit, without restriction of copyright. - * - * This software is provided "as is", in the hope that it may be useful, - * but WITHOUT WARRANTY OF ANY KIND, not even any implied warranty of - * MERCHANTABILITY, nor of FITNESS FOR ANY PARTICULAR PURPOSE. 
At no - * time will the author accept any form of liability for any damages, - * however caused, resulting from the use of this software. - * - */ - -#ifdef _WIN32 -#include -#include -#include - -char* __cdecl -realpath( const char * name, char * resolved ) -{ - char *retname = NULL; /* we will return this, if we fail */ - - /* SUSv3 says we must set `errno = EINVAL', and return NULL, - * if `name' is passed as a NULL pointer. - */ - - if( name == NULL ) - errno = EINVAL; - - /* Otherwise, `name' must refer to a readable filesystem object, - * if we are going to resolve its absolute path name. - */ - - else if( _access( name, 4 ) == 0 ) - { - /* If `name' didn't point to an existing entity, - * then we don't get to here; we simply fall past this block, - * returning NULL, with `errno' appropriately set by `access'. - * - * When we _do_ get to here, then we can use `_fullpath' to - * resolve the full path for `name' into `resolved', but first, - * check that we have a suitable buffer, in which to return it. - */ - - if( (retname = resolved) == NULL ) - { - /* Caller didn't give us a buffer, so we'll exercise the - * option granted by SUSv3, and allocate one. - * - * `_fullpath' would do this for us, but it uses `malloc', and - * Microsoft's implementation doesn't set `errno' on failure. - * If we don't do this explicitly ourselves, then we will not - * know if `_fullpath' fails on `malloc' failure, or for some - * other reason, and we want to set `errno = ENOMEM' for the - * `malloc' failure case. - */ - - retname =(char*) malloc( _MAX_PATH ); - } - - /* By now, we should have a valid buffer. - * If we don't, then we know that `malloc' failed, - * so we can set `errno = ENOMEM' appropriately. - */ - - if( retname == NULL ) - errno = ENOMEM; - - /* Otherwise, when we do have a valid buffer, - * `_fullpath' should only fail if the path name is too long. - */ - - else if( (retname = _fullpath( retname, name, _MAX_PATH )) == NULL ) - errno = ENAMETOOLONG; - } - - /* By the time we get to here, - * `retname' either points to the required resolved path name, - * or it is NULL, with `errno' set appropriately, either of which - * is our required return condition. - */ - - return retname; -} - -/* $RCSfile$: end of file */ -#endif diff --git a/src/c++/lib/format/VcfWriterDiploidSV.cpp b/src/c++/lib/format/VcfWriterDiploidSV.cpp index d223a6b5..4bd9e3dc 100644 --- a/src/c++/lib/format/VcfWriterDiploidSV.cpp +++ b/src/c++/lib/format/VcfWriterDiploidSV.cpp @@ -61,7 +61,9 @@ addHeaderFilters() const _os << "##FILTER=\n"; _os << "##FILTER=\n"; _os << "##FILTER=\n"; - _os << "##FILTER=\n"; + _os << "##FILTER=\n"; + _os << "##FILTER=\n"; + _os << "##FILTER=\n"; } diff --git a/src/c++/lib/format/VcfWriterSV.cpp b/src/c++/lib/format/VcfWriterSV.cpp index 3595c69e..deda47ab 100644 --- a/src/c++/lib/format/VcfWriterSV.cpp +++ b/src/c++/lib/format/VcfWriterSV.cpp @@ -96,8 +96,6 @@ writeHeaderPrefix( _os << "##INFO=\n"; _os << "##INFO=\n"; _os << "##INFO=\n"; - _os << "##INFO=\n"; - _os << "##INFO=\n"; // if "--outputContig" is specified, then print out INFO tag for Assembled contig sequence if (_isOutputContig) @@ -111,7 +109,6 @@ writeHeaderPrefix( addHeaderFilters(); - _os << "##ALT=\n"; _os << "##ALT=\n"; _os << "##ALT=\n"; _os << "##ALT=\n"; @@ -333,8 +330,8 @@ writeTransloc( const SVBreakend& bpA( isFirstBreakend ? sv.bp1 : sv.bp2); const SVBreakend& bpB( isFirstBreakend ? 
sv.bp2 : sv.bp1); - InfoTag_t infotags; - SampleTag_t sampletags; + InfoTag_t infoTags; + SampleTag_t sampleTags; // get CHROM const std::string& chrom(_header.chrom_data[bpA.interval.tid].label); @@ -424,42 +421,42 @@ writeTransloc( } // build INFO field - infotags.push_back("SVTYPE=BND"); - infotags.push_back("MATEID="+mateId); + infoTags.push_back("SVTYPE=BND"); + infoTags.push_back("MATEID="+mateId); if (isImprecise) { - infotags.push_back("IMPRECISE"); + infoTags.push_back("IMPRECISE"); } else if (_isOutputContig) { - infotags.push_back("CONTIG=" + sv.contigSeq); + infoTags.push_back("CONTIG=" + sv.contigSeq); } if (bpARange.size() > 1) { - infotags.push_back( str( boost::format("CIPOS=%i,%i") % ((bpARange.begin_pos()+1) - pos) % (bpARange.end_pos() - pos) )); + infoTags.push_back( str( boost::format("CIPOS=%i,%i") % ((bpARange.begin_pos()+1) - pos) % (bpARange.end_pos() - pos) )); } if (! isImprecise) { const pos_t bpAPosAdjust(0); - addHomologyInfo(_referenceFilename, chrom, bpARange, bpAPosAdjust, infotags); + addHomologyInfo(_referenceFilename, chrom, bpARange, bpAPosAdjust, infoTags); } if (! insertSeq.empty()) { - infotags.push_back( str( boost::format("SVINSLEN=%i") % (insertSeq.size()) )); - infotags.push_back( str( boost::format("SVINSSEQ=%s") % (insertSeq) )); + infoTags.push_back( str( boost::format("SVINSLEN=%i") % (insertSeq.size()) )); + infoTags.push_back( str( boost::format("SVINSSEQ=%s") % (insertSeq) )); } - addSharedInfo(event, infotags); + addSharedInfo(event, infoTags); - modifyInfo(event, infotags); - modifyTranslocInfo(sv, isFirstBreakend, adata, infotags); + modifyInfo(event, infoTags); + modifyTranslocInfo(sv, isFirstBreakend, adata, infoTags); - modifySample(sv, sampletags); + modifySample(sv, sampleTags); #ifdef DEBUG_VCF - addDebugInfo(bpA, bpB, isFirstBreakend, adata, infotags); + addDebugInfo(bpA, bpB, isFirstBreakend, adata, infoTags); #endif // write out record: @@ -473,8 +470,8 @@ writeTransloc( _os << '\t'; writeFilter(); _os << '\t'; - makeInfoField(infotags,_os); // INFO - makeFormatSampleField(sampletags, _os); // FORMAT + SAMPLE + makeInfoField(infoTags,_os); // INFO + makeFormatSampleField(sampleTags, _os); // FORMAT + SAMPLE _os << '\n'; } @@ -497,7 +494,7 @@ writeTranslocPair( void VcfWriterSV:: -writeInvdel( +writeIndel( const SVCandidate& sv, const SVId& svId, const bool isIndel, @@ -706,22 +703,6 @@ writeInvdel( } } - if (svId.svType == EXTENDED_SV_TYPE::INVERSION) - { - if (sv.bp1.state == SVBreakendState::RIGHT_OPEN) - { - infoTags.push_back("INV3"); - } - else if (sv.bp1.state == SVBreakendState::LEFT_OPEN) - { - infoTags.push_back("INV5"); - } - else - { - assert(false && "Unexpected inversion configuration"); - } - } - addSharedInfo(event, infoTags); modifyInfo(event, infoTags); @@ -798,14 +779,14 @@ writeSVCore( try { - if (isSVTransloc(svType)) + if (isSVTransloc(svType) || isSVInv(svType)) { writeTranslocPair(sv, svId, svData, adata, event); } else { const bool isIndel(isSVIndel(svType)); - writeInvdel(sv, svId, isIndel, event); + writeIndel(sv, svId, isIndel, event); } } catch (...) 
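The VcfWriterSV.cpp changes above retire the INV ALT header line and the INV3/INV5 INFO tags, rename writeInvdel() to writeIndel(), and route inversion candidates through writeTranslocPair(), so an intrachromosomal inversion is now reported as a mated pair of BND records rather than a single INV record. The sketch below is only an illustration of how such a pair can be recognized from the BND bracket notation; the function name is hypothetical, and the classification rule simply mirrors the checkInversion() logic in the convertInversion.py script added later in this patch.

# Minimal sketch (not part of the patch): recognize an inversion breakend from a
# Manta BND record, following the same rule as checkInversion() in convertInversion.py:
# an ALT starting with '[' whose mate is on the same chromosome is tagged INV5,
# and an ALT ending with ']' whose mate is on the same chromosome is tagged INV3.

def classify_inversion_breakend(chrom, alt):
    """Return 'INV5', 'INV3', or None for a single BND ALT allele."""
    if alt.startswith('['):
        # ALT of the form [mateChrom:matePos[t
        mate_chrom = alt.split('[')[1].split(':')[0]
        return 'INV5' if mate_chrom == chrom else None
    if alt.endswith(']'):
        # ALT of the form t]mateChrom:matePos]
        mate_chrom = alt.split(']')[1].split(':')[0]
        return 'INV3' if mate_chrom == chrom else None
    return None


if __name__ == '__main__':
    print(classify_inversion_breakend('chr1', '[chr1:23456[T'))  # INV5
    print(classify_inversion_breakend('chr1', 'T]chr1:23456]'))  # INV3
    print(classify_inversion_breakend('chr1', 'T[chr2:999['))    # None: interchromosomal translocation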
diff --git a/src/c++/lib/format/VcfWriterSV.hh b/src/c++/lib/format/VcfWriterSV.hh index 23510668..4220666c 100644 --- a/src/c++/lib/format/VcfWriterSV.hh +++ b/src/c++/lib/format/VcfWriterSV.hh @@ -170,7 +170,7 @@ private: /// \param isIndel if true, the variant is a simple right/left breakend insert/delete combination void - writeInvdel( + writeIndel( const SVCandidate& sv, const SVId& svId, const bool isIndel, diff --git a/src/c++/lib/manta/JunctionIdGenerator.cpp b/src/c++/lib/manta/JunctionIdGenerator.cpp index 0603f039..bc007691 100644 --- a/src/c++/lib/manta/JunctionIdGenerator.cpp +++ b/src/c++/lib/manta/JunctionIdGenerator.cpp @@ -40,7 +40,7 @@ getId( svId.localId = str(_SVIdFormatter % label(svId.svType) % edge.locusIndex % edge.nodeIndex1 % edge.nodeIndex2 % sv.candidateIndex % sv.assemblyAlignIndex % sv.assemblySegmentIndex ); - if (isSVTransloc(svId.svType)) + if (isSVTransloc(svId.svType) || isSVInv(svId.svType)) { svId.mateId = svId.localId + ":1"; svId.localId = svId.localId + ":0"; diff --git a/src/c++/lib/manta/MultiJunctionUtil.cpp b/src/c++/lib/manta/MultiJunctionUtil.cpp index e78a0143..b2d654d8 100644 --- a/src/c++/lib/manta/MultiJunctionUtil.cpp +++ b/src/c++/lib/manta/MultiJunctionUtil.cpp @@ -238,9 +238,9 @@ setPartner( } - -// 1 is the new partner -// 2 is the previously connected partner +/* +// \param spanIndex1 the new partner +// \param spanIndex2 the previously connected partner static void resetPartners( @@ -271,6 +271,7 @@ resetPartners( setPartner(spanPartners,newType,maxPartnerDistance,spanIndex1,spanIndex2); setPartner(spanPartners,newType,maxPartnerDistance,spanIndex2,spanIndex1); } +*/ @@ -319,6 +320,8 @@ findMultiJunctionCandidates( { using namespace MJ_INTERACTION; + bool hasMultiJunction(false); + for (unsigned spanIndexA(0); (spanIndexA+1) -/// convert independent SV candidates into multi-junction event candidates +/// \brief Convert independent SV candidates into multi-junction event candidates /// -/// given a set of un-associated single-junction SV candidates, analyze which +/// Given a set of un-associated single-junction SV candidates, analyze which /// candidate junctions could potentially be treated as a single multi-junction /// event (such as a reciprocal translocation) /// -/// right now multi-junction events are limited to pairs of (spannning) SV candidates, where +/// Right now multi-junction events are limited to pairs of (spannning) SV candidates, where /// the breakends of both junctions in the pair are proximal and meeting the expected orientation /// pattern consistent with a reciprocal translocation. /// diff --git a/src/c++/lib/manta/ReadGroupStatsUtil.cpp b/src/c++/lib/manta/ReadGroupStatsUtil.cpp index 695aa461..55f339b9 100644 --- a/src/c++/lib/manta/ReadGroupStatsUtil.cpp +++ b/src/c++/lib/manta/ReadGroupStatsUtil.cpp @@ -1078,7 +1078,7 @@ extractReadGroupStatsFromAlignmentFile( if (! 
isNormal) { - chromHighestPos[chromIndex] += chromSize[chromIndex]/100; + chromHighestPos[chromIndex] += std::max(1, chromSize[chromIndex]/100); #ifdef DEBUG_RPS std::cerr << " Jump to chrid: " << chromIndex << " position: " << chromHighestPos[chromIndex] << "\n"; @@ -1106,7 +1106,7 @@ extractReadGroupStatsFromAlignmentFile( #ifdef DEBUG_RPS std::cerr << "No read found in the previous region.\n"; #endif - chromHighestPos[chromIndex] += chromSize[chromIndex]/100; + chromHighestPos[chromIndex] += std::max(1, chromSize[chromIndex]/100); } } diff --git a/src/c++/lib/manta/SVCandidate.hh b/src/c++/lib/manta/SVCandidate.hh index d3b17e12..107837a1 100644 --- a/src/c++/lib/manta/SVCandidate.hh +++ b/src/c++/lib/manta/SVCandidate.hh @@ -49,6 +49,16 @@ struct SVCandidate return _isImprecise; } + /// \brief Test if this SVCandidate intersects with \p rhs + /// + /// Two SVCandidates intersect if their breakend regions overlap in the same direction. + /// + /// In the schematic below, the intersecting candidate pairs are (1,2) and (2,3) + /// + /// Candidate1: >bp2>----------------------------------->>bp1>> + /// Candidate2: >>>>>>>bp1>>>>>>>--------------->>bp2>> + /// Candidate3: >>>bp2>>>--------------->>>>>>>bp1>>>>>> + /// Candidate3: <<second; } - /// get evidence associated with a specific sample group: + /// \brief Get evidence associated with a specific sample group: const SVCandidateSetSequenceFragmentSampleGroup& getDataGroup(const unsigned bamIndex) const { diff --git a/src/c++/lib/manta/SVCandidateUtil.hh b/src/c++/lib/manta/SVCandidateUtil.hh index 377f6b02..0a2460d3 100644 --- a/src/c++/lib/manta/SVCandidateUtil.hh +++ b/src/c++/lib/manta/SVCandidateUtil.hh @@ -133,6 +133,19 @@ isSVIndel(const index_t idx) } } +inline +bool +isSVInv(const index_t idx) +{ + switch (idx) + { + case INVERSION: + return true; + default: + return false; + } +} + // provide a shortened label (mostly from the VCF spec) inline const char* @@ -145,7 +158,7 @@ label(const index_t idx) case INTRATRANSLOC: return "BND"; case INVERSION: - return "INV"; + return "BND"; case INSERT: return "INS"; case DELETE: diff --git a/src/c++/lib/manta/SVLocusScanner.hh b/src/c++/lib/manta/SVLocusScanner.hh index 28fda1ea..42d6cc80 100644 --- a/src/c++/lib/manta/SVLocusScanner.hh +++ b/src/c++/lib/manta/SVLocusScanner.hh @@ -215,13 +215,17 @@ struct SVLocusScanner std::vector& loci, SampleEvidenceCounts& eCounts) const; - /// get local and remote breakends for each SV Candidate which can be extracted from a read pair + /// \brief Find all candidate SV observations from a single input read pair /// - /// if remote read is not available, set remoteReadPtr to nullptr and a best estimate will be generated for the remote breakend + /// If the alignment record of the remote read from the read pair is not available, set remoteReadPtr to nullptr + /// and a best estimate will be generated for the remote breakend. /// - /// for all candidates, if one breakend is estimated from localRead and one is estimated from remoteRead, then - /// the local breakend will be placed in candidate bp1 and the remote breakend will be placed in candidate.bp2 + /// For all SVObservations generated, if one breakend is estimated from localRead and one is estimated from + /// remoteRead, then the local breakend will be placed in candidate bp1 and the remote breakend will be placed + /// in candidate.bp2 /// + /// \param[out] candidates Report all candidates found in the input read pair to this structure. This vector is + /// cleared on input. 
void getBreakendPair( const bam_record& localRead, diff --git a/src/c++/lib/manta/SVReferenceUtil.hh b/src/c++/lib/manta/SVReferenceUtil.hh index 51c2abab..30268a92 100644 --- a/src/c++/lib/manta/SVReferenceUtil.hh +++ b/src/c++/lib/manta/SVReferenceUtil.hh @@ -52,7 +52,7 @@ isRefRegionValid( /// to this interval /// /// \param[in] extraRefEdgeSize add this value to the ends of each -/// interval prior to chomosome length clipping and reference +/// interval prior to chromosome length clipping and reference /// retrieval /// \param[out] leadingTrim indicates how much was cut from the /// front of the requested interval (with edge buffer) diff --git a/src/c++/lib/options/AlignmentFileOptionsParser.cpp b/src/c++/lib/options/AlignmentFileOptionsParser.cpp index bac42d82..4f187682 100644 --- a/src/c++/lib/options/AlignmentFileOptionsParser.cpp +++ b/src/c++/lib/options/AlignmentFileOptionsParser.cpp @@ -82,7 +82,7 @@ parseOptions( std::set nameCheck; for (std::string& afile : opt.alignmentFilenames) { - if (checkStandardizeInputFile(afile,"alignment file",errorMsg)) break; + if (checkAndStandardizeRequiredInputFilePath(afile, "alignment file", errorMsg)) break; if (nameCheck.count(afile)) { std::ostringstream oss; diff --git a/src/c++/lib/options/CallOptionsDiploid.hh b/src/c++/lib/options/CallOptionsDiploid.hh index 95eea396..d57cf4a2 100644 --- a/src/c++/lib/options/CallOptionsDiploid.hh +++ b/src/c++/lib/options/CallOptionsDiploid.hh @@ -37,16 +37,22 @@ struct CallOptionsDiploid unsigned minPassAltScore = 20; std::string minAltFilterLabel = "MinQUAL"; - /// below this GQ value, the SAMPLE filter is marked in the VCF - unsigned minPassGTScore = 10; - std::string minGTFilterLabel = "MinGQ"; - // control filtration based on MQ0 fraction: float maxMQ0Frac = 0.4f; std::string maxMQ0FracLabel = "MaxMQ0Frac"; /// filter for large SVs with no pair support std::string noPairSupportLabel = "NoPairSupport"; + + /// no sample passes all sample-specific filters + std::string failedSampleFTLabel = "SampleFT"; + + /// below this GQ value, the SAMPLE filter is marked in the VCF + unsigned minPassGTScore = 10; + std::string minGTFilterLabel = "MinGQ"; + + /// the SAMPLE filter for home-ref calls + std::string homRefLabel = "HomRef"; }; diff --git a/src/c++/lib/options/SVRefinerOptions.hh b/src/c++/lib/options/SVRefinerOptions.hh index 08627202..2501f6fb 100644 --- a/src/c++/lib/options/SVRefinerOptions.hh +++ b/src/c++/lib/options/SVRefinerOptions.hh @@ -24,9 +24,9 @@ typedef IterativeAssemblerOptions AssemblerOptions; -namespace SUPERTMP +namespace ALIGNERPARAM { -static const int largeGapOpenScore(-24); +static const int largeGapOpenScore(-100); } @@ -40,27 +40,24 @@ struct SVRefinerOptions /// match, mismatch, open score ratios taken from bwa defaults (but not extend!) 
: /// SVRefinerOptions() : - smallSVAlignScores(2, -8, -12, 0, -1), - largeSVAlignScores(2, -8, -18, -1, -1), + largeSVAlignScores(2, -8, -24, -1, -1), largeInsertEdgeAlignScores(2, -8, -18, -1, -1), - largeInsertCompleteAlignScores(2, -8, SUPERTMP::largeGapOpenScore, 0, -1), + largeInsertCompleteAlignScores(2, -8, ALIGNERPARAM::largeGapOpenScore, 0, -1), spanningAlignScores(2, -8, -12, -1, -1), - largeGapOpenScore(SUPERTMP::largeGapOpenScore), - jumpScore(-25), + largeGapOpenScore(ALIGNERPARAM::largeGapOpenScore), + jumpScore(-100), RNAspanningAlignScores(2, -8, -19, -1, -1), RNAJumpScore(-100), RNAIntronOpenScore(-15), RNAIntronOffEdgeScore(-1), contigFilterScores(2, -8, -18, 0, -1) { - spanningAssembleOpt.minContigLength=75; ///< For breakend-spanning assemblies we require a larger contig than for small-variant assemblies + spanningAssembleOpt.minContigLength = 75; ///< For breakend-spanning assemblies we require a larger contig than for small-variant assemblies RNAspanningAssembleOpt.minContigLength = 75; ///< For breakend-spanning assemblies we require a larger contig than for small-variant assemblies RNAspanningAssembleOpt.minWordLength = 31; /// Use smaller kmer for RNA - } /// parameters for small SV assembly/alignment: - AlignmentScores smallSVAlignScores; AlignmentScores largeSVAlignScores; // large SV but at a single assembly locus AlignmentScores largeInsertEdgeAlignScores; AlignmentScores largeInsertCompleteAlignScores; diff --git a/src/c++/lib/options/optionsUtil.cpp b/src/c++/lib/options/optionsUtil.cpp index 0a0ca834..40d03606 100644 --- a/src/c++/lib/options/optionsUtil.cpp +++ b/src/c++/lib/options/optionsUtil.cpp @@ -26,7 +26,7 @@ bool -checkStandardizeInputFile( +checkAndStandardizeRequiredInputFilePath( std::string& filename, const char* fileLabel, std::string& errorMsg) diff --git a/src/c++/lib/options/optionsUtil.hh b/src/c++/lib/options/optionsUtil.hh index 76a20fc8..77d78903 100644 --- a/src/c++/lib/options/optionsUtil.hh +++ b/src/c++/lib/options/optionsUtil.hh @@ -22,13 +22,11 @@ #include -/// check if input file exists and is usable as -/// input, if so canonicalize the name +/// Check if input path exists and is usable as input, if so, convert the input filename to an absolute path. 
/// -/// In case of error return true and provide error -/// message +/// In case of error return true and provide error message bool -checkStandardizeInputFile( +checkAndStandardizeRequiredInputFilePath( std::string& filename, const char* fileLabel, std::string& errorMsg); diff --git a/src/c++/lib/svgraph/SVLocusSet.cpp b/src/c++/lib/svgraph/SVLocusSet.cpp index 55a48146..c62494b4 100644 --- a/src/c++/lib/svgraph/SVLocusSet.cpp +++ b/src/c++/lib/svgraph/SVLocusSet.cpp @@ -1416,7 +1416,7 @@ SVLocusSet:: SVLocusSet( const char* filename, const bool isSkipIndex) - : SVLocusSet(SVLocusSetOptions(),bam_header_info(),{}) + : SVLocusSet(SVLocusSetOptions(),bam_header_info(), {}) { using namespace boost::archive; diff --git a/src/demo/expectedResults/somaticSV.vcf.gz b/src/demo/expectedResults/somaticSV.vcf.gz index 879d5b8d..4cfd4d48 100644 Binary files a/src/demo/expectedResults/somaticSV.vcf.gz and b/src/demo/expectedResults/somaticSV.vcf.gz differ diff --git a/src/python/lib/mantaWorkflow.py b/src/python/lib/mantaWorkflow.py index e0dc9972..1c2ed5ec 100644 --- a/src/python/lib/mantaWorkflow.py +++ b/src/python/lib/mantaWorkflow.py @@ -458,7 +458,7 @@ def mergeSupportBams(self, mergeBamTasks, taskPrefix="", isNormal=True, bamIdx=0 for bamPath in bamList: # merge support bams - supportBamFile = self.paths.getFinalSupportBamPath(bamPath) + supportBamFile = self.paths.getFinalSupportBamPath(bamPath, bamIdx) mergeCmd = [ sys.executable, self.params.mantaMergeBam, self.params.samtoolsBin, self.paths.getSortedSupportBamMask(bamIdx), @@ -770,10 +770,10 @@ def getSortedSupportBamMask(self, bamIdx): return os.path.join(self.getHyGenDir(), "evidence_*.bam_%s.sorted.bam" % (bamIdx)) - def getFinalSupportBamPath(self, bamPath): + def getFinalSupportBamPath(self, bamPath, bamIdx): bamPrefix = os.path.splitext(os.path.basename(bamPath))[0] return os.path.join(self.params.evidenceDir, - "evidence.%s.bam" % (bamPrefix)) + "evidence_%s.%s.bam" % (bamIdx, bamPrefix)) def getSupportBamListPath(self, bamIdx): return os.path.join(self.getHyGenDir(), diff --git a/src/python/libexec/convertInversion.py b/src/python/libexec/convertInversion.py new file mode 100644 index 00000000..e250e7b9 --- /dev/null +++ b/src/python/libexec/convertInversion.py @@ -0,0 +1,291 @@ +#!/usr/bin/env python2 +# +# Manta - Structural Variant and Indel Caller +# Copyright (c) 2013-2018 Illumina, Inc. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . 
+# +# + + +import sys +import gzip +from io import BufferedReader +from subprocess import check_output +from os import path +from os.path import exists + + +class VcfRecord: + + def __init__(self, inline): + tokens = inline.strip().split('\t') + + self.chr = tokens[0] + self.pos = int(tokens[1]) + self.vid = tokens[2] + self.ref = tokens[3] + self.alt = tokens[4] + self.qual = tokens[5] + self.filter = tokens[6] + self.info = tokens[7].split(';') + self.others = "\t".join(tokens[8:]) + + # Create a dictionary for INFO + self.infoDict ={} + for infoItem in self.info: + items = infoItem.split('=') + if len(items) == 1: + self.infoDict[items[0]] = True + elif len(items) > 1: + self.infoDict[items[0]] = items[1] + + self.isINV3 = False + self.isINV5 = False + self.mateChr = "" + self.matePos = -1 + + + def checkInversion(self): + + def getMateInfo(splitChar): + items = self.alt.split(splitChar) + [self.mateChr, matePos] = items[1].split(':') + self.matePos = int(matePos) + + if self.alt.startswith('['): + getMateInfo('[') + if self.mateChr == self.chr: + self.isINV5 = True + elif self.alt.endswith(']'): + getMateInfo(']') + if self.mateChr == self.chr: + self.isINV3 = True + + + def makeLine(self): + infoStr = ";".join(self.info) + + self.line = "\t".join((self.chr, + str(self.pos), + self.vid, + self.ref, + self.alt, + self.qual, + self.filter, + infoStr, + self.others + ))+"\n" + + +def scanVcf(vcfFile): + + invMateDict = {} + + if vcfFile.endswith('gz'): + gzfp = gzip.open(vcfFile, 'rb') + fpVcf = BufferedReader(gzfp) + else: + fpVcf = open(vcfFile, 'rb') + + for line in fpVcf: + if line[0] == '#': + continue + + vcfRec = VcfRecord(line) + vcfRec.checkInversion() + if vcfRec.isINV3 or vcfRec.isINV5: + if vcfRec.vid in invMateDict: + # update mate INFO + invMateDict[vcfRec.vid] = vcfRec.infoDict + else: + mateId = vcfRec.infoDict["MATEID"] + invMateDict[mateId] = "" + + return invMateDict + + +def getReference(samtools, refFasta, chrom, start, end): + region = "%s:%d-%d" % (chrom, start, end) + samtoolsOut = check_output([samtools, "faidx", refFasta, region]) + refSeq = "" + for seq in samtoolsOut.split('\n'): + if not seq.startswith(">"): + refSeq += seq + + return refSeq.upper() + + +def writeLines(lines): + for line in lines: + sys.stdout.write(line) + + +def convertInversions(samtools, refFasta, vcfFile, invMateDict): + isHeaderInfoAdded = False + isHeaderAltAdded = False + lineBuffer = [] + bufferedChr = "" + bufferedPos = -1 + + if vcfFile.endswith('gz'): + gzfp = gzip.open(vcfFile, 'rb') + fpVcf = BufferedReader(gzfp) + else: + fpVcf = open(vcfFile, 'rb') + + for line in fpVcf: + if line.startswith('#'): + if (not isHeaderInfoAdded) and line.startswith("##FORMAT="): + sys.stdout.write("##INFO=\n") + sys.stdout.write("##INFO=\n") + isHeaderInfoAdded = True + + if (not isHeaderAltAdded) and line.startswith("##ALT="): + sys.stdout.write("##ALT=\n") + isHeaderAltAdded = True + + sys.stdout.write(line) + continue + + vcfRec = VcfRecord(line) + + # skip mate record + if vcfRec.vid in invMateDict: + continue + + vcfRec.checkInversion() + if vcfRec.isINV3 or vcfRec.isINV5: + if vcfRec.isINV5: + # adjust POS for INV5 + vcfRec.pos -= 1 + vcfRec.matePos -= 1 + vcfRec.ref = getReference(samtools, refFasta, + vcfRec.chr, vcfRec.pos, vcfRec.pos) + + # update manta ID + vidSuffix = vcfRec.vid.split("MantaBND")[1] + idx = vidSuffix.rfind(':') + vcfRec.vid = "MantaINV%s" % vidSuffix[:idx] + + # symbolic ALT + vcfRec.alt = "" + + # add END + infoEndStr = "END=%d" % vcfRec.matePos + + newInfo = 
[infoEndStr] + for infoItem in vcfRec.info: + if infoItem.startswith("SVTYPE"): + # change SVTYPE + newInfo.append("SVTYPE=INV") + # add SVLEN + infoSvLenStr = "SVLEN=%d" % (vcfRec.matePos-vcfRec.pos) + newInfo.append(infoSvLenStr) + + elif infoItem.startswith("CIPOS"): + newInfo.append(infoItem) + + # set CIEND + isImprecise = "IMPRECISE" in vcfRec.infoDict + # for imprecise calls, set CIEND to the mate breakpoint's CIPOS + if isImprecise: + mateId = vcfRec.infoDict["MATEID"] + mateInfoDict = invMateDict[mateId] + infoCiEndStr = "CIEND=%s" % (mateInfoDict["CIPOS"]) + newInfo.append(infoCiEndStr) + # for precise calls, set CIEND w.r.t HOMLEN + else: + if "HOMLEN" in vcfRec.infoDict: + infoCiEndStr = "CIEND=-%s,0" % vcfRec.infoDict["HOMLEN"] + newInfo.append(infoCiEndStr) + + elif infoItem.startswith("HOMSEQ"): + # update HOMSEQ for INV5 + if vcfRec.isINV5: + cipos = vcfRec.infoDict["CIPOS"].split(',') + homSeqStart = vcfRec.pos + int(cipos[0]) + 1 + homSeqEnd = vcfRec.pos + int(cipos[1]) + refSeq = getReference(samtools, refFasta, vcfRec.chr, + homSeqStart, homSeqEnd) + infoHomSeqStr = "HOMSEQ=%s" % refSeq + newInfo.append(infoHomSeqStr) + else: + newInfo.append(infoItem) + + # skip BND-specific tags + elif (infoItem.startswith("MATEID") or + infoItem.startswith("BND_DEPTH") or + infoItem.startswith("MATE_BND_DEPTH")): + continue + + # update event ID + elif infoItem.startswith("EVENT"): + eidSuffix = vcfRec.infoDict["EVENT"].split("MantaBND")[1] + idx = vidSuffix.rfind(':') + infoEventStr = "EVENT=MantaINV%s" % eidSuffix[:idx] + newInfo.append(infoEventStr) + + # apply all other tags + else: + newInfo.append(infoItem) + + # add INV3/INV5 tag + if vcfRec.isINV3: + newInfo.append("INV3") + elif vcfRec.isINV5: + newInfo.append("INV5") + + vcfRec.info = newInfo + + vcfRec.makeLine() + + # make sure the vcf is sorted in genomic order + if (not vcfRec.chr == bufferedChr) or (vcfRec.pos > bufferedPos): + if lineBuffer: + writeLines(lineBuffer) + + lineBuffer = [vcfRec.line] + bufferedChr = vcfRec.chr + bufferedPos = vcfRec.pos + elif vcfRec.pos < bufferedPos: + lineBuffer.insert(0, vcfRec.line) + else: + lineBuffer.append(vcfRec.line) + + if lineBuffer: + writeLines(lineBuffer) + + + +if __name__=='__main__': + + usage = "convertInversion.py \n" + if len(sys.argv) <= 3: + sys.stderr.write(usage) + sys.exit(1) + + samtools = sys.argv[1] + refFasta = sys.argv[2] + vcfFile = sys.argv[3] + + for inputFile in [samtools, refFasta, vcfFile]: + if not(exists(inputFile)): + errMsg = ('File %s does not exist.' + % inputFile) + sys.stderr.write(errMsg + '\nProgram exits.') + sys.exit(1) + + invMateDict = scanVcf(vcfFile) + convertInversions(samtools, refFasta, vcfFile, invMateDict) diff --git a/src/python/libexec/updateSampleFTFilter.py b/src/python/libexec/updateSampleFTFilter.py new file mode 100644 index 00000000..599e8acd --- /dev/null +++ b/src/python/libexec/updateSampleFTFilter.py @@ -0,0 +1,158 @@ +#!/usr/bin/env python2 +# +# Manta - Structural Variant and Indel Caller +# Copyright (c) 2013-2018 Illumina, Inc. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# + +""" +Given a germline VCF from Manta, add/refresh the "SampleFT" FILTER on each record which does not +include a passing FORMAT/FT in at least one sample. +""" + +import os, sys +import re + + +class VCFID : + CHROM = 0 + POS = 1 + REF = 3 + ALT = 4 + QUAL = 5 + FILTER = 6 + INFO = 7 + FORMAT = 8 + + + +def getOptions() : + + from optparse import OptionParser + + usage = "usage: %prog < vcf > vcf_with_updated_filters" + parser = OptionParser(usage=usage, description=__doc__) + + (options,args) = parser.parse_args() + + if (len(args) != 0) or sys.stdin.isatty() : + parser.print_help() + sys.exit(2) + + return (options,args) + + + +class Constants : + filterLabel="SampleFT" + filterHeaderText="##FILTER=" % (filterLabel) + + nonZeroAllele = re.compile('[1-9]') + + + +def processVariantRecordLine(outfp, line) : + """ + Process each VCF variant record and write results to outfp stream after potentially modifying the record's + FILTER value + """ + w=line.strip().split('\t') + filters=w[VCFID.FILTER].split(';') + + assert(len(filters)) + if filters[0] == "." or filters[0] == "PASS" : filters = [] + + formatTags=w[VCFID.FORMAT].split(':') + assert(len(formatTags)) + if formatTags[0] == "." : formatTags = [] + + def getFilterField() : + if len(filters) == 0 : + return "PASS" + else : + return ";".join(filters) + + def outputModifiedRecord() : + w[VCFID.FILTER] = getFilterField() + outfp.write("\t".join(w) + "\n") + + def addFilterAndOutput() : + if Constants.filterLabel in filters : + outfp.write(line) + else : + filters.append(Constants.filterLabel) + outputModifiedRecord() + + def removeFilterAndOutput() : + if Constants.filterLabel not in filters : + outfp.write(line) + else : + filters.remove(Constants.filterLabel) + outputModifiedRecord() + + try : + ftIndex = formatTags.index("FT") + except ValueError: + addFilterAndOutput() + return + + isPassed=False + for sampleIndex in range(VCFID.FORMAT+1, len(w)) : + sampleVals=w[sampleIndex].split(':') + ft=sampleVals[ftIndex] + + if (ft == "PASS") : + isPassed=True + break + + if isPassed : + removeFilterAndOutput() + else : + addFilterAndOutput() + + + + +def main() : + + (_,_) = getOptions() + + infp=sys.stdin + outfp=sys.stdout + + isFilterDescriptionFound=False + for line in infp : + + # Scan VCF header to determine if the NoPassedVariantGTs filter description needs to be added: + # + if line.startswith("##") : + if line.startswith("##FILTER") : + if line.find(Constants.filterLabel) != -1 : + isFilterDescriptionFound=True + elif line.startswith("#") : + if not isFilterDescriptionFound : + outfp.write(Constants.filterHeaderText + "\n") + + if line.startswith("#") : + outfp.write(line) + continue + + # Handle all remaining (non-header) content: + processVariantRecordLine(outfp, line) + + + +main()
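The two scripts added in this patch are post-processing steps over Manta's germline VCF output: convertInversion.py rewrites same-chromosome BND pairs back into single INV records (taking a samtools path, reference fasta, and VCF path on the command line and writing the converted VCF to stdout), while updateSampleFTFilter.py refreshes the record-level SampleFT filter from the per-sample FORMAT/FT values on a VCF streamed through stdin, applying the same rule that the scoreDiploidSV() change applies at scoring time. As a rough usage sketch with placeholder paths, that amounts to something like "convertInversion.py samtoolsPath genome.fa results/variants/diploidSV.vcf.gz > diploidSV.converted.vcf" and "gzip -dc diploidSV.vcf.gz | updateSampleFTFilter.py > diploidSV.refiltered.vcf". The snippet below restates the SampleFT decision rule in isolation; the helper name and the example FORMAT/FT values are illustrative only.

# Minimal sketch (not part of the patch): the record-level 'SampleFT' rule shared by
# scoreDiploidSV() and updateSampleFTFilter.py, restated for a single parsed record.

def needs_sample_ft_filter(format_tags, sample_columns):
    """True when no sample's FORMAT/FT value is PASS, so the record gets 'SampleFT'."""
    try:
        ft_index = format_tags.index("FT")
    except ValueError:
        # No per-sample FT tag at all: treat the record as failing,
        # matching the behavior of updateSampleFTFilter.py.
        return True
    return not any(col.split(':')[ft_index] == "PASS" for col in sample_columns)


if __name__ == '__main__':
    fmt = ["GT", "FT", "GQ"]
    # One sample fails (MinGQ) but another passes: no record-level filter.
    print(needs_sample_ft_filter(fmt, ["0/1:MinGQ:8", "1/1:PASS:53"]))    # False
    # Every sample fails its sample-level filters: the record is marked 'SampleFT'.
    print(needs_sample_ft_filter(fmt, ["0/0:HomRef:45", "0/1:MinGQ:6"]))  # True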