Skip to content

Commit

Permalink
Merge pull request #98 from zwets/point-to-resfinder
Browse files Browse the repository at this point in the history
Remove standalone PointFinder support
  • Loading branch information
fmaguire authored Jan 3, 2025
2 parents 3b407f5 + 2127e9a commit fdd000d
Show file tree
Hide file tree
Showing 15 changed files with 220 additions and 318 deletions.
18 changes: 7 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
# hAMRonization

This repo contains the hAMRonization module and CLI parser tools combine the outputs of
18 disparate antimicrobial resistance gene detection tools into a single unified format.
17 disparate antimicrobial resistance gene detection tools into a single unified format.

This is an implementation of the [hAMRonization AMR detection specification scheme](docs/hAMRonization_specification_details.csv) which supports gene presence/absence resistance and mutational resistance (if supported by the underlying tool).

Expand Down Expand Up @@ -66,7 +66,7 @@ options:
-v, --version show program's version number and exit
Tools with hAMRonizable reports:
{abricate,amrfinderplus,amrplusplus,ariba,csstar,deeparg,fargene,groot,kmerresistance,resfams,resfinder,mykrobe,pointfinder,rgi,srax,srst2,staramr,tbprofiler,summarize}
{abricate,amrfinderplus,amrplusplus,ariba,csstar,deeparg,fargene,groot,kmerresistance,resfams,resfinder,mykrobe,rgi,srax,srst2,staramr,tbprofiler,summarize}
abricate hAMRonize abricate's output report i.e., OUTPUT.tsv
amrfinderplus hAMRonize amrfinderplus's output report i.e., OUTPUT.tsv
amrplusplus hAMRonize amrplusplus's output report i.e., gene.tsv
Expand All @@ -80,11 +80,8 @@ Tools with hAMRonizable reports:
report`)
kmerresistance hAMRonize kmerresistance's output report i.e., OUTPUT.res
resfams hAMRonize resfams's output report i.e., resfams.tblout
resfinder hAMRonize resfinder's output report i.e.,
ResFinder_results_tab.txt
resfinder hAMRonize resfinder's JSON output report (use -j to produce)
mykrobe hAMRonize mykrobe's output report i.e., OUTPUT.json
pointfinder hAMRonize pointfinder's output report i.e.,
PointFinder_results.txt
rgi hAMRonize rgi's output report i.e., OUTPUT.txt or
OUTPUT_bwtoutput.gene_mapping_data.txt
srax hAMRonize srax's output report i.e., sraX_detected_ARGs.tsv
Expand Down Expand Up @@ -193,7 +190,7 @@ If you want to write multiple reports to one file, this `.write` method can acce
Currently implemented parsers and the last tool version for which they have been validated:

1. [abricate](hAMRonization/AbricateIO.py): last updated for v1.0.0
2. [amrfinderplus](hAMRonization/AmrFinderPlusIO.py): last updated for v3.12.18
2. [amrfinderplus](hAMRonization/AmrFinderPlusIO.py): last updated for v4.0.3
3. [amrplusplus](hAMRonization/AmrPlusPlusIO.py): last updated for c6b097a
4. [ariba](hAMRonization/AribaIO.py): last updated for v2.14.6
5. [csstar](hAMRonization/CSStarIO.py): last updated for v2.1.0
Expand All @@ -202,7 +199,7 @@ Currently implemented parsers and the last tool version for which they have been
8. [groot](hAMRonization/GrootIO.py): last updated for v1.1.2
9. [kmerresistance](hAMRonization/KmerResistanceIO.py): late updated for v2.2.0
10. [mykrobe](test/data/raw_outputs/mykrobe/report.json): last updated for v0.8.1
11. [pointfinder](hAMRonization/PointFinderIO.py): last updated for v4.1.0
11. ~pointfinder~ (removed, PointFinder is now integrated in ResFinder)
12. [resfams](hAMRonization/ResFamsIO.py): last updated for hmmer v3.3.2
13. [resfinder](hAMRonization/ResFinderIO.py): last updated for v4.6.0
14. [rgi](hAMRonization/RgiIO.py) (includes RGI-BWT) last updated for v5.2.0
Expand Down Expand Up @@ -263,13 +260,12 @@ First fork this repository and set up a development environment (replacing `YOUR

```
git clone https://github.com/YOURUSERNAME/hAMRonization
conda create -n hAMRonization
conda create -n hAMRonization python pip pytest flake8
conda activate hAMRonization
cd hAMRonization
pip install pytest flake8
pip install -e .
```

## Testing and Linting

On every commit github actions automatically runs tests and linting to check
Expand Down
161 changes: 80 additions & 81 deletions hAMRonization/AmrFinderPlusIO.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,78 +18,56 @@


class AmrFinderPlusIterator(hAMRonizedResultIterator):

nuc_field_map = {
"Protein id": None,
"Contig id": "input_sequence_id",
"Start": "input_gene_start",
"Stop": "input_gene_stop",
"Strand": "strand_orientation",
"Element symbol": "gene_symbol",
"Element name": "gene_name",
"Scope": None,
"Type": None,
"Subtype": None,
"Class": "drug_class",
"Subclass": "antimicrobial_agent",
"Method": None,
"Target length": "input_gene_length",
"Reference sequence length": "reference_gene_length",
"% Coverage of reference": "coverage_percentage",
"% Identity to reference": "sequence_identity",
"Alignment length": None,
"Closest reference accession": "reference_accession",
"Closest reference name": None,
"HMM accession": None,
"HMM description": None,
"Hierarchy node": None,
# Fields we compute below (not in TSV)
"amino_acid_mutation": "amino_acid_mutation",
"nucleotide_mutation": "nucleotide_mutation",
"genetic_variation_type": "genetic_variation_type",
}

# AMP outputs the same column set for nuc and prot detections,
# with Start and Stop always in nt units; however target and
# reference length are reported in AA for proteins.
prot_field_map = nuc_field_map.copy()
prot_field_map.update({
"Target length": "input_protein_length",
"Reference sequence length": "reference_protein_length"
})

def __init__(self, source, metadata):
metadata["analysis_software_name"] = "amrfinderplus"
metadata["reference_database_name"] = "NCBI Reference Gene Database"
self.metadata = metadata

# check source for whether AMFP has been run in protein or nt mode

nucleotide_field_mapping = {
"Protein identifier": None,
"Contig id": "input_sequence_id",
"Start": "input_gene_start",
"Stop": "input_gene_stop",
"Strand": "strand_orientation",
"Gene symbol": "gene_symbol",
"Sequence name": "gene_name",
"Scope": None,
"Element type": None,
"Element subtype": None,
"Class": "drug_class",
"Subclass": "antimicrobial_agent",
"Method": None,
"Target length": "input_protein_length",
"Reference sequence length": "reference_protein_length",
"% Coverage of reference sequence": "coverage_percentage",
"% Identity to reference sequence": "sequence_identity",
"Alignment length": None,
"Accession of closest sequence": "reference_accession",
"Name of closest sequence": None,
"HMM id": None,
"HMM description": None,
"AA Mutation": "amino_acid_mutation",
"Nucleotide Mutation": "nucleotide_mutation",
"genetic_variation_type": "genetic_variation_type",
}
protein_field_mapping = {
"Protein identifier": "input_sequence_id",
"Gene symbol": "gene_symbol",
"Sequence name": "gene_name",
"Scope": None,
"Element": None,
"Element subtype": None,
"Class": "drug_class",
"Subclass": "antimicrobial_agent",
"Method": None,
"Target length": "input_protein_length",
"Reference sequence length": "reference_protein_length",
"% Coverage of reference sequence": "coverage_percentage",
"% Identity to reference sequence": "sequence_identity",
"Alignment length": None,
"Accession of closest sequence": "reference_accession",
"Name of closest sequence": None,
"HMM id": None,
"HMM description": None,
"AA Mutation": "amino_acid_mutation",
"genetic_variation_type": "genetic_variation_type",
}

with open(source) as fh:
header = next(fh).strip().split("\t")
try:
first_result = next(fh)
prot_id = header.index("Protein identifier")
if first_result.strip().split("\t")[prot_id] == "NA":
self.field_mapping = nucleotide_field_mapping
else:
self.field_mapping = protein_field_mapping
except StopIteration:
# doesn't really matter which mapping as this error indicates
# this is an empty results file
self.field_mapping = nucleotide_field_mapping

super().__init__(source, self.field_mapping, self.metadata)
# We pass None for the field_map as it differs depending on
# whether we return a nucleotide or protein variant detection.
# TODO: refactor field_map out of super's constructor, and make
# it a parameter on super's hARMonize().
super().__init__(source, None, self.metadata)

def parse(self, handle):
"""
Expand All @@ -98,11 +76,16 @@ def parse(self, handle):
skipped_truncated = 0
reader = csv.DictReader(handle, delimiter="\t")
for result in reader:
# replace NA value with None for consitency

# Replace NA value with None for consistency
for field, value in result.items():
if value == "NA":
result[field] = None

# Skip reported virulence genes
if result['Type'] == "VIRULENCE":
continue

# AFP reports partial hits so to avoid misleadingly listing these
# as present skip results with INTERNAL_STOP
# recommended by developers
Expand All @@ -113,24 +96,40 @@ def parse(self, handle):
# "POINT" indicates mutational resistance
# amrfinderplus has no special fields but the mutation itself is
# appended to the symbol name so we want to split this
result["AA Mutation"] = None
result["Nucleotide Mutation"] = None
result["genetic_variation_type"] = GENE_PRESENCE
result['amino_acid_mutation'] = None
result['nucleotide_mutation'] = None
result['genetic_variation_type'] = GENE_PRESENCE

if result["Element subtype"] == "POINT":
gene_symbol, mutation = result["Gene symbol"].rsplit("_", 1)
result["Gene symbol"] = gene_symbol
if result['Subtype'] == "POINT":
gene_symbol, mutation = result['Element symbol'].rsplit("_", 1)
result['Element symbol'] = gene_symbol
_, ref, pos, alt, _ = re.split(r"(\D+)(\d+)(\D+)", mutation)
# this means it is a protein mutation
if result["Method"] in ["POINTX", "POINTP"]:
result["AA Mutation"] = f"p.{ref}{pos}{alt}"
result["genetic_variation_type"] = AMINO_ACID_VARIANT
elif result["Method"] == "POINTN":
if result['Method'] in ["POINTX", "POINTP"]:
result['amino_acid_mutation'] = f"p.{ref}{pos}{alt}"
result['genetic_variation_type'] = AMINO_ACID_VARIANT
elif result['Method'] == "POINTN":
# e.g., 23S_G2032G ampC_C-11C -> c.2032G>G
result["Nucleotide Mutation"] = f"c.{pos}{ref}>{alt}"
result["genetic_variation_type"] = NUCLEOTIDE_VARIANT
result['nucleotide_mutation'] = f"c.{pos}{ref}>{alt}"
result['genetic_variation_type'] = NUCLEOTIDE_VARIANT

# Determine the field_map to use depending on the method used
# The following seems to cover all bases with a minimum of fuss
have_prot = result['Protein id'] is not None
method = result['Method']
if method.endswith('P') or method.endswith('X'):
field_map = self.prot_field_map
elif method.endswith('N'):
field_map = self.nuc_field_map
elif method in ['COMPLETE', 'HMM']:
field_map = self.prot_field_map if have_prot else self.nuc_field_map
else:
warnings.warn(f"Assuming unknown method {method} implies a protein detection"
f" in {self.metadata['input_file_name']}")
field_map = self.prot_field_map

yield self.hAMRonize(result, self.metadata)
# This uses the "override hack" that should perhaps be cleaned up
yield self.hAMRonize(result, self.metadata, field_map)

if skipped_truncated > 0:
warnings.warn(f"Skipping {skipped_truncated} records with INTERNAL_STOP "
Expand Down
16 changes: 12 additions & 4 deletions hAMRonization/Interfaces.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,16 +50,24 @@ def __init__(self, source, field_map, metadata):
except Exception:
self.stream.close()

def hAMRonize(self, report_data, metadata):
# TODO: the field_map_override is a half-hack to support the scenario
# (as for amrfinderplus) where different records need different mappings,
# so setting a field_map in the constructor makes no sense.
# It might be cleaner to remove it from the constructor altogether and
# make it a parameter of this method (which is the only place where it
# is referenced anyway), and subclasses can trivially pass it in.
def hAMRonize(self, report_data, metadata, field_map_override=None):
"""
Convert a line of parsed AMR report in original format to the
hAMRonization specification
- report_result parsed dict of single results from report
- metadata dict of additional metadata fields that need added
- report_data parsed dict of single result from report
- metadata dict of additional metadata fields
- field_map_override optional override of field_map passed in c'tor
"""
hAMRonized_result_data = {**metadata}

for original_field, hAMRonized_field in self.field_map.items():
field_map = field_map_override or self.field_map
for original_field, hAMRonized_field in field_map.items():
if hAMRonized_field:
hAMRonized_result_data[hAMRonized_field] = report_data[original_field]

Expand Down
68 changes: 0 additions & 68 deletions hAMRonization/PointFinderIO.py

This file was deleted.

4 changes: 2 additions & 2 deletions hAMRonization/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,15 @@ All reimplemented by @fmaguire in refactored code.

Original parser and mapping authors:
- AbricateIO.py: @dfornika
- AmrFinderPlusIO.py: @dfornika @fmaguire
- AmrFinderPlusIO.py: @zwets @dfornika @fmaguire
- AmrPlusPlusIO.py: @fmaguire
- AribaIO.py: @fmaguire
- CSStarIO.py: @fmaguire
- DeepArgIO.py: @fmaguire
- GrootIO.py: @fmaguire
- KmerResistanceIO.py: @fmaguire
- ResFamsIO.py: @fmaguire
- ResFinderIO.py: @raphenya @fmaguire
- ResFinderIO.py: @zwets @raphenya @fmaguire
- PointFinderIO.py: @fmaguire
- RgiIO.py: @dfornika @raphenya @fmaguire
- SraxIO.py: @fmaguire
Expand Down
Loading

0 comments on commit fdd000d

Please sign in to comment.