Updates and new features #47

Merged
merged 14 commits into from
Jun 26, 2024
12 changes: 7 additions & 5 deletions .github/workflows/tests.yml
@@ -5,6 +5,9 @@ on: [push, pull_request]
jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.12"]
steps:
- uses: actions/checkout@v2
- uses: actions/cache@v2
@@ -13,18 +16,17 @@ jobs:
key: ${{ runner.os }}-pip-${{ hashFiles('**/setup.py') }}
restore-keys: |
${{ runner.os }}-pip-
- name: Set up Python 3.8
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install .[rest_api]
pip install nose coverage python-coveralls
pip install .[rest_api,tests]
- name: Run regular unit tests
run: |
nosetests protmapper -v --with-coverage --cover-inclusive --cover-package=protmapper
pytest protmapper/tests --cov=protmapper
- name: Run CLI smoketests
run: |
protmapper protmapper/tests/cli_input.csv output.csv --no_methionine_offset --no_orthology_mapping --no_isoform_mapping
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,4 +1,4 @@
FROM python:3.6
FROM python:3.11

RUN pip install protmapper[rest_api] && \
python -m protmapper.resources
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
BSD 2-Clause License

Copyright (c) 2018, INDRA Labs
Copyright (c) 2024, Gyori lab
All rights reserved.

Redistribution and use in source and binary forms, with or without
72 changes: 58 additions & 14 deletions README.md
@@ -1,27 +1,43 @@
# Protmapper
The Protmapper maps references to protein sites to the human reference
The Protmapper maps protein sites to the human reference
sequence based on UniProt, PhosphoSitePlus, and manual curation.


## Installation
## Installation and usage

### Python package
The Protmapper is a Python package that is available on PyPI and can be
installed as:

```
```bash
pip install protmapper
```

### Docker container
Alternatively, the Protmapper Docker container can be run to expose it as
a REST API as:
Protmapper can be run as a local service via a Docker container exposing a
REST API as:

```bash
docker run -d -p 8008:8008 gyorilab/protmapper:latest
```

Example: once the container is running, you can send requests to the REST API

```bash
curl -X POST -H "Content-Type: application/json" -d '{"site_list": [["P28482", "uniprot", "T", "184"]]}' http://localhost:8008/map_sitelist_to_human_ref
```
docker run -d -p 8008:8008 labsyspharm/protmapper:latest

which is equivalent to the following Python code using the `requests` package

```python
import requests
url = 'http://localhost:8008/map_sitelist_to_human_ref'
data = {'site_list': [['P28482', 'uniprot', 'T', '184']]}
response = requests.post(url, json=data)
print(response.json())
```

## Command line interface
### Command line interface
In addition to supporting usage via a Python API and a REST service,
Protmapper also provides a command line interface that can be used as follows.

@@ -39,7 +55,7 @@ positional arguments:
on the protein.
output Path to the output file to be generated. Each line of
the output file corresponds to a line in the input
file. Each linerepresents a mapped site produced by
file. Each line represents a mapped site produced by
Protmapper.

optional arguments:
@@ -61,25 +77,53 @@ optional arguments:
--no_isoform_mapping If given, will not check sequence positions for known
modifications in other human isoforms of the protein
(based on PhosphoSitePlus data).
```

Example: the sample file [cli_input.csv](https://raw.githubusercontent.com/gyorilab/protmapper/master/protmapper/tests/cli_input.csv)
has the following content

```csv
MAPK1,hgnc,T,183
MAPK1,hgnc,T,184
MAPK1,hgnc,T,185
MAPK1,hgnc,T,186
```

By running the following command

```bash
protmapper cli_input.csv output.csv
```

we get `output.csv` which has the following content

```csv
up_id,error_code,valid,orig_res,orig_pos,mapped_id,mapped_res,mapped_pos,description,gene_name
P28482,,False,T,183,P28482,T,185,INFERRED_MOUSE_SITE,MAPK1
P28482,,False,T,184,P28482,T,185,INFERRED_METHIONINE_CLEAVAGE,MAPK1
P28482,,True,T,185,,,,VALID,MAPK1
P28482,,False,T,186,,,,NO_MAPPING_FOUND,MAPK1
```
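
The output file can be post-processed with standard tooling. The following is a minimal sketch, assuming the `output.csv` content shown above (embedded here as a string so the snippet runs without Protmapper installed). The `summarize` helper is a hypothetical name used for illustration and is not part of Protmapper.

```python
import csv
import io

# Content of output.csv, copied from the README example above.
OUTPUT_CSV = """\
up_id,error_code,valid,orig_res,orig_pos,mapped_id,mapped_res,mapped_pos,description,gene_name
P28482,,False,T,183,P28482,T,185,INFERRED_MOUSE_SITE,MAPK1
P28482,,False,T,184,P28482,T,185,INFERRED_METHIONINE_CLEAVAGE,MAPK1
P28482,,True,T,185,,,,VALID,MAPK1
P28482,,False,T,186,,,,NO_MAPPING_FOUND,MAPK1
"""


def summarize(csv_text):
    """Group output rows into valid, remapped and unmapped sites.

    Hypothetical helper for illustration only.
    """
    summary = {'valid': [], 'remapped': [], 'unmapped': []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        orig = f"{row['gene_name']} {row['orig_res']}{row['orig_pos']}"
        if row['valid'] == 'True':
            # The site exists as given on the reference sequence.
            summary['valid'].append(orig)
        elif row['mapped_pos']:
            # The site is invalid as given, but a mapping was found.
            mapped = f"{row['mapped_res']}{row['mapped_pos']}"
            summary['remapped'].append((orig, mapped, row['description']))
        else:
            # The site is invalid and no mapping was found.
            summary['unmapped'].append(orig)
    return summary


summary = summarize(OUTPUT_CSV)
for category, sites in summary.items():
    print(category, sites)
```

In this sketch the `valid` column is compared as the string `'True'` because the CSV stores booleans as text, and an empty `mapped_pos` field is treated as "no mapping".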



## Documentation
For detailed documentation of Protmapper, visit http://protmapper.readthedocs.io

## Funding
The development of protmapper is funded under the DARPA Automated Scientific Discovery Framework project (ARO grant W911NF018-1-0124).
The development of Protmapper is funded under the DARPA grants W911NF018-1-0124
and HR00112220036.

## Citation

```bibtex
@article{bachman2019protmapper,
author = {Bachman, John A and Gyori, Benjamin M and Sorger, Peter K},
@article{bachman2022protmapper,
author = {Bachman, John A and Sorger, Peter K and Gyori, Benjamin M},
doi = {10.1101/822668},
journal = {bioRxiv},
publisher = {Cold Spring Harbor Laboratory},
title = {{Assembling a phosphoproteomic knowledge base using ProtMapper to normalize phosphosite information from databases and text mining}},
url = {https://www.biorxiv.org/content/early/2019/11/06/822668.1},
year = {2019}
title = {{Assembling a corpus of phosphoproteomic annotations using ProtMapper to normalize site information from databases and text mining}},
url = {https://www.biorxiv.org/content/10.1101/822668v4},
year = {2022}
}
```
8 changes: 7 additions & 1 deletion protmapper/__init__.py
@@ -1,4 +1,10 @@
__all__ = ['map_sites', 'get_site_annotations', 'MappedSite',
'InvalidSiteException', 'ProtMapper', 'default_mapper',
'resource_dir']


__version__ = '0.0.29'

import os
import logging

@@ -15,5 +21,5 @@
logger = logging.getLogger('protmapper')

if not os.environ.get('INITIAL_RESOURCE_DOWNLOAD'):
from protmapper.api import ProtMapper, MappedSite
from protmapper.api import *
from protmapper.resources import resource_dir
159 changes: 153 additions & 6 deletions protmapper/api.py
@@ -1,7 +1,11 @@
__all__ = ['map_sites', 'get_site_annotations', 'MappedSite',
'InvalidSiteException', 'ProtMapper', 'default_mapper']

import os
import csv
import pickle
import logging
import tqdm
from requests.exceptions import HTTPError
from protmapper.resources import resource_dir_path
from protmapper import phosphosite_client, uniprot_client
@@ -14,6 +18,45 @@
'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y')


def map_sites(site_list, do_methionine_offset=True, do_orthology_mapping=True,
do_isoform_mapping=True):
"""Return a list of mapped sites for a list of input sites.

Parameters
----------
site_list : list of tuple
Each tuple in the list consists of the following entries:
(prot_id, prot_ns, residue, position).
do_methionine_offset : boolean
Whether to check for off-by-one errors in site position (possibly)
attributable to site numbering from mature proteins after
cleavage of the initial methionine. If True, checks the reference
sequence for a known modification at 1 site position greater
than the given one; if there exists such a site, creates the
mapping. Default is True.
do_orthology_mapping : boolean
Whether to check sequence positions for known modification sites
in mouse or rat sequences (based on PhosphoSitePlus data). If a
mouse/rat site is found that is linked to a site in the human
reference sequence, a mapping is created. Default is True.
do_isoform_mapping : boolean
Whether to check sequence positions for known modifications
in other human isoforms of the protein (based on PhosphoSitePlus
data). If a site is found that is linked to a site in the human
reference sequence, a mapping is created. Default is True.

Returns
-------
list of :py:class:`protmapper.api.MappedSite`
A list of MappedSite objects, one corresponding to each site in
the input list.
"""
return default_mapper.map_sitelist_to_human_ref(
site_list, do_methionine_offset=do_methionine_offset,
do_orthology_mapping=do_orthology_mapping,
do_isoform_mapping=do_isoform_mapping)


class InvalidSiteException(Exception):
pass

@@ -128,6 +171,88 @@ def has_mapping(self):
return (not self.not_invalid()) and \
(self.mapped_pos is not None and self.mapped_res is not None)

def get_site(self):
"""Return the site information as a tuple.

Returns
-------
tuple
A tuple containing the following entries:
(prot_id, prot_ns, residue, position).
"""
if self.not_invalid() or not self.has_mapping():
return self.up_id, 'uniprot', self.orig_res, self.orig_pos
else:
return self.mapped_id, 'uniprot', self.mapped_res, self.mapped_pos


def mapped_sites_to_sites(mapped_sites, include_invalid=False):
"""Return a list of sites from a list of MappedSite objects.

Parameters
----------
mapped_sites : list of :py:class:`protmapper.api.MappedSite`
A list of MappedSite objects.
include_invalid : Optional[bool]
If True, include sites that are known to be invalid in the output.
Default is False.

Returns
-------
list of tuple
A list of tuples, each containing the following entries:
(prot_id, prot_ns, residue, position).
"""
return [ms.get_site() for ms in mapped_sites
if ms.not_invalid() or include_invalid]


def get_site_annotations(sites):
"""Return annotations for a list of sites.

Parameters
----------
sites : list of tuple
Each tuple in the list consists of the following entries:
(prot_id, residue, position), where prot_id has to be a
UniProt ID.

Returns
-------
dict
A dictionary mapping each site to a list of annotations.
"""
annotations = _load_annotations()
site_annotations = {}
for site in sites:
site_annotations[site] = annotations.get(site, [])
return site_annotations


def _load_annotations():
import csv
from .resources import resource_manager
from collections import defaultdict
annotations_fname = resource_manager.get_create_resource_file('annotations')
evidence_fname = resource_manager.get_create_resource_file('annotations_evidence')
annotations_by_site = defaultdict(list)
# Read evidence into a dict keyed by annotation ID
evidences = defaultdict(list)
with open(evidence_fname, 'r') as fh:
reader = csv.DictReader(fh)
for row in reader:
evidences[row['ID']].append(row)
evidences = dict(evidences)
# Read annotations into a dict keyed by the site and add evidence
with open(annotations_fname, 'r') as fh:
reader = csv.DictReader(fh)
for row in reader:
site = (row['TARGET_UP_ID'], row['TARGET_RES'], row['TARGET_POS'])
row['evidence'] = evidences.get(row['ID'])
annotations_by_site[site].append(row)
annotations_by_site = dict(annotations_by_site)
return annotations_by_site


class ProtMapper(object):
"""
@@ -210,14 +335,33 @@ def __del__(self):
except:
pass

def map_sitelist_to_human_ref(self, site_list, **kwargs):
def map_sitelist_to_human_ref(self, site_list, do_methionine_offset=True,
do_orthology_mapping=True,
do_isoform_mapping=True):
"""Return a list of mapped sites for a list of input sites.

Parameters
----------
site_list : list of tuple
Each tuple in the list consists of the following entries:
(prot_id, prot_ns, residue, position).
do_methionine_offset : boolean
Whether to check for off-by-one errors in site position (possibly)
attributable to site numbering from mature proteins after
cleavage of the initial methionine. If True, checks the reference
sequence for a known modification at 1 site position greater
than the given one; if there exists such a site, creates the
mapping. Default is True.
do_orthology_mapping : boolean
Whether to check sequence positions for known modification sites
in mouse or rat sequences (based on PhosphoSitePlus data). If a
mouse/rat site is found that is linked to a site in the human
reference sequence, a mapping is created. Default is True.
do_isoform_mapping : boolean
Whether to check sequence positions for known modifications
in other human isoforms of the protein (based on PhosphoSitePlus
data). If a site is found that is linked to a site in the human
reference sequence, a mapping is created. Default is True.

Returns
-------
@@ -226,12 +370,15 @@ def map_sitelist_to_human_ref(self, site_list, **kwargs):
the input list.
"""
mapped_sites = []
for ix, (prot_id, prot_ns, residue, position) in enumerate(site_list):
logger.info("Mapping site %d of %d, cache size %d" %
(ix + 1, len(site_list), len(self._cache)))
for ix, (prot_id, prot_ns, residue, position) in \
tqdm.tqdm(enumerate(site_list), desc='Mapping sites',
total=len(site_list)):
try:
ms = self.map_to_human_ref(prot_id, prot_ns, residue, position,
**kwargs)
ms = self.map_to_human_ref(
prot_id, prot_ns, residue, position,
do_methionine_offset=do_methionine_offset,
do_orthology_mapping=do_orthology_mapping,
do_isoform_mapping=do_isoform_mapping)
mapped_sites.append(ms)
except Exception as e:
logger.error("Error occurred mapping site "
1 change: 0 additions & 1 deletion protmapper/phosphosite_client.py
@@ -1,7 +1,6 @@
import csv
import gzip
import logging
from os.path import dirname, abspath, join
from collections import namedtuple, defaultdict
from protmapper.resources import resource_manager
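
The annotation lookup added to `protmapper/api.py` in this pull request keys each annotation by a `(TARGET_UP_ID, TARGET_RES, TARGET_POS)` tuple read from CSV, so all three entries are strings. The following is a minimal, self-contained sketch of that grouping, assuming small in-memory CSV tables in place of the real resource files. Only the `ID`, `TARGET_UP_ID`, `TARGET_RES` and `TARGET_POS` column names come from the diff; the `NOTE` and `SOURCE` columns, all row values, and the function names and signatures are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-ins for the 'annotations' and 'annotations_evidence'
# resource files; the values are made up for illustration.
ANNOTATIONS_CSV = """\
ID,TARGET_UP_ID,TARGET_RES,TARGET_POS,NOTE
1,P28482,T,185,example annotation A
2,P28482,T,185,example annotation B
3,P28482,Y,187,example annotation C
"""
EVIDENCE_CSV = """\
ID,SOURCE
1,example source X
1,example source Y
3,example source Z
"""


def load_annotations(annotations_text, evidence_text):
    """Group annotation rows by site and attach evidence rows by ID."""
    evidences = defaultdict(list)
    for row in csv.DictReader(io.StringIO(evidence_text)):
        evidences[row['ID']].append(row)
    evidences = dict(evidences)
    by_site = defaultdict(list)
    for row in csv.DictReader(io.StringIO(annotations_text)):
        site = (row['TARGET_UP_ID'], row['TARGET_RES'], row['TARGET_POS'])
        # Annotations without evidence rows get None here.
        row['evidence'] = evidences.get(row['ID'])
        by_site[site].append(row)
    return dict(by_site)


def lookup_annotations(sites, annotations):
    """Return a dict mapping each requested site to its annotations."""
    return {site: annotations.get(site, []) for site in sites}


annotations = load_annotations(ANNOTATIONS_CSV, EVIDENCE_CSV)
result = lookup_annotations(
    [('P28482', 'T', '185'), ('P28482', 'T', 185)], annotations)
for site, annots in result.items():
    print(site, len(annots))
```

Since `csv.DictReader` yields strings, the lookup with an integer position finds no annotations in this sketch, so passing the position as a string appears to be the safer choice.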
