RefSeqMrnaId -> UniProt accession mapping #16

Open
mariacuria opened this issue Nov 14, 2024 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@mariacuria
Contributor

  1. Is there software that does this? Search message boards.
  2. Manually check 5-10 rows of the table. Does cBioPortal provide RefSeq protein IDs (NP_<...>)? No, so we need a mapping from NM to NP.
  3. Once you have the NP column, go to UniProt (or use their API) and add the column "UniProt accession".
  4. Get the FASTA sequences for all NPs and all UniProt accessions. Do pairwise alignment, all against all (should take a couple of hours on the server; run it overnight). Use BLAST, CLUSTAL, T-COFFEE or whatever.
  5. From the pairwise alignment you will get the equivalent UniProt position. Add it to your table. E.g., if you have position 94 in the mRNA, in about 80% of cases it should be the same position in UniProt, but in about 20% it could differ, for instance position 120 in the UniProt canonical sequence.
  6. Add a QC step when parsing the alignment file: require at least 95% (or some similar threshold) of positions to be aligned.
  7. Document everything.
  8. Show the code to @seankim658.
  9. Show the results during the Friday meeting.
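
A minimal sketch of steps 5 and 6, assuming the aligner output has already been parsed into two gapped rows of equal length (one per sequence, `-` for gaps). The function names, the toy sequences and the `QC_THRESHOLD` constant are illustrative and not part of the existing pipeline.

```python
def map_positions(aligned_a, aligned_b):
    """Map 1-based residue positions in sequence A to positions in sequence B.

    Both inputs are rows of the same pairwise alignment, with '-' for gaps.
    Positions of A that face a gap in B are left out of the mapping.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    pos_a = pos_b = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a != "-":
            pos_a += 1
        if res_b != "-":
            pos_b += 1
        if res_a != "-" and res_b != "-":
            mapping[pos_a] = pos_b
    return mapping


def aligned_fraction(aligned_a, aligned_b):
    """Fraction of A's residues that face a residue (not a gap) in B."""
    length_a = sum(1 for res in aligned_a if res != "-")
    if length_a == 0:
        return 0.0
    return len(map_positions(aligned_a, aligned_b)) / length_a


QC_THRESHOLD = 0.95  # assumed cutoff, from the "at least 95%" suggestion

# Toy alignment (made-up sequences): B has a 3-residue insertion after position 4.
refseq_row = "MKTA---YIAKQR"
uniprot_row = "MKTAGGSYIAKQR"

mapping = map_positions(refseq_row, uniprot_row)
passes_qc = aligned_fraction(refseq_row, uniprot_row) >= QC_THRESHOLD
print(mapping[5], passes_qc)  # position 5 in A lines up with position 8 in B
```

Positions that fall opposite a gap are simply absent from the mapping, so a lookup for them should be treated as "no equivalent position" rather than silently reusing the original number.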

Before you start doing this, manually do this for EGFR. Get one position and do the entire workflow manually and show @rajamazumder.

Compute the frequency based on the number of patients and show this column next Friday.

@mariacuria mariacuria self-assigned this Nov 14, 2024
@mariacuria
Contributor Author

New workflow:

  1. Starting from the hg19 genomic positions, map them to hg38 (Ensembl) at the genome level.
  2. Grab the corresponding Ensembl transcript ID.
  3. Map to the Ensembl protein ID.
  4. Map to the UniProt accession.
  5. Show @rykahsay on Mon in the internal meeting.
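
Steps 2 to 4 amount to a chain of ID lookups. The sketch below assumes each mapping has already been loaded into a plain dict; the table names and the placeholder IDs are hypothetical, and the hg19 to hg38 liftover of step 1 is not shown. It reports which table broke the chain, which helps when counting unmapped IDs later.

```python
def resolve_chain(start_id, tables):
    """Follow start_id through an ordered list of (name, mapping) tables.

    Returns (final_id, None) on success, or (None, name) where name is
    the first table in which the lookup failed.
    """
    current = start_id
    for name, table in tables:
        if current not in table:
            return None, name
        current = table[current]
    return current, None


# Placeholder IDs for illustration only, not real Ensembl/UniProt identifiers.
tables = [
    ("transcript_to_protein", {"ENST_A": "ENSP_A", "ENST_B": "ENSP_B"}),
    ("protein_to_uniprot", {"ENSP_A": "ACC_A"}),
]

print(resolve_chain("ENST_A", tables))  # resolved all the way to an accession
print(resolve_chain("ENST_B", tables))  # fails at the protein_to_uniprot step
```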

@mariacuria mariacuria added the enhancement New feature or request label Nov 18, 2024
@mariacuria
Contributor Author

mariacuria commented Nov 21, 2024

  • Find a minimal tuple of fields that uniquely identifies each chromosomal position, so that GRCh38 positions can be traced back to the original JSON objects containing GRCh37 positions
  • Extract chromosomal positions from JSON objects that are already in GRCh38
  • Find ENSP IDs
  • Find UniProt IDs, including isoforms
  • Map UniProt IDs to UniProt canonical accession numbers using human_proteome_masterlist from GlyGen
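
One way to look for such a minimal tuple is a brute-force search over field combinations, smallest first. This is only a sketch: the field names (`chr`, `pos`, `alt`) and the toy records are assumptions, not the actual keys in the cBioPortal JSON objects, and the records are assumed to be deduplicated beforehand.

```python
from itertools import combinations


def minimal_unique_key(records, fields):
    """Return the smallest tuple of field names whose values are unique
    across all records, or None if no combination of fields is unique."""
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            keys = {tuple(rec[field] for field in combo) for rec in records}
            if len(keys) == len(records):
                return combo
    return None


# Toy records with hypothetical field names.
records = [
    {"chr": "7", "pos": 100, "alt": "T"},
    {"chr": "7", "pos": 200, "alt": "T"},
    {"chr": "12", "pos": 100, "alt": "G"},
]

print(minimal_unique_key(records, ["chr", "pos", "alt"]))
```

A key found on a sample is only a candidate; it should be re-checked against the full dataset, since a combination that is unique on a few rows may collide on the rest.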

@mariacuria
Contributor Author

mariacuria commented Nov 22, 2024

  • 3_get_ensp.py: handle chromosomal positions for which no ENSP has been found (do they even exist?)

@mariacuria
Contributor Author

mariacuria commented Nov 27, 2024

#utils dir: /data/shared/repos/biomuta-old/utils

#config.json: /data/shared/repos/biomuta-old/pipeline/config.json

Script that generates the mapping between ENSP IDs and UniProt accession numbers and whether or not they are canonical: /data/shared/repos/biomuta-old/pipeline/convert_step2/cbioportal/4_canonical_yes_no.py

Its output (toy file, I haven't run with the main file yet): /data/shared/repos/biomuta-old/generated_datasets/2024_10_22/mapping_ids/canonical_toy.json

Script that compares FASTA sequences and still needs modifications (in the same dir as 4_canonical_yes_no.py): 5_compare_fasta.py. Also, instead of printing to the console, please write the results to a file (format doesn't matter) and check whether we have a 100% match for every sequence.
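
A rough sketch of what the requested change could look like, not the actual contents of 5_compare_fasta.py: it parses FASTA files with the standard library only, writes one line per compared pair to a TSV report, and returns whether every pair matched exactly. The function names, the report layout and the idea of passing explicit ID pairs are assumptions.

```python
def read_fasta(path):
    """Parse a FASTA file into {record_id: sequence}.

    The record ID is taken as the first word after '>'.
    """
    chunks = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                current_id = line[1:].split()[0]
                chunks[current_id] = []
            elif current_id is not None:
                chunks[current_id].append(line)
    return {rec_id: "".join(parts) for rec_id, parts in chunks.items()}


def compare_sequences(pairs, seqs_a, seqs_b, report_path):
    """Write one TSV line per (id_a, id_b) pair to report_path.

    Returns True only if every pair is present in both inputs and identical.
    """
    all_match = True
    with open(report_path, "w") as out:
        out.write("id_a\tid_b\tstatus\n")
        for id_a, id_b in pairs:
            seq_a = seqs_a.get(id_a)
            seq_b = seqs_b.get(id_b)
            if seq_a is None or seq_b is None:
                status = "missing"
            elif seq_a == seq_b:
                status = "match"
            else:
                status = "mismatch"
            all_match = all_match and status == "match"
            out.write(f"{id_a}\t{id_b}\t{status}\n")
    return all_match
```

Usage would be along the lines of `compare_sequences(pairs, read_fasta(ensp_fasta), read_fasta(uniprot_fasta), "fasta_comparison.tsv")`, where the two FASTA paths and the list of ID pairs come from the mapping step.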

@mariacuria
Contributor Author

mariacuria commented Dec 17, 2024

Update on ENSEMBL protein ID -> UniProt accession mapping:

  • 111,176 unique ENSEMBL protein IDs were found in cBioPortal data
    • 87,679 IDs were mapped using GlyGen's human_protein_transcriptlocus.csv -> ensp_to_uniprot.json
    • 22,854 IDs were mapped using UniProt API -> gffutils_ensp_to_uniprot_mappings.json
      • 12,433 correspond to UniProt canonical accessions
      • 6,426 non-canonical
      • Need to check what happened to 3,995 remaining IDs
    • 643 IDs remain unmapped
  • Filtering out non-canonical accessions: done.
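
As a quick consistency check on the counts above, the three top-level groups should add up to the total, and the three sub-groups should add up to the UniProt API group. The variable names are just labels for the numbers reported in this comment.

```python
total_ensp = 111_176

mapped_glygen = 87_679       # via human_protein_transcriptlocus.csv
mapped_uniprot_api = 22_854  # via UniProt API
unmapped = 643

# Breakdown of the IDs mapped through the UniProt API
canonical = 12_433
non_canonical = 6_426
still_to_check = 3_995

assert mapped_glygen + mapped_uniprot_api + unmapped == total_ensp
assert canonical + non_canonical + still_to_check == mapped_uniprot_api

mapped_fraction = (mapped_glygen + mapped_uniprot_api) / total_ensp
print(f"{mapped_fraction:.1%} of ENSP IDs mapped")  # → 99.4% of ENSP IDs mapped
```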

@mariacuria
Contributor Author

mariacuria commented Dec 18, 2024

Filter canonical accessions (4_canonical_yes_no.py):

  • ensp_to_uniprot.json -> ensp_to_uniprot_canonical.json
  • gffutils_ensp_to_uniprot_mappings.json -> ensp_to_uniprot_canonical_gffutils.json
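
For readers without access to 4_canonical_yes_no.py, the filtering step can be pictured roughly as below. This is a sketch under assumptions, not the script itself: it assumes each input JSON is a flat `{ENSP ID: UniProt accession}` object and that the set of canonical accessions has already been extracted from the GlyGen masterlist. The IDs shown are placeholders.

```python
import json


def split_by_canonical(ensp_to_uniprot, canonical_accessions):
    """Split an {ENSP: accession} mapping into canonical and non-canonical parts."""
    canonical = {}
    non_canonical = {}
    for ensp, accession in ensp_to_uniprot.items():
        if accession in canonical_accessions:
            canonical[ensp] = accession
        else:
            non_canonical[ensp] = accession
    return canonical, non_canonical


# Placeholder data; real inputs would come from json.load() on the files above.
toy_mapping = {"ENSP_A": "ACC_1", "ENSP_B": "ACC_2", "ENSP_C": "ACC_1"}
toy_canonical_set = {"ACC_1"}

kept, dropped = split_by_canonical(toy_mapping, toy_canonical_set)
print(json.dumps(kept, indent=2))
```

Keeping the dropped entries in a second dict, rather than discarding them, makes it easy to report how many IDs were filtered out.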

@mariacuria
Contributor Author

Checking why only 87,679 out of 111,176 ENSP IDs were found in GlyGen's human_protein_transcriptlocus.csv


2 participants