RefSeqMrnaId -> UniProt accession mapping #16

mariacuria · 2024-11-14T21:10:35Z

Is there software that does this? Search message boards.
Manually check 5-10 rows of the table. Does cBioPortal provide RefSeq protein IDs (NP_<...>)? No, so we need a mapping from NM to NP.
Once you have the NP column, go to UniProt (or use their API) and add the column "UniProt accession".
Get the FASTA sequences for all NPs and all UniProt accessions. Run pairwise alignment, all against all (should take a couple of hours on the server; run it overnight). Use BLAST, CLUSTAL, T-COFFEE, or whatever works.
From the pairwise alignment you will get the equivalent UniProt position. Add it to your table. E.g., if you have position 94 in the mRNA, in about 80% of the cases it should be the same position in UniProt, but in about 20% it could differ, for instance position 120 in the UniProt canonical.
Add a QC step when parsing the alignment file: require at least 95% (or a similar threshold) of positions to be aligned.
Document everything.
Show the code to @seankim658.
Show the results during the Friday meeting.
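A minimal sketch of the position-mapping and QC steps above, assuming the aligner output has already been parsed into two gapped rows of equal length. The function names and the toy sequences are illustrative and not taken from the pipeline.

```python
from typing import Dict, Optional


def map_positions(aligned_query: str, aligned_target: str) -> Dict[int, Optional[int]]:
    """Map 1-based query positions to 1-based target positions.

    Both arguments are rows of one pairwise alignment, with '-' marking gaps.
    A query residue aligned to a gap in the target maps to None.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned rows must have the same length")
    mapping: Dict[int, Optional[int]] = {}
    q_pos = 0  # residues consumed so far in the query (e.g. the NP sequence)
    t_pos = 0  # residues consumed so far in the target (e.g. UniProt canonical)
    for q_char, t_char in zip(aligned_query, aligned_target):
        if t_char != "-":
            t_pos += 1
        if q_char != "-":
            q_pos += 1
            mapping[q_pos] = t_pos if t_char != "-" else None
    return mapping


def aligned_fraction(mapping: Dict[int, Optional[int]]) -> float:
    """Fraction of query positions that have an equivalent target position."""
    if not mapping:
        return 0.0
    return sum(pos is not None for pos in mapping.values()) / len(mapping)


def passes_qc(mapping: Dict[int, Optional[int]], threshold: float = 0.95) -> bool:
    """QC check: at least `threshold` of the query positions must be aligned."""
    return aligned_fraction(mapping) >= threshold


# Toy alignment: the target has one extra residue, so query position 4 shifts to 5.
toy = map_positions("MKT-AY", "MKTLA-")
```

In this toy case query position 4 maps to target position 5 and query position 5 has no equivalent, so the pair would fail a 95% threshold.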

Before you start, do this manually for EGFR: pick one position, run the entire workflow by hand, and show the result to @rajamazumder.

Calculate the frequency based on the number of patients and show this column next Friday.

mariacuria · 2024-11-15T19:46:00Z

New workflow:

Starting from the hg19 genomic positions, map them to Ensembl hg38 at the genome level.
Grab the corresponding Ensembl transcript ID.
Map to Ensembl protein ID.
Map to UniProt accession.
Show @rykahsay on Mon in the internal meeting.

mariacuria · 2024-11-21T19:00:09Z

Find a minimal tuple that uniquely characterizes each chromosomal position, so that GRCh38 positions can be traced back to the original JSON objects containing GRCh37 positions.
Extract chromosomal positions from JSON objects that are already in GRCh38.
Find ENSP IDs.
Find UniProt IDs, including isoforms.
Map UniProt IDs to UniProt canonical accession numbers using human_proteome_masterlist from GlyGen.
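One way the tracing tuple from the first step could look, as a sketch only. It assumes each JSON object carries `chr`, `startPosition` and `endPosition` fields; those field names, and whether this tuple is actually minimal for the data, are assumptions to verify against the real objects.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

PositionKey = Tuple[str, int, int]


def position_key(record: dict) -> PositionKey:
    """Build the (chromosome, start, end) tuple for one JSON object.

    The field names are hypothetical; adjust them to the actual schema.
    """
    return (str(record["chr"]), int(record["startPosition"]), int(record["endPosition"]))


def index_by_position(records: List[dict]) -> Dict[PositionKey, List[int]]:
    """Group record indices by position key.

    After liftover, each GRCh37 key can be looked up here to find every
    original JSON object that shares that position.
    """
    index: Dict[PositionKey, List[int]] = defaultdict(list)
    for i, record in enumerate(records):
        index[position_key(record)].append(i)
    return dict(index)
```

Keeping a list of indices per key means several mutation records at the same position are lifted over once and then fanned back out.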

mariacuria · 2024-11-22T19:47:55Z

3_get_ensp.py: handle chromosomal positions for which no ENSP has been found (do they even exist?)

mariacuria · 2024-11-27T16:34:01Z

#utils dir: /data/shared/repos/biomuta-old/utils

#config.json: /data/shared/repos/biomuta-old/pipeline/config.json

Script that generates the mapping between ENSP IDs and UniProt accession numbers and whether or not they are canonical: /data/shared/repos/biomuta-old/pipeline/convert_step2/cbioportal/4_canonical_yes_no.py

Its output (toy file, I haven't run with the main file yet): /data/shared/repos/biomuta-old/generated_datasets/2024_10_22/mapping_ids/canonical_toy.json

Script that compares FASTA sequences and needs modifications (in the same dir as 4_canonical_yes_no.py): 5_compare_fasta.py. Also, instead of printing to the console, please write the results to a file (format doesn't matter) and check whether we have a 100% match for every sequence.
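A rough sketch of the requested change, not the contents of 5_compare_fasta.py. It assumes the sequences to compare are available as (ID, ID) pairs across two FASTA sources; the function names and the TSV layout are placeholders.

```python
from typing import Dict, Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {id: sequence}; the ID is the first header token."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def compare_sequences(
    pairs: Iterable[Tuple[str, str]],
    seqs_a: Dict[str, str],
    seqs_b: Dict[str, str],
) -> List[Tuple[str, str, bool]]:
    """Return (id_a, id_b, identical) for each pair; a missing ID counts as a mismatch."""
    rows = []
    for id_a, id_b in pairs:
        identical = id_a in seqs_a and id_b in seqs_b and seqs_a[id_a] == seqs_b[id_b]
        rows.append((id_a, id_b, identical))
    return rows


def write_report(rows: List[Tuple[str, str, bool]], out_path: str) -> int:
    """Write the comparison as a TSV file and return the number of mismatches."""
    mismatches = 0
    with open(out_path, "w") as out:
        out.write("id_a\tid_b\tidentical\n")
        for id_a, id_b, identical in rows:
            out.write(f"{id_a}\t{id_b}\t{identical}\n")
            mismatches += not identical
    return mismatches
```

A return value of 0 from `write_report` would mean every compared pair matched 100%.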

mariacuria · 2024-12-17T12:00:04Z

Update on Ensembl protein ID -> UniProt accession mapping:

111,176 unique Ensembl protein IDs were found in cBioPortal data:
- 87,679 IDs were mapped using GlyGen's human_protein_transcriptlocus.csv -> ensp_to_uniprot.json
- 22,854 IDs were mapped using UniProt API -> gffutils_ensp_to_uniprot_mappings.json
  - 12,433 correspond to UniProt canonical accessions
  - 6,426 non-canonical
  - Need to check what happened to the remaining 3,995 IDs
- 643 IDs remain unmapped
Filtering out non-canonical accessions: done.
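For reference, the counts above are internally consistent. The short check below only restates the numbers from this update; the variable names are illustrative.

```python
# Counts reported in this update.
total_ensp = 111_176
mapped_glygen = 87_679
mapped_uniprot_api = 22_854
unmapped = 643

# The three groups add up to the total number of unique ENSP IDs.
assert mapped_glygen + mapped_uniprot_api + unmapped == total_ensp

# Within the UniProt API group, canonical plus non-canonical leaves 3,995 IDs unaccounted for.
canonical = 12_433
non_canonical = 6_426
unaccounted = mapped_uniprot_api - canonical - non_canonical
assert unaccounted == 3_995
```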

mariacuria · 2024-12-18T13:33:42Z

Filter canonical accessions (4_canonical_yes_no.py):

ensp_to_uniprot.json -> ensp_to_uniprot_canonical.json
gffutils_ensp_to_uniprot_mappings.json -> ensp_to_uniprot_canonical_gffutils.json
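A sketch of what this filtering step might look like, not the actual contents of 4_canonical_yes_no.py. It assumes each input JSON is a flat object mapping one ENSP ID to one UniProt accession and that the set of canonical accessions has already been loaded (for example from the GlyGen masterlist); both the layout and the function names are assumptions.

```python
import json
from typing import Dict, Set


def filter_canonical(ensp_to_uniprot: Dict[str, str], canonical: Set[str]) -> Dict[str, str]:
    """Keep only ENSP IDs whose UniProt accession is in the canonical set."""
    return {ensp: acc for ensp, acc in ensp_to_uniprot.items() if acc in canonical}


def filter_file(in_path: str, out_path: str, canonical: Set[str]) -> int:
    """Filter one mapping file, write the result, and return how many entries were kept."""
    with open(in_path) as handle:
        mapping = json.load(handle)
    kept = filter_canonical(mapping, canonical)
    with open(out_path, "w") as handle:
        json.dump(kept, handle, indent=2)
    return len(kept)


# Placeholder IDs, for illustration only.
example = filter_canonical(
    {"ENSP_A": "ACC1-1", "ENSP_B": "ACC1-2", "ENSP_C": "ACC2-1"},
    canonical={"ACC1-1", "ACC2-1"},
)
```

Returning the kept count makes it easy to compare against the canonical and non-canonical totals reported in the update.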

mariacuria · 2024-12-18T14:28:15Z

Checking why only 87,679 out of 111,176 ENSP IDs were found in GlyGen's human_protein_transcriptlocus.csv

mariacuria self-assigned this Nov 14, 2024

mariacuria mentioned this issue Nov 18, 2024: Liftover hg19 -> hg38 #17 (Closed)

mariacuria added the enhancement label Nov 18, 2024

mariacuria assigned Reeya123 Nov 27, 2024

mariacuria unassigned Reeya123 Dec 18, 2024
