ITEP ID standards
Organism IDs consist of two numbers separated by a period ("."):
- The TaxID (an integer), and
- A "version number" (an integer)
The version number can distinguish between different annotations of the same genome or, more commonly, between different genomes with the same TaxID. An example organism ID is 83333.1, which is for an organism with TaxID 83333 and version number 1.
A mapping between organism name and ID is stored in the database and also exists in the file $ITEP_ROOT/organisms. DO NOT DELETE THIS FILE.
The organism ID will always match the regex "\d+\.\d+"
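As an illustration, an organism ID can be validated and split into its two parts with a regex in which the period is escaped so that it matches only a literal ".". This is a minimal sketch; the function name `split_organism_id` is hypothetical and is not part of ITEP.

```python
import re

# Organism ID: TaxID, a literal period, then the version number.
ORGANISM_ID_RE = re.compile(r"(\d+)\.(\d+)")


def split_organism_id(organism_id):
    """Return (taxid, version) as integers, or raise ValueError.

    Hypothetical helper for illustration only; not an ITEP function.
    """
    match = ORGANISM_ID_RE.fullmatch(organism_id)
    if match is None:
        raise ValueError("Not a valid organism ID: %r" % organism_id)
    return int(match.group(1)), int(match.group(2))


taxid, version = split_organism_id("83333.1")
print(taxid, version)
```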
The gene IDs in ITEP are designed to be compatible with RAST and with PubSEED. They are generated as follows:
- If you download a genbank file from PubSEED, ITEP will use the same IDs automatically.
- ITEP will also use RAST IDs if the user uses the web interface to RAST to download the tab-delimited file. See this tutorial for details.
- Otherwise, ITEP (in particular the convertGenbank2Table.py script) will generate IDs with the format
fig|[organism_ID].peg.[Number]
where [Number] is incremented by 1 in the order in which the genes appear in the Genbank file. The mapping from ITEP IDs to other IDs in the input Genbank files is generated automatically and stored in the file $ITEP_ROOT/aliases/aliases
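The fallback numbering scheme can be sketched as follows. This is not the code from convertGenbank2Table.py; the function name `make_gene_ids` is hypothetical, and the starting value of 1 is an assumption, since the text above only says that the number increases by 1 per gene.

```python
def make_gene_ids(organism_id, n_genes, start=1):
    """Build gene IDs of the form fig|[organism_ID].peg.[Number].

    Hypothetical helper for illustration only. The default start=1 is an
    assumption; genes are numbered in the order they appear in the file.
    """
    return [
        "fig|%s.peg.%d" % (organism_id, number)
        for number in range(start, start + n_genes)
    ]


for gene_id in make_gene_ids("83333.1", 3):
    print(gene_id)
```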
The ITEP gene ID will always match the following regex:
fig\|\d+\.\d+\.peg\.\d+
Organism IDs can be obtained by capturing the first two numbers:
fig\|(\d+\.\d+)\.peg\.\d+
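For example, the capturing regex can be applied in Python like this. It is a minimal sketch; the function name `organism_from_gene_id` is hypothetical and is not part of ITEP.

```python
import re

# Same pattern as above; group 1 captures the organism ID.
GENE_ID_RE = re.compile(r"fig\|(\d+\.\d+)\.peg\.\d+")


def organism_from_gene_id(gene_id):
    """Return the organism ID embedded in an ITEP gene ID.

    Hypothetical helper for illustration only; not an ITEP function.
    """
    match = GENE_ID_RE.fullmatch(gene_id)
    if match is None:
        raise ValueError("Not a valid ITEP gene ID: %r" % gene_id)
    return match.group(1)


print(organism_from_gene_id("fig|83333.1.peg.42"))
```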
The contig name in input Genbank files is concatenated with the ITEP organism ID, separated by a period, to produce the ITEP contig ID:
Contig name from Genbank file : contig1
ITEP organism ID: 83333.1
---------------
ITEP contig ID: 83333.1.contig1
This is done because contig names are often something generic like "contig1" and we want to avoid collisions of the same contig name in different organisms.
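A minimal sketch of this naming scheme, and of recovering the two parts again, is shown here. Both function names are hypothetical and not part of ITEP. The split relies on the organism ID always having the form "\d+\.\d+", so everything after the second period is treated as the original contig name.

```python
import re

# Organism ID (two integers joined by a period), a period, then the contig name.
CONTIG_ID_RE = re.compile(r"(\d+\.\d+)\.(.+)")


def make_contig_id(organism_id, contig_name):
    """Prefix a Genbank contig name with the ITEP organism ID."""
    return "%s.%s" % (organism_id, contig_name)


def split_contig_id(contig_id):
    """Return (organism_id, contig_name) from an ITEP contig ID."""
    match = CONTIG_ID_RE.fullmatch(contig_id)
    if match is None:
        raise ValueError("Not a valid ITEP contig ID: %r" % contig_id)
    return match.group(1), match.group(2)


# The same generic contig name no longer collides across organisms.
print(make_contig_id("83333.1", "contig1"))
print(make_contig_id("511145.12", "contig1"))
```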
When you run the tBLASTn wrapper you will get an informative ID in this format:
TBLASTN_CONTIG_$CONTIG_START_$START_STOP_$STOP
where $CONTIG is the ITEP contig ID for the tBLASTn hit, $START is the location of the first base (1-indexed) of the tBLASTn hit within that contig and $STOP is the location of the last base. ITEP includes functions for parsing this and supports including IDs of this format in a tree. If a tBLASTn ID is included in a Newick tree, the neighborhood computation functions will automatically compute the neighborhoods for the tBLASTn hit so you can compare neighborhoods of called and uncalled genes.
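The ID format can be taken apart with a regex, as in the sketch below. This is not ITEP's own parsing function, whose name and interface are not given here; `parse_tblastn_id` is a hypothetical stand-in. The contig portion is matched greedily so that contig names containing underscores are still handled, on the assumption that the ID ends with `_START_<int>_STOP_<int>`.

```python
import re

# Greedy (.+) lets the contig ID contain underscores; the numeric
# START/STOP fields are anchored at the end of the string by fullmatch.
TBLASTN_ID_RE = re.compile(r"TBLASTN_CONTIG_(.+)_START_(\d+)_STOP_(\d+)")


def parse_tblastn_id(tblastn_id):
    """Return (contig_id, start, stop) from a tBLASTn hit ID.

    Hypothetical helper for illustration only; start and stop are the
    1-indexed positions of the first and last base of the hit.
    """
    match = TBLASTN_ID_RE.fullmatch(tblastn_id)
    if match is None:
        raise ValueError("Not a valid tBLASTn ID: %r" % tblastn_id)
    return match.group(1), int(match.group(2)), int(match.group(3))


contig, start, stop = parse_tblastn_id(
    "TBLASTN_CONTIG_83333.1.contig1_START_100_STOP_400"
)
print(contig, start, stop)
```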