-
Notifications
You must be signed in to change notification settings - Fork 15
Searching for missing genes and identifying causes for absence with tblastn
tBLASTn is a powerful tool that searches for proteins in a nucleotide sequence (such as the sequence of a whole contig) by translating it in all six possible frames (3 per strand) and then running a BLASTP against all of those translated regions. It has been used in tools such as GenePRIMP to identify and try to fix missing genes. However, it is useful to be able to identify such problems on-the-fly, and the tBLASTn program does not natively attempt to identify whether the hit regions correspond to any called proteins or (if so) how much overlap they have.
We have included a program called db_TBlastN_wrapper.py that solves these problems by generating tBLASTn databases for contigs on the fly (from a specified set of organisms to run against), running tBLASTn against a set of input proteins and looking for overlaps with existing called genes.
As an example, we searched for ribosomal protein L20 in Acetobacterium woodii (which does not have an annotated copy in the RefSeq annotation of Acetobacterium woodii), organism ID 931626.1, using the C. beijerinckii gene (Cbei_1571) as a seed.
$ db_getGenesWithAnnotation.py "L20" | grep "Cbei" | db_TBlastN_wrapper.py -o 931626.1 > tblastn_results
fig|290402.1.peg.1555 359 931626.1.NC_016894.1 Acetobacterium woodii DSM 1030 2362582 2362229 353 98.33 1e-36 132 -3 NOGENE TBLASTN_CONTIG_931626.1.NC_016894.1_START_2362582_STOP_2362229
This indicates that there is a strong hit with length 353 (to a protein with length 359) in a region with no called gene.
On the other hand, running it against C. novyi (organism ID 386415.1) finds a strong hit with 99.16% overlap with a called gene (fig|386415.1.peg.862) with a predicted function as an L20 ribosomal protein:
$ db_getGenesWithAnnotation.py "L20" | grep "Cbei" | db_TBlastN_wrapper.py -o 386415.1
fig|290402.1.peg.1555 359 386415.1.NC_008593.1 Clostridium novyi NT 965021 965377 356 99.16 2e-54 183 2 SAMESTRAND fig|386415.1.peg.862 50S ribosomal protein L20_YP_877836.1_NT01CX_1763_rplT 359 99.16 TBLASTN_CONTIG_386415.1.NC_008593.1_START_965021_STOP_965377
It is important to check four values here to ensure robustness of the interaction between the called gene and the tBLASTn hit: the overlap percentage between the hit and the called gene (99.16%), the length of the called gene (359), the length of the tBLASTn hit (359) and the length of the query gene (359). If the latter three are all close in magnitude and the overlap is high, it is almost certainly a correct call (you can also look at the E-value, 2E-54, and the bit score, 183, to help determine whether you believe it or not). However, often the overlap percentage will be low due to mistaken start site calls in the annotation or other errors, in which case the gene association cannot be trusted.
If more than one overlapping gene is found for a single tBLASTn hit, one row is created for each overlapping gene. Only overlaps > 1% of the called gene are printed by default. This cutoff can be modified using the -a flag. If the overlap between a particular hit and a called gene is less than the cutoff, INSUFFICIENT_OVERLAP is printed in place of the overlapping gene information.
We can get the amino acid sequence for the uncalled region using the file we created for tBLASTn results against Acetobacterium woodii above:
$ cat tblastn_results | db_getSequencesFromBlastResults.py -c -i 3 -s 5 -e 6 -t
We will describe the db_getSequencesFromBlastResults.py function more in the tutorial "Extracting DNA and amino acid sequences from a region of a genome, gene or protein"
This command appends the following sequence to the tBLASTn results table:
IMRIKKGVNAKKKHKKVLKLAKGFYGAKSKLYRSANEAVMRAQRSSYVGRKEKKRNFRRLWITRINAGARMYDLSYSKFMFGLKQAGVEIDRKILADLAMNDINAFKDLVEVSKKNLN
The sequence can easily be added to an alignment of L20 proteins to confirm that it is very similar along the entire length.
NOTE - This function is still somewhat of a work in progress, it gives way more false positives than we would expect due to some design issues that I need to address.
ITEP script db_findBadMutationsFromTblastn.py include crude ways to identify putative frameshift mutations, insertions and inversions with tBLASTn by identifying multiple tBLASTn hits with the same query gene but hits to different locations on the genome that are close to each other. Note that this definition can also lead to false positives if for example a gene is duplicated and the second copy is very close to the first, or multiple distantly related proteins (but still closely enough to show up in a tBLASTn search) are found in close proximity. The definitions are defined in the help text and repeated here (the user can specify the required distances)
If two hits are found within INTERVAL base pairs of each other (where INTERVAL is set by the user)
- Frameshift: The two hits are on the same strand on different frames
- Insertion: The two hits are on the same strand on the same frame
- Inversion: The two hits are on different strands
The script also identifies putative nonsense mutations if a single hit has a stop codon within a user-defined percentage of the total length of the hit.
As an example, when we ran tBLASTn with the 30S ribosomal protein S2 from Acetobacterium woodii as the query, we found that the homologous protein in C. novii is slightly shorter and that the hit goes about 10 amino acids past the stop codon, indicating a possible (small) gene truncation event or a sequencing error at that position.
$ echo "fig|931626.1.peg.1904" | db_TBlastN_wrapper.py -f ../organisms > ribosomal_tblastn
fig|931626.1.peg.1904 779 931626.1.NC_016894.1 Acetobacterium woodii DSM 1030 2300191 2300967 776 99.61 1e-169 523 1 SAMESTRAND fig|931626.1.peg.1904 30S ribosomal protein S2_YP_005269607.1_Awo_c19390_rpsB 779 99.61 TBLASTN_CONTIG_931626.1.NC_016894.1_START_2300191_STOP_2300967
fig|931626.1.peg.1904 779 386415.1.NC_008593.1 Clostridium novyi NT 1356578 1355847 731 93.84 2e-109 350 -2 SAMESTRAND fig|386415.1.peg.1248 30S ribosomal protein S2_YP_878222.1_NT01CX_2149_rpsB 701 100.00 TBLASTN_CONTIG_386415.1.NC_008593.1_START_1356578_STOP_1355847
fig|931626.1.peg.1904 779 290402.1.NC_009617.1 Clostridium beijerinckii NCIMB 8052 1409826 1410524 698 89.60 3e-105 338 3 SAMESTRAND fig|290402.1.peg.1180 30S ribosomal protein S2_YP_001308326.1_Cbei_1188_rpsB 701 99.57 TBLASTN_CONTIG_290402.1.NC_009617.1_START_1409826_STOP_141052
queryid, querylen, subcontig, organism, tblaststart, tblastend, tblastlen, queryoverlappct, evalue, bitscore, hitframe, strandedString, targetgeneid, targetannotation, targetgenelen, targetoverlappct, TBLASTN_hitID
Note that the C. novyi gene is only 701 base pairs but the tBLASTn hit was 731 bases. Searching for issues with the db_findBadMutationsFromTblastn.py reveals that there is a potential nonsense mutation here:
$ cat ribosomal_tblastn | db_findBadMutationsFromTblastn.py
386415.1.NC_008593.1 fig|931626.1.peg.1904 1356578 1355847 1356578 1355847 NONSENSE
We can get the sequence from this (since it has a contig ID, start and stop location) using db_getSequencesFromBlastResults. We use -t to turn the DNA sequence into an amino acid sequence.
$ cat ribosomal_tblastn | db_findBadMutationsFromTblastn.py | db_getSequencesFromBlastResults.py -c -i 1 -s 3 -e 4 -t
386415.1.NC_008593.1 fig|931626.1.peg.1904 1356578 1355847 1356578 1355847 NONSENSE MSIISMKQLLEAGVHFGHQTRRWNPKMAPYIFTERNGIYIIDLQKTVKKVEEAYEFVKSVVADGKEVLFVGTKKQAQEAIEEESLRSGMHFVNNRWLGGMLTNFKTIKTRINKLEQLEKMEEDGTFEVLPKKEVIKLRNEKEKLEKNLGGIKNLDASNLGAIFIVDPRKEKNAIDEAKNLGIPVVAIVDTNCDPDEIDYVIPGNDDAIRAVRLITSKIADAIIEGNQGEQLAE*FKINLLGVCD
Note the "*" 11 amino acids prior to the end of the tBLASTn hit indicating a stop codon, which is less than 10% of the protein. We can tell db_findBadMutationsFromTblastn.py to ignore stop codons that are too close to the end of the protein using the -n flag. E.g. if we don't care about the stop codon if it is less than 5% of way into the protein we use:
$ cat ribosomal_tblastn | db_findBadMutationsFromTblastn.py -n 5
This returns nothing because the stop codon is less than 5% of the way into the protein.