TP10: Searching databases using BLAST
BLAST (Basic Local Alignment Search Tool) is perhaps the Google search of biological sequences. It is probably the most widely used bioinformatics program for sequence searching. The heuristic algorithm it uses is much faster than dynamic programming approaches such as the Smith-Waterman optimal local alignment algorithm. This emphasis on speed is necessary to make the algorithm practical on the huge genome databases currently available.
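To make the cost of the exact approach concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The function name and the scoring values (match +2, mismatch -1, linear gap -2) are illustrative assumptions; real protein searches use a substitution matrix and affine gap penalties. The point is the nested loop: every residue of one sequence is compared with every residue of the other, which is what becomes too slow against a whole database.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Fills the dynamic programming table row by row and keeps only the
    previous row, so memory is O(len(b)) while time is O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two toy sequences that share the segment "ACG"
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

BLAST avoids filling this table for every database sequence by first looking for short word matches and only extending those, which is why it is a heuristic and not guaranteed to find the optimal alignment.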
The most commonly used BLAST service is the one offered by NCBI to search the GenBank database. It can be found here. Select “BLAST” from the list on the right side of the screen (Fig 1).
You should see a page similar to the one in Fig 2. There are four options: Nucleotide BLAST, blastx, tblastn and Protein BLAST. Although almost all original data in GenBank are DNA and RNA sequences, the database also contains in silico translated protein sequences. If we have a protein and are interested in similar proteins, Protein BLAST is the most convenient. However, if we have a DNA or RNA sequence, we can find similar encoded proteins with blastx.
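The choice between the four options depends only on whether the query and the database are protein or nucleotide sequences. The small helper below is a sketch of that decision; the function name is hypothetical, and the program names blastn and blastp are the command names behind Nucleotide BLAST and Protein BLAST.

```python
def choose_program(query_is_protein, database_is_protein):
    """Pick the BLAST program for a given query and database type."""
    if query_is_protein and database_is_protein:
        return "blastp"   # Protein BLAST: protein query, protein database
    if query_is_protein:
        return "tblastn"  # protein query, nucleotide database translated on the fly
    if database_is_protein:
        return "blastx"   # nucleotide query translated on the fly, protein database
    return "blastn"       # Nucleotide BLAST: nucleotide query, nucleotide database


# A DNA or RNA query searched against proteins
print(choose_program(query_is_protein=False, database_is_protein=True))
```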
Select Protein BLAST on the right of Fig 2 and run a search with the human protein sequence below:
>NP_061820.1
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLE
NPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
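The record above is in FASTA format: a header line starting with `>` followed by the sequence, which may be wrapped over several lines. As a quick sanity check before pasting it into the query box, the sketch below joins the sequence lines and reports the length. The function name `parse_fasta` is an assumption made for this example, and the parser handles a single record only.

```python
FASTA_TEXT = """>NP_061820.1
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLE
NPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
"""


def parse_fasta(text):
    """Return (identifier, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a header line starting with '>'")
    identifier = lines[0][1:].split()[0]  # first word after '>'
    sequence = "".join(lines[1:])         # re-join the wrapped sequence lines
    return identifier, sequence


identifier, sequence = parse_fasta(FASTA_TEXT)
print(identifier, len(sequence))  # prints: NP_061820.1 105
```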
The results from BLAST are always organized as a list, with the most significant similarities at the top and the less similar sequences at the end. The sequences found are also called “hits”.
There are four tabs (Descriptions, Graphic Summary, Alignments and Taxonomy) with different kinds of information about the same results (Fig 3). Each result in the Descriptions tab has seven categories of information, listed in the table below:
Column | Meaning |
---|---|
Description | The description line from the database |
Max score | The alignment score of the best match (local alignment) between the query and the database hit |
Total score | The sum of alignment scores for all matches (alignments) between the query and the database hit (if there is only one match per hit, these two scores are identical) |
Query cover | The percentage of the query sequence that is covered by the alignment(s) |
E value | The Expect value calculated from the Max score (i.e. the number of hits with that score or better you would expect to find for random reasons) |
Per. Ident | The percent identity in the alignment(s) |
Accession | The accession number of the database hit |
The Description might tell us what the similar gene is called, in this case “Homo sapiens cytochrome c”. Cytochrome c is a protein in the mitochondrial and bacterial electron transport chain.
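Three of the columns in the table are simple summaries of the underlying alignments. The sketch below shows how they could be computed; the function names and the toy alignment are made up for illustration, and percent identity is taken here as identical columns divided by the alignment length including gap columns.

```python
def max_and_total_score(alignment_scores):
    """Max score is the best single alignment; Total score sums all of them."""
    return max(alignment_scores), sum(alignment_scores)


def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * identical / len(aligned_query)


# One hit with a single alignment: Max score and Total score coincide
print(max_and_total_score([215]))

# Toy alignment with one substitution and one gap ('-') in the subject
print(round(percent_identity("MGDVEKGKK", "MGDAEKG-K"), 1))
```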
What is the Expect or E value?
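This value can be understood as the number of alignments with an equally good or better score that we would expect to find purely by chance. If we search a database of a certain size that consists of only random sequences, we would still find alignments, and these depend on chance alone. For a given alignment score, the expected number of such chance alignments increases with the size of the database and with the length of the query sequence, and it drops sharply as the score increases. A long, well-matching alignment therefore receives a much lower E value than a short one.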
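The relationship can be written down with the standard Karlin-Altschul approximation, E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below is a simplification: BLAST itself uses corrected "effective" lengths, so the numbers it reports will differ somewhat, and the lengths in the example are made-up values. For a fixed score, E is proportional to both lengths, while each additional bit of score halves it; this is why long, well-matching alignments, which reach high scores, end up with very low E values.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2 ** (-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers: a 105-residue query against a database of 1e9 residues
for bit_score in (30, 40, 50, 100):
    print(bit_score, expect_value(bit_score, 105, 1_000_000_000))
```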
A comparison can be made with the so-called “Bible code”, where words have been extracted from the text of the Bible by, for example, taking every 50th letter (Fig 4).
Statisticians have shown that if the text is sufficiently long (like the Bible or some other long text), short words or phrases are bound to appear by chance. This is why the E value is important in judging the significance of an alignment. Watch the two videos (about 5-6 min together) in Fig 5 and Fig 6.
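The same reasoning can be put into numbers. Assuming a text of random letters drawn uniformly from a 26-letter alphabet (a deliberate simplification, since real text is not uniform), the expected number of times a specific word appears is the number of possible start positions divided by 26 to the power of the word length. The function name and the text length used below are illustrative assumptions.

```python
def expected_chance_occurrences(text_length, word_length, alphabet_size=26):
    """Expected count of one specific word in uniformly random text."""
    positions = max(text_length - word_length + 1, 0)  # possible start positions
    return positions / alphabet_size ** word_length


# Hypothetical text of three million letters
for word_length in (3, 4, 6, 10):
    print(word_length, expected_chance_occurrences(3_000_000, word_length))
```

Short words are expected many times over, while long words quickly become very unlikely. This is the same trade-off between search space and match length that the E value captures.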
Question 1
Set the organism filter of the database to chimpanzee (Fig 7). Redo the search using the same query sequence as before.
How similar are the human and chimpanzee protein sequences?
Question 2
Now go back and change the filter to Saccharomyces cerevisiae (taxid:4932) and redo the analysis. Which of the cytochrome c sequences are most similar to each other?
Protein BLAST of human caspase-9 against Saccharomyces cerevisiae
If you do not find any highly similar results, you can draw the conclusion that the type of protein (family) that the sequence represents does not exist in the analyzed organism. This kind of conclusion is of course more robust if more proteins from the same family are tested.
The protein NP_127463 is human caspase-9, a protein involved in apoptosis or programmed cell death. Run a Protein BLAST search filtering for Saccharomyces cerevisiae (taxid:4932). Tip! You can enter the accession number in the Query sequence window (Fig 8). For this example, you also have to set the Expect threshold to 1000.
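The same search can also be submitted from a script. The sketch below assumes Biopython is installed and uses its `Bio.Blast.NCBIWWW.qblast` function with the `entrez_query` and `expect` parameters; the helper names and the choice of the `nr` database are assumptions made for this example, not settings taken from this exercise. The actual submission is left commented out because it needs network access and can take a while.

```python
def build_blastp_request(query, taxid, expect=10.0):
    """Collect keyword arguments for a protein search restricted to one organism."""
    return {
        "program": "blastp",
        "database": "nr",                      # assumed database
        "sequence": query,                     # sequence text or accession number
        "entrez_query": f"txid{taxid}[ORGN]",  # organism filter
        "expect": expect,                      # Expect threshold
    }


def run_search(request):
    """Submit the request to NCBI and return a handle with the XML result."""
    # Imported here so that building a request works without Biopython installed.
    from Bio.Blast import NCBIWWW
    return NCBIWWW.qblast(**request)


request = build_blastp_request("NP_127463", taxid=4932, expect=1000)
print(request)
# handle = run_search(request)  # uncomment to submit the job to NCBI
# xml_text = handle.read()
```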
Question 3
Are there any similar proteins in Saccharomyces? Is the E-value for the gene you found higher or lower than the one for cytochrome c?
Protein BLAST of human APAF-1 against Saccharomyces cerevisiae
APAF-1 is another protein involved in apoptosis or programmed cell death in humans. The accession number of the protein sequence is O14727 (a UniProtKB/Swiss-Prot accession that can also be used at NCBI).
Run the same kind of analysis as before, filtering for Saccharomyces cerevisiae (taxid:4932).
Question 4
Is the E value of the best hit for APAF-1 among the Saccharomyces cerevisiae sequences lower or higher than the limit used internally at NCBI (given in the videos in Fig 5 and Fig 6)? Do we have a probable APAF-1 homolog in the Saccharomyces cerevisiae genome?
We should be careful when inferring homology between APAF-1 and the proteins in S. cerevisiae, since the similarity only seems to hold for the C-terminal part of the protein (Fig 9). We can see that the similarity between APAF-1 (the query) and the hits starts at about position 600 and extends to about position 1200. If we look at the figure, there are two categories of results called “Specific hits” and “Super families”.
We can see that there are two features in this part of the protein called “WD40”. WD40 is a protein motif that is shared among different protein families and it is thought that the common function of all WD40-repeat proteins is coordinating protein complex assemblies.
This means that although the low E value might suggest very similar sequences, the details of the similarity mean that it is only weak evidence for homology.
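One way to catch this kind of partial similarity without looking at the figure is to compute how much of the query the alignments actually cover. The sketch below merges the aligned query ranges and reports the covered percentage; the function name is hypothetical, the range 600 to 1200 is the approximate one read from Fig 9, and the query length of 1248 residues is an assumed value that should be replaced by the length shown in the BLAST report.

```python
def query_cover(aligned_ranges, query_length):
    """Percent of query positions covered by at least one alignment.

    aligned_ranges holds (start, end) query positions, 1-based and inclusive.
    Overlapping ranges are counted only once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(aligned_ranges):
        start = max(start, current_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100.0 * covered / query_length


# Approximate range from Fig 9, with an assumed query length
print(round(query_cover([(600, 1200)], 1248), 1))
```

A hit that covers only about half of the query, and only a region made of a widespread repeat motif, is a reason to look at the alignment itself before trusting the E value.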