Update AlphaFold2_how_to_guide.md
Mhealy9999 authored Sep 18, 2024
1 parent b8bb596 commit c340b8a
Showing 1 changed file with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions protein_struct_pred/AlphaFold2_how_to_guide.md
@@ -33,7 +33,7 @@ Levinthal’s paradox: “finding the native folded state of a protein by a rand

It is within these two statements that we are introduced to the two problems in computational structure prediction, i) the structure prediction problem and ii) the structure folding problem. AF2 only addresses the first of these two problems and does not take into account the biophysical properties of individual atoms.

With this in mind the next question is, how did Google DeepMind (the company behind AF2) crack the structure prediction problem? In essence they were able to do this by looking at the co-evolution of amino acid residues. We can understand this with a simple thought experiment (illustrated below). Take two amino acids in a given protein sequence, one that is positively charged (red circles), and one negatively charged (blue circles). If these two amino acids form an interacting pair together in the protein structure, then we expect this interaction to be maintained over evolutionary time. The order of the two amino acids in the sequence can be swapped, but the overall interaction between the two will remain the same. For instance, if the positively charged amino acid is mutated to become a negatively charged amino acid, then a reciprocal mutation in its partner (i.e. mutation from a negative to a positively charged amino acid) is likely to be selected during evolution to maintain the overall interaction. In other words, if amino acids interact, they are likely to co-evolve. Likewise if two residues were unrelated to each other in the structure, we would not expect to see them co-evolve (orange and green circles). By performing very deep sequence alignments it is possible to identify patterns of co-evolution between amino acids. AF2 detects these patterns to predict which amino acids are interacting, and therefore predict which amino acids are close together in 3D space. Amazingly, AF2 is able to predict protein structures with just sequence information, and has no explicit “knowledge” of how biochemistry or protein folding works. It is worth noting that while protein structures from the PDB were used as training data for AF2, this data was used to validate AF2’s accuracy. and AF2 does not normally use protein structure data when performing a new prediction (note: there are settings you can change to adjust this).
With this in mind the next question is, how did Google DeepMind (the company behind AF2) crack the structure prediction problem? In essence they were able to do this by looking at the co-evolution of amino acid residues. We can understand this with a simple thought experiment (illustrated below). Take two amino acids in a given protein sequence, one that is positively charged (red circles), and one negatively charged (blue circles). If these two amino acids form an interacting pair in the protein structure, then we expect this interaction to be maintained over evolutionary time. The charges at the two positions can be swapped, but the overall interaction between the two will remain the same. For instance, if the positively charged amino acid is mutated to become a negatively charged amino acid, then a reciprocal mutation in its partner (i.e. mutation from a negatively to a positively charged amino acid) is likely to be selected during evolution to maintain the overall interaction. In other words, if amino acids interact, they are likely to co-evolve. Likewise, if two residues are unrelated to each other in the structure, we would not expect to see them co-evolve (orange and green circles). By performing very deep sequence alignments it is possible to identify patterns of co-evolution between amino acids. AF2 detects these patterns to predict which amino acids are interacting, and therefore which amino acids are close together in 3D space. Amazingly, AF2 is able to predict protein structures with just sequence information, and has no explicit “knowledge” of how biochemistry or protein folding works. It is worth noting that protein structures from the PDB were used to train AF2 and to validate its accuracy, but AF2 does not normally use protein structure data as input when performing a new prediction (note: there are settings you can change to adjust this).

![](images/AF2_How-to_images/Image-03.png)
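
To make the thought experiment a little more concrete, here is a minimal sketch of how a co-evolution signal can be measured in a sequence alignment. It uses mutual information between two alignment columns, which is a classical covariation statistic and is not how AF2 itself works internally (AF2 learns these patterns with a neural network). The toy alignment and the function names are made up for illustration.

```python
from collections import Counter
from math import log2


def column(msa, i):
    """Return the residues found at position i of every aligned sequence."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    High values mean that knowing the residue at i tells you a lot about
    the residue at j, which is the signature of co-evolution.
    """
    n = len(msa)
    counts_i = Counter(column(msa, i))
    counts_j = Counter(column(msa, j))
    counts_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), n_ab in counts_ij.items():
        p_ab = n_ab / n
        mi += p_ab * log2(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return mi


# Hypothetical toy alignment: positions 0 and 2 form a charged pair
# (K = positive, E = negative) whose charges swap together, while
# positions 1 and 3 vary independently of each other.
toy_msa = [
    "KAES",
    "KGET",
    "EAKT",
    "EGKS",
]

print(mutual_information(toy_msa, 0, 2))  # co-evolving pair
print(mutual_information(toy_msa, 1, 3))  # unrelated pair
```

In this toy example the charged pair shares one full bit of information while the unrelated pair shares none. Real alignments are far noisier, which is part of why very deep alignments are needed.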

@@ -60,7 +60,7 @@ For the purposes of this tutorial I will be using the [ColabFold](https://colab.
| A100 (mins) | 3 | 7 | 30 |
| Structure | ![](images/AF2_How-to_images/29_scaled.gif) | ![](images/AF2_How-to_images/17_scaled.gif) | ![](images/AF2_How-to_images/retriever_scaled.gif) |

As residue number increases there is a big divergence in time requirements so if you are just doing a few quick runs of a short protein, T4 GPUs will work. However if you want to run lots of proteins or longer proteins I would highly recommend getting a subscription (alternatively you can register with the [Australian Alphafold Service](https://www.biocommons.org.au/alphafold#:~:text=AlphaFold%20is%20an%20artificial%20intelligence,from%20its%20amino%20acid%20sequence.), established by the Australian Biocommons and hosted by Galaxy Australia to access a full HPC supported version. A help file on how to run an AF2 job on galaxy can be found here) . If you opt to get a subscription you need to tell Colab you would like to use a A100, to do this you can click on the arrow next to ![](images/AF2_How-to_images/change-runtime.png) and select “Change runtime type”.
As residue number increases there is a big divergence in time requirements, so if you are just doing a few quick runs of a short protein, T4 GPUs will work. However, if you want to run lots of proteins or longer proteins I would highly recommend getting a subscription (alternatively you can register with the [Australian AlphaFold Service](https://www.biocommons.org.au/alphafold#:~:text=AlphaFold%20is%20an%20artificial%20intelligence,from%20its%20amino%20acid%20sequence.), established by the Australian BioCommons and hosted by Galaxy Australia, to access a full HPC-supported version. A help file on how to run an AF2 job on Galaxy can be found here). If you opt to get a subscription you need to tell Colab you would like to use an A100. To do this, click on the arrow next to ![](images/AF2_How-to_images/change-runtime.png) and select “Change runtime type”.

Now back to how to run something on ColabFold. ColabFold works as a series of modules.

@@ -115,7 +115,7 @@ Looking at this plot we can see that AlphaFold2 is confident about the large maj

![](images/AF2_How-to_images/Image-08.png)

The final plot to look at if we are going to understand the structure prediction is the PAE, or predicted alignment error plot. This is a measure of the confidence of the relative positioning of amino acids within a structure. The first thing to notice is that you will always have a blue line along the diagonal of these plots as each residue in the structure has 0 relative positioning error to itself and the residues immediately adjacent to it (the orange box in Figure at the C-terminus is a good example where this is very clear, however you can see the dark blue line running diagonally across the whole plot. This feature is present in ALL PAE plots). When residues fold into a structured domain there is less error in the relative positioning of all the residues within that domain and so you get the formation of blue squares (green box highlights this in Figure X). Finally if there is an interaction between two separate strucutral domains or two regions of a protein, or even two separate proteins forming a complex, you will see low relative error on the diagonals corresponding to the regions that interact (purple box in Figure X shows an example of this interaction from the large structured domain from residues 1-\~400 and a small region right at the C-terminus). In this particular case the interaction along the diagonal proved to be quite important for the function of SNX17, so if you would like to explore how these predictions can start to explain protein function you can read about that [here](https://www.nature.com/articles/s41467-024-50971-0).
The final plot to look at if we are going to understand the structure prediction is the PAE, or predicted alignment error, plot. This is a measure of the confidence in the relative positioning of amino acids within a structure. The first thing to notice is that you will always have a blue line along the diagonal of these plots, as each residue in the structure has 0 relative positioning error to itself and very low error to the residues immediately adjacent to it (the orange box in the figure below is a good example where this is very clear; however, you can see the dark blue line running diagonally across the whole plot. This feature is present in ALL PAE plots). When residues fold into a structured domain there is less error in the relative positioning of all the residues within that domain, and so you get the formation of blue squares (the green box in the figure below). Finally, if there is an interaction between two separate structural domains or two regions of a protein, or even two separate proteins forming a complex, you will see low relative error in the off-diagonal areas corresponding to the regions that interact (the purple box in the figure below shows an example of this interaction between the large structured domain from residues 1-\~400 and a small region right at the C-terminus). In this particular case the off-diagonal interaction proved to be quite important for the function of SNX17, so if you would like to explore how these predictions can start to explain protein function you can read about that [here](https://www.nature.com/articles/s41467-024-50971-0).

![](images/AF2_How-to_images/Image-09.png)

@@ -124,17 +124,17 @@ To highlight the interpretation of PAE plots further, let's also look at *DNA d
![](images/AF2_How-to_images/Image-10.png)


You can see the model comprise two structured domains with high confidence (dark blue) and some low confidence disordered regions (orange). Now if we were just presented with this structure we might be tempted to think that these two domains (one which is at the N-terminus and one near the C-terminus) are forming an intramolecular interaction as AF has packed them very close together. But one quick look at the PAE plot (Figure X) will show that there is no data suggesting that these two domains interact together . You can see we get the line down the middle as per usual and then two boxes at either end of the protein indicating the two discrete structured domains. In particular the lack of blue in the top right corner and bottom left of the PAE plots (highlighted by the green boxes) indicates that AF2 has no confidence in the relative positioning of these two domains with respect to each other. Instead the closeness of the two domains is likely an artefact of a volume minimisation function within AF used to increase computational speed, or to put it another way AF2 compresses structures into the smallest volume possible to increase computational efficiency. The correct interpretation of this AF model is that this protein comprises two non-interacting structured domains connected by a long disordered region.
You can see this model is predicted with high confidence to contain two structured domains (dark blue) and some low confidence disordered regions (orange). Now if we were just presented with this structure we might be tempted to think that these two domains (one at the N-terminus and one near the C-terminus) are forming an intramolecular interaction, as AF2 has packed them very close together. But one quick look at the PAE plot (Figure X) will show that there is no data suggesting that these two domains interact. You can see we get the line down the middle as usual and then two boxes at either end of the protein indicating the two discrete structured domains. In particular, the lack of blue in the top right and bottom left corners of the PAE plot (highlighted by the green boxes) indicates that AF2 has no confidence in the positioning of these two domains relative to each other. Instead, the closeness of the two domains is likely an artefact of a volume minimisation function within AF2 used to increase computational speed, or to put it another way, AF2 compresses structures into the smallest volume possible to increase computational efficiency. The correct interpretation of this AF2 model is that this protein comprises two non-interacting structured domains connected by a long disordered region.
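
The same check can be done numerically rather than by eye. The sketch below is only an illustration: it assumes the PAE matrix has already been loaded as a list of lists (one row per residue, values in Å), and the toy matrix, the domain boundaries and the function name are all made up. It compares the average PAE within a domain against the average PAE between two domains.

```python
def mean_pae(pae, rows, cols):
    """Average predicted alignment error over a rectangular block of the matrix."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


# Hypothetical toy protein of 6 residues with two domains (residues 0-2 and 3-5).
# PAE is low inside each domain and high between them, mimicking the two blue
# squares with no blue in the top right and bottom left corners described above.
domain_a = range(0, 3)
domain_b = range(3, 6)
toy_pae = [
    [0.0 if i == j else 2.0 if (i < 3) == (j < 3) else 25.0 for j in range(6)]
    for i in range(6)
]

within_a = mean_pae(toy_pae, domain_a, domain_a)
between = mean_pae(toy_pae, domain_a, domain_b)

print(f"mean PAE within domain A: {within_a:.1f}")
print(f"mean PAE between A and B: {between:.1f}")
```

A low within-domain average combined with a high between-domain average matches the interpretation above: two confidently predicted domains whose relative placement should not be trusted. What counts as "low" or "high" depends on the protein, so treat any cut-off as a rule of thumb rather than a fixed threshold.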

Below is with other proteins and their PAE plots, clicking on the protein name will redirect you to the AlphaFold database which has a great interactive PAE tool. Clicking and dragging a box around an area of the PAE plot will lead to the same area being highlighted in the structure to the left. Where possible I have also included a reference to a paper which has used AlphaFold2 to describe the protein. Note the colours are slightly different with green showing low positional error and white showing high positional error.
Below is a table with other proteins and their PAE plots. Clicking on the protein name will redirect you to the AlphaFold database, which has a great interactive PAE tool. Clicking and dragging a box around an area of the PAE plot will lead to the same area being highlighted in the structure to the left. Where possible I have also included a reference to a paper which has used AlphaFold2 to describe the protein. Note the colours are slightly different, with green showing low positional error and white showing high positional error.

| Protein | Brief description | Reference |
| :---: | :---- | :---: |
| :----: | :---- | :---: |
| [SNX13](https://alphafold.ebi.ac.uk/entry/Q9Y5W8) | A four-domain protein where the N- and C-termini form an intramolecular interaction. | [https://doi.org/10.3389/fcell.2022.826688](https://doi.org/10.3389/fcell.2022.826688) |
| [PDLIM1](https://alphafold.ebi.ac.uk/entry/O00151) | A protein with two domains, one at the N-terminus and one at the C-terminus, connected by a long disordered linker. | [https://doi.org/10.1042/BST20220804](https://doi.org/10.1042/BST20220804) |
| [VPS35](https://alphafold.ebi.ac.uk/entry/Q96QK1) | Alpha-helices stacked on top of one another; this is known as a HEAT repeat protein. | |
| [SNX17](https://alphafold.ebi.ac.uk/entry/Q15036) | A protein with two closely associated domains and a disordered tail which makes contact with these domains. | [https://doi.org/10.1038/s41467-024-50971-0](https://doi.org/10.1038/s41467-024-50971-0) |
| [CCDC22](https://alphafold.ebi.ac.uk/entry/O60826) | This protein has a N-terminal structure domain followed by a disordered domain (although note there is some confidence in the positioning of this disorder) and a long alpha-helix. This structure of this protein becomes more rigid when incorporated into the large CCC complex. | [https://doi.org/10.1016/j.cell.2023.04.003](https://doi.org/10.1016/j.cell.2023.04.003) |
| [CCDC22](https://alphafold.ebi.ac.uk/entry/O60826) | This protein has an N-terminal structured domain followed by a disordered domain (although note there is some confidence in the positioning of this disorder) and a long alpha-helix. The structure of this protein becomes more rigid when incorporated into the large CCC complex. | [https://doi.org/10.1016/j.cell.2023.04.003](https://doi.org/10.1016/j.cell.2023.04.003) |
| [I7ME23](https://www.alphafold.ebi.ac.uk/entry/I7ME23) | A protein with two domains connected by a short flexible linker. | [https://doi.org/10.1038/s41467-023-37868-0](https://doi.org/10.1038/s41467-023-37868-0) |
| [P51513](https://www.alphafold.ebi.ac.uk/entry/P51513) | A protein with three soluble domains connected via a long flexible linker. Two of the domains interact. | [https://doi.org/10.1016/j.str.2022.08.004](https://doi.org/10.1016/j.str.2022.08.004) |
