This repository contains the code for implementing Sacha Raffaud's IRP project titled "Diffusion Models for Optimised Geometry Prediction".
In this project, transition state optimisation involves generating accurate 3D representations of transition states. This is achieved by using reactant and product coordinates along with atom types. Optionally, reaction graphs can also be used as input, currently available with the initial W93 Dataset.
EGNNs⁴ are graph neural networks (GNNs) that are equivariant to E(n) transformations such as rotations and translations: if the input is transformed before being fed into the network, the output is transformed in exactly the same way.
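As a toy illustration of equivariance (not the EGNN architecture itself), centring a set of coordinates commutes with rotation, so rotating the input and then applying the operation gives the same result as applying the operation first and rotating the output:

```python
import numpy as np

def center_coords(x):
    """A trivially rotation-equivariant operation: subtract the centroid.

    Because the centroid rotates with the points, centring commutes
    with any orthogonal transformation of the coordinates.
    """
    return x - x.mean(axis=0)

# Random orthogonal matrix via QR decomposition.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

x = rng.normal(size=(7, 3))                 # 7 "atoms" in 3D
rotate_after = center_coords(x) @ Q.T       # f(x), then rotate
rotate_before = center_coords(x @ Q.T)      # rotate, then f(x)
assert np.allclose(rotate_after, rotate_before)  # equivariance holds
```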
After establishing a denoising framework, it can be applied iteratively in a diffusion model⁵. Diffusion models combine a forward process, in which noise is incrementally added to the input sample until it conforms to an isotropic Gaussian distribution, with a learnable reverse process, in which noise is progressively removed to reconstruct the original data. This iterative reverse process enables high-quality data generation and is what we optimise to produce accurate transition states.
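The forward (noising) process can be sketched in a few lines. This is a generic DDPM-style formulation; the variable names and the linear beta schedule below are illustrative, not taken from this codebase:

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) for a DDPM-style forward process.

    x0        : clean coordinates, shape (n_atoms, 3)
    t         : diffusion step index
    alpha_bar : cumulative product of (1 - beta_s) for each step
    """
    eps = np.random.randn(*x0.shape)  # isotropic Gaussian noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps                    # the denoiser learns to predict eps

# Toy usage: a linear beta schedule over 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
x0 = np.zeros((5, 3))                           # 5 "atoms" at the origin
xt, eps = forward_diffuse(x0, 999, alpha_bar)   # near step T, xt is almost pure noise
```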
For detailed background and results, please refer to Sacha's thesis.
- Adapt an EGNN network for denoising.
- Implement this network within an equivariant diffusion process.
- Enhance reactant/product information with additional chemical context.
- Conduct ablation studies on the necessity of product information.
- Explore supplementing or replacing reactant and product information with reaction graphs.
Three main datasets were employed in this project for comprehensive comparisons: W93, TX1, and RGD1 datasets. All datasets used DFT to generate the transition states.
W93 Dataset - Elementary Reactions of Gas-Phase Compounds¹
- Initial reaction conformation dataset with transition states.
- Around 12,000 high precision samples.
- Used with the TS-Diff model.
TX1 Dataset - Transition X Dataset²
- Built upon W93 with re-optimised transition states.
- Represents an improved version of W93.
- Used with the OA-ReactDiff model.
RGD1 Dataset - Reaction Graph Depth One Dataset³
- New dataset with over 176,000 samples.
- Features multiple transition state conformations.
- Includes larger molecules and offers new insights.
The primary dataset used is W93, comprehensively tested with PyTest. The other datasets have limited tests due to large .h5 files.
The main source files are located in the `src` directory, containing the following essential files:
- `train_test.py`: Script for training and testing diffusion models.
- `lightning_setup.py`: PyTorch Lightning class for diffusion models.
- `evaluate_samples.py`: Script for evaluating generated samples.
Subdirectories within `src` include `Diffusion` and `EGNN`, housing the backbones for the respective models. All dataset classes and setup files are in the `data` directory.
Documentation can be found in the `docs` directory within a file called `TS-DiffuGen-Documentation.pdf`.
To facilitate seamless model training and testing, all operations are conducted through configuration files. Below is a brief overview of the various parameters that can be used within each diffusion model:
- `train_test`: 'Train' or 'Test': controls whether a diffusion model should be trained or tested.
- `use_graph_in_model`: Boolean: determines whether a reaction graph is integrated into the model.
- `dataset_to_use`: 'W93', 'TX1', or 'RGD1': specifies the dataset for training/sampling.
- `timesteps`: int: number of diffusion steps to employ.
- `noise_schedule`: 'sigmoid_2', 'sigmoid_5', or 'cosine': the noise schedule to employ.
- `remove_hydrogens`: Boolean: determines whether hydrogens are removed.
- `random_rotations`: Boolean: dictates whether random rotations are applied during training.
- `augment_train_set`: Boolean: controls whether the training set is augmented by replacing reactants with products.
- `include_context`: None, 'Nuclear_Charges', 'Activation_Energy', or 'Van_Der_Waals': the type of context to incorporate.
- `remove_product`: Boolean: controls whether product coordinates are removed from the input.
- `lr`: float: learning rate.
- `epochs`: int: number of training epochs.
- `batch_size`: int: batch size for processing.
- `learning_rate_scheduler`: Boolean: determines whether a learning-rate scheduler is used.
- `model_name`: str: model name for identification with WandB during training.
- `folder_name`: str: name of the folder within the `trained_models` directory.
- `wandb_project_name`: str: name of the project in WandB.
- `n_layers`: int: number of EGNN layers in the diffusion model.
- `hidden_features`: int: size of the embedding for hidden node features.
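The `noise_schedule` parameter offers a 'cosine' option. A common cosine schedule parameterisation (from Nichol & Dhariwal's improved DDPM work) is sketched below; whether this codebase uses this exact form is an assumption:

```python
import numpy as np

def cosine_alpha_bar(timesteps, s=0.008):
    """Cumulative signal level alpha_bar(t) for a cosine noise schedule.

    Returns an array of length timesteps + 1 that decays smoothly
    from 1 (clean data) toward 0 (pure noise).
    """
    t = np.linspace(0, timesteps, timesteps + 1) / timesteps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]  # normalise so alpha_bar(0) = 1

ab = cosine_alpha_bar(1000)  # monotonically decreasing signal level
```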
Example configuration files are available in the `configs` directory.
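For illustration, a hypothetical training configuration combining the parameters above might look like this (all values are examples only, not recommended settings; consult the files in the `configs` directory for the exact expected keys):

```yaml
# Illustrative values only -- see the configs directory for real examples.
train_test: "Train"
use_graph_in_model: False
dataset_to_use: "W93"
timesteps: 1000
noise_schedule: "sigmoid_2"
remove_hydrogens: False
random_rotations: False
augment_train_set: True
include_context: Null
remove_product: False
lr: 0.0001
epochs: 100
batch_size: 64
learning_rate_scheduler: False
model_name: "example_run"
folder_name: "example_run"
wandb_project_name: "ts_diffugen"
n_layers: 8
hidden_features: 64
```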
Each Python file and script includes a header at the top, providing information about its contents. Additionally, there's a reference to any adaptations made from previous codebases.
Note: A GPU is HIGHLY recommended as these diffusion models are compute intensive.
The TS-DiffuGen package has been thoroughly tested on the following system configuration:
| Operating System | Python Version | PyTorch Lightning Version | PyTorch CUDA Version |
|---|---|---|---|
| Ubuntu | 3.10.10 | 2.7.7 | 2.0.1+cu117 |
- Clone the repository: `git clone https://github.com/schwallergroup/TS-DiffuGen.git`
- Navigate to the repository root: `cd TS-DiffuGen`
- Create the Conda environment (~4 minutes due to large imports): `conda env create -f environment.yml`
- Activate the environment: `conda activate tsdiff`
- Install the package: `python setup.py install`
- Download the datasets and enjoy the package!
To set up the W93 dataset, follow these steps:
- Download the compressed tar file `wb97xd3.tar.gz` from the following link: W93 Dataset Link
- Place the downloaded file in the `data/Dataset_W93/data/w93_dataset/` directory.
- Create a `TS` directory with the following command: `mkdir data/Dataset_W93/data/TS`
- Uncompress the .tar file into the `TS` directory: `tar -xvf data/Dataset_W93/data/w93_dataset/wb97xd3.tar.gz -C data/Dataset_W93/data/TS/`
- Run the `setup_dataset_files.py` script to process and organise the dataset (takes ~3 minutes): `python data/Dataset_W93/setup_dataset_files.py`
To set up the TX1 dataset, follow these steps:
- Download the Transition1x.h5 file from the following link: TX1 Dataset Link
- Place the file in the `data/Dataset_TX1` directory.
To set up the RGD1 dataset, follow these steps:
- Download the RGD1_CHNO.h5 file from the following link: RGD1 Dataset Link
- Place the file in the `data/Dataset_RGD1` directory.
- Run the `parse_data.py` script with the following command: `python data/Dataset_RGD1/parse_data.py`
The training of diffusion models is enhanced with the integration of Weights and Biases (WandB), which enables real-time, continuous monitoring of the training process. If WandB is not yet set up in your environment, you will be asked to enter a WandB API key when you train your first diffusion model.
If you do not already have an account, you can make one through the following link: WandB
- Adjust the parameters in the `configs/train_diffusion.yml` configuration file to your liking.
- Run the following command to train a new diffusion model: `python src/train_test.py --config configs/train_diffusion.yml`
As with training, testing is driven by a config file; it should point to the location of the trained diffusion model, specify the number of samples to generate, and match the parameters used during training. Samples from the test set can then be generated with the following command: `python src/train_test.py --config configs/test_diffusion.yml`
The `pre-trained_graph` model was trained with the following parameters:
- Uses Reaction Graphs
- Does not use Product Coordinates
- Sampling Steps: 1,000
- EGNN Layers: 8
- Hidden Features: 64
- Noise Schedule: Sigmoid_2
The `pre-trained_simple` model was trained with the following parameters:
- Dataset: TX1
- Without Reaction Graphs
- Uses Product Coordinates
- Sampling Steps: 2,000
- EGNN Layers: 8
- Hidden Features: 64
- Noise Schedule: Sigmoid_2
You can use these pre-trained models to generate samples using the provided commands:
- Run the following script with the chosen testing config file: `python src/train_test.py --config configs/test_pre_trained_diffusion_simple.yml` or `python src/train_test.py --config configs/test_pre_trained_diffusion_with_graphs.yml`
- Samples from the test set will be generated within the chosen model's `Samples` directory. This should take around 2 hours for the whole test set.
All generated samples should be created within a designated `Samples` directory. A path to this directory is passed to the evaluation script.
The evaluation script calculates the COV (Coverage) and MAT (Matching) scores for the generated samples, using thresholds of 0.1 and 0.2 Å. Formulas for these metrics can be found in Sacha's thesis.
The evaluation script can be executed with the following command:
`python src/evaluate_samples.py PATH_TO_SAMPLES_DIRECTORY`
This command will trigger the evaluation process and display the computed COV and MAT scores on the screen.
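Under the standard conformer-matching definitions (an assumption here; the exact formulas are in the thesis), COV is the fraction of reference transition states matched by at least one generated sample within the RMSD threshold, and MAT is the mean over references of the minimum RMSD. A sketch, assuming a precomputed RMSD matrix:

```python
import numpy as np

def cov_mat(rmsd, threshold):
    """COV and MAT scores from a precomputed RMSD matrix.

    rmsd      : array of shape (n_references, n_samples); rmsd[i, j] is the
                RMSD between reference TS i and generated sample j
    threshold : coverage threshold in Angstroms (e.g. 0.1 or 0.2)
    """
    best = rmsd.min(axis=1)                  # best-matching sample per reference
    cov = float((best < threshold).mean())   # fraction of references covered
    mat = float(best.mean())                 # mean best-match RMSD
    return cov, mat

# Toy usage: 3 reference transition states, 2 samples each.
rmsd = np.array([[0.05, 0.30],
                 [0.25, 0.15],
                 [0.40, 0.35]])
cov, mat = cov_mat(rmsd, threshold=0.2)  # best per reference: 0.05, 0.15, 0.35
```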
To use PyMol for visualisation:
- Ensure PyMol is installed on your desktop along with the appropriate license. You can download it from this link.
- Place the PyMol script in the designated folder and execute it using the PyMol GUI.
Testing for this project has been conducted using the PyTest framework. Thorough testing has been performed on the W93 dataset; testing for the other two datasets requires downloading their large `.h5` files.
All critical functions, classes, and methods from the various scripts have been rigorously tested and are located in the `/tests` directory. These tests are expected to pass successfully as part of the repository's workflow. To run them on your local machine, execute: `pytest tests/`
Footnotes

1. Grambow, C. A., Pattanaik, L., & Green, W. H. (2020). "Reactants, products, and transition states of elementary chemical reactions based on quantum chemistry." Link ↩
2. Schreiner, M., Bhowmik, A., Vegge, T., Busk, J., & Winther, O. (2022). "Transition1x - a dataset for building generalisable reactive machine learning potentials." Scientific Data, 9(1), 779. Link ↩
3. Zhao, Q., Vaddadi, S. M., Woulfe, M., Ogunfowora, L. A., Garimella, S. S., Isayev, O., & Savoie, B. M. (2023). "Comprehensive exploration of graphically defined reaction spaces." Scientific Data, 10(1), 145. Link ↩
4. Satorras, V. G., Hoogeboom, E., & Welling, M. (2021). "E(n) Equivariant Graph Neural Networks." arXiv. ↩
5. Hoogeboom, E., Satorras, V. G., Vignac, C., & Welling, M. (2022). "Equivariant Diffusion for Molecule Generation in 3D." arXiv. ↩