Implement, train, tune, and evaluate a transformer model for antibody classification with this step-by-step code.
This project provides a step-by-step guide to implementing a transformer model for protein data, covering training, hyperparameter tuning, and evaluation.
Highlights
- Hands-on Transformer Implementation: Follow along with code to build a transformer-based antibody classifier.
- Optimize Performance: Explore hyperparameter tuning techniques to improve the model's accuracy.
- Evaluation: Assess the model's generalization ability and gain insights into its performance on a hold-out test dataset.
Clone the repo:

```shell
git clone https://github.com/naity/protein-transformer.git
```

The `requirements.txt` file lists the Python packages that need to be installed in order to run the scripts. Please use the command below for installation:

```shell
pip install -r requirements.txt
```
In this project, we will implement, train, optimize, and evaluate a transformer-based model for antibody classification. The data has been preprocessed and formatted as a binary classification problem with a balanced number of samples in each class. Processed datasets are stored in the `data/` directory: `bcr_train.parquet` is used for training and tuning, while `bcr_test.parquet` is the hold-out test dataset. For details on the preprocessing steps, please refer to the `notebooks/bcr_preprocessing.ipynb` notebook.
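Before reaching the model, each antibody sequence has to be converted into integer token IDs. The sketch below shows what such an encoding can look like; the vocabulary, padding scheme, and `max_len` are illustrative assumptions, not the repo's actual preprocessing (see `notebooks/bcr_preprocessing.ipynb` for that):

```python
# Illustrative sketch: map an amino-acid sequence to fixed-length token IDs.
# The vocabulary and padding choices below are assumptions for demonstration.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD_ID = 0
# Reserve 0 for padding; residues map to 1..20.
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq: str, max_len: int = 16) -> list[int]:
    """Truncate or pad a sequence to max_len integer token IDs."""
    ids = [VOCAB[aa] for aa in seq[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))

tokens = encode("CARDYW")  # a short CDR3-like fragment
```

A batch of such fixed-length ID lists is what the embedding layer of a transformer consumes.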
1. Running the `train.py` Script

See the table below for key parameters when running the `train.py` script. For a full list of options, run:

```shell
python protein_transformer/train.py --help
```
| Parameter | Description | Default |
|---|---|---|
| `--run-id` | Unique name for the training run | None (Required) |
| `--dataset-loc` | Path to the dataset in parquet format | None (Required) |
| `--val-size` | Proportion of the dataset for validation | 0.15 |
| `--embedding-dim` | Dimensionality of token embeddings | 64 |
| `--num-layers` | Number of Transformer encoder layers | 8 |
| `--num-heads` | Number of attention heads in the encoder | 2 |
| `--ffn-dim` | Dimensionality of the feed-forward layer in the encoder | 128 |
| `--dropout` | Dropout probability for regularization | 0.05 |
| `--batch-size` | Number of samples per batch for each worker | 32 |
| `--lr` | Learning rate for the optimizer | 2e-5 |
| `--num-epochs` | Number of epochs for training | 20 |
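The `--embedding-dim` flag sets the width of the token embeddings, to which the encoder adds position information before attention. As background, here is a pure-Python sketch of the standard sinusoidal positional encoding from the original Transformer paper, using the default dimensionality of 64. Whether this repo uses sinusoidal or learned positional embeddings is not stated here, so treat this purely as an illustration:

```python
import math

def sinusoidal_positions(seq_len: int, dim: int = 64) -> list[list[float]]:
    """Standard sinusoidal positional encoding: even indices use sine,
    odd indices use cosine, with geometrically spaced wavelengths."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positions(seq_len=8, dim=64)  # one row per token position
```

Each row of `pe` would be added element-wise to the corresponding token embedding, giving the attention layers access to token order.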
For example, to execute the training script with default parameters and store the results under a run ID named `train01`, use the following command:

```shell
python protein_transformer/train.py --run-id train01 --dataset-loc data/bcr_train.parquet
```

Upon completion, the script stores training results in the `runs/train01` directory by default. This includes model arguments, the best-performing model (based on validation loss), training and validation loss records, and validation metrics for each epoch. These metrics, listed below, are saved in the `runs/train01/results.csv` file:
Accuracy: 0.727
AUC score: 0.851
Precision: 0.734
Recall: 0.727
F1-score: 0.725
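These are standard binary-classification metrics. As a quick reference, here is how the core four can be computed from labels and predictions in plain Python; note the script likely uses a library implementation (and possibly weighted averaging, which is why precision and recall can differ even on a binary task), so this sketch is for intuition only:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

m = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
```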
2. Running the `tune.py` Script

See the table below for key parameters when running the `tune.py` script. For a full list of options, run:

```shell
python protein_transformer/tune.py --help
```
| Parameter | Description | Default |
|---|---|---|
| `--run-id` | Unique name for the hyperparameter tuning run | None (Required) |
| `--dataset-loc` | Absolute path to the dataset in parquet format | None (Required) |
| `--val-size` | Proportion of the dataset for validation | 0.15 |
| `--num-classes` | Number of final output dimensions | 2 |
| `--batch-size` | Number of samples per batch for each worker | 32 |
| `--num-epochs` | Number of epochs for training (per trial) | 30 |
| `--num-samples` | Number of trials for tuning | 100 |
| `--gpu-per-trial` | Number of GPUs to allocate per trial | 0.2 |
- Note: The `--dataset-loc` parameter must be specified as an absolute path.
For example, to initiate the tuning process with default parameters and store the results under a run ID named `tune01`, execute the `tune.py` script from the project root directory:

```shell
python protein_transformer/tune.py --run-id tune01 --dataset-loc /home/ytian/github/protein-transformer/data/bcr_train.parquet
```
By default, it will execute 100 trials with different parameter combinations, running each trial for up to 30 epochs. Ray Tune applies early stopping to unpromising trials, allowing efficient exploration of the hyperparameter space and focusing resources on better-performing configurations. It tracks the results of each trial, and upon completion, the best-performing model (based on validation loss) is saved in the `runs/tune01` directory by default. Additionally, tuning logs, including results from each trial, are stored in the same `runs/tune01` directory for easy access and analysis.
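The early-stopping idea behind schedulers like Ray Tune's ASHA is to run every trial briefly, then repeatedly keep only the better-performing fraction and grant the survivors a larger epoch budget. A toy pure-Python sketch of that idea follows; this is not Ray's API, and the objective function is simulated (it assumes a best learning rate of 2e-5 purely for illustration):

```python
import random

def successive_halving(configs, train_step, rounds=3, keep_frac=0.5):
    """Run all configs on a small budget, then repeatedly keep the best
    fraction and double the budget for the survivors."""
    survivors = list(configs)
    budget = 1
    for _ in range(rounds):
        scored = [(train_step(cfg, budget), cfg) for cfg in survivors]
        scored.sort(key=lambda x: x[0])  # lower validation loss is better
        keep = max(1, int(len(scored) * keep_frac))
        survivors = [cfg for _, cfg in scored[:keep]]
        budget *= 2
    return survivors[0]

def train_step(cfg, budget):
    # Simulated objective: loss shrinks with budget and grows with the
    # distance of the learning rate from an assumed optimum of 2e-5.
    return abs(cfg["lr"] - 2e-5) * 1e4 + 1.0 / budget

random.seed(0)
candidates = [{"lr": random.uniform(1e-6, 1e-3)} for _ in range(8)]
best = successive_halving(candidates, train_step)
```

The real scheduler works asynchronously across parallel trials, but the budget-doubling and pruning logic is the same in spirit.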
3. Running the `evaluate.py` Script

See the table below for key parameters when running the `evaluate.py` script. For a full list of options, run:

```shell
python protein_transformer/evaluate.py --help
```
| Parameter | Description | Default |
|---|---|---|
| `--run-dir` | Path to the output directory for a training or tuning run | None (Required) |
| `--dataset-loc` | Path to the test dataset in parquet format | None (Required) |
| `--batch-size` | Number of samples per batch | 64 |
For example, to evaluate the best model from the `tune01` run on the hold-out test dataset, execute the following command:

```shell
python protein_transformer/evaluate.py --run-dir runs/tune01 --dataset-loc /home/ytian/github/protein-transformer/data/bcr_test.parquet
```

Upon completion, the script saves test metrics in a file named `test_metrics.json` within the run directory passed to `evaluate.py`, like the following example:
Accuracy: 0.761
AUC score: 0.837
Precision: 0.761
Recall: 0.761
F1-score: 0.761
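The AUC score above is the area under the ROC curve, which equals the probability that a randomly chosen positive example receives a higher model score than a randomly chosen negative one. A minimal pure-Python sketch of that pairwise definition (library implementations handle ties and large inputs far more efficiently):

```python
def roc_auc(y_true, y_score):
    """AUC as the probability a positive outranks a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```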
- Data Processing
- Model Implementation
- Training
- Hyperparameter Tuning
- Evaluation
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the Apache License. See `LICENSE.txt` for more information.