This is the official codebase for methylGPT: a foundation model for the DNA methylome.
UPDATE:
- [2025.02.10] methylGPT is now available on PyPI.
- [2024.12.10] Initial launch of the methylGPT codebase.
- [2024.11.04] Manuscript available on bioRxiv.
methylGPT works with Python >= 3.9.10 and R >= 3.6.1. Please make sure you have the correct versions of Python and R installed before installation.
methylGPT is available on PyPI. To install methylGPT, run the following command:
pip install methylgpt "flash-attn<1.0.5"  # flash-attn is optional but recommended
[Optional] We recommend using wandb for logging and visualization.
pip install wandb
For development, we use the Poetry package manager. To install Poetry, follow the instructions here.
$ git clone this-repo-url
$ cd MethylGPT_clean
$ poetry install
Note: The flash-attn dependency usually requires a specific GPU and CUDA version. If you encounter any issues, please refer to the flash-attn repository for installation instructions. For now (May 2023), we recommend using CUDA 11.7 and flash-attn<1.0.5 due to various reported issues with installing newer versions of flash-attn.
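If you are unsure whether your environment matches, a quick check with standard PyTorch calls (nothing methylGPT-specific) can help diagnose flash-attn build failures before installing:

```python
import torch

# Report the CUDA toolkit PyTorch was built against and whether a GPU is visible.
# flash-attn must be compiled against a compatible toolkit, so a mismatch here
# usually explains installation or build failures.
print("PyTorch version:", torch.__version__)
print("CUDA toolkit (PyTorch build):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
```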
The primary pretraining code is implemented in `methylgpt.pretraining.py`. During training, model checkpoints are automatically saved to the `save/` directory at the end of each epoch.
For a detailed walkthrough of the pretraining process, refer to our step-by-step examples in the pretraining tutorials.
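As a minimal sketch for inspecting those checkpoints (assuming they are ordinary torch-serialized .pt files, which this README does not specify), the latest epoch's weights could be loaded like this:

```python
from pathlib import Path

import torch

# Assumption: checkpoints in save/ are plain torch-serialized files whose names sort
# in epoch order (e.g. epoch_000.pt, epoch_001.pt, ...). Adjust the glob to match
# the actual naming used by methylgpt.pretraining.py.
save_dir = Path("save")
checkpoints = sorted(save_dir.glob("*.pt"))
if not checkpoints:
    raise FileNotFoundError(f"No checkpoints found in {save_dir.resolve()}")

latest = checkpoints[-1]
state_dict = torch.load(latest, map_location="cpu")
print(f"Loaded {latest.name} with {len(state_dict)} top-level entries")
```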
This repository provides access to our suite of pretrained models for DNA methylation analysis. The major data sources for pretraining are derived from a comprehensive collection of human DNA methylation profiles.
We collected a total of 226,555 human DNA methylation profiles aggregated from 5,281 datasets through two complementary resources: EWAS Data Hub and Clockbase.
| Data Sources | Datasets (Combined) | DNA Methylation Profiles | Description | Links |
|---|---|---|---|---|
| EWAS Data Hub & Clockbase (Combined) | 5,281 | 226,555 | Aggregated high-quality human DNA methylation profiles curated for pretraining purposes. | EWAS Data Hub • Clockbase |
Our current suite of pretrained models includes the following architectures:
| Model | Hyperparameters | Parameters |
|---|---|---|
| methylGPT-tiny | emb-dim: 64, layers: 6, heads: 4 | 3M |
| methylGPT-small | emb-dim: 128, layers: 6, heads: 4 | 7M |
| methylGPT-normal | emb-dim: 256, layers: 6, heads: 4 | 15M |
- Recommended model: We suggest using the methylGPT-normal model for most applications unless computational constraints require a lighter model (see the configuration sketch after this list).
- Checkpoint folders: We don't provide checkpoints yet. Once released, each model checkpoint will be provided along with a paired vocabulary file mapping gene names to IDs.
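For convenience, the table above can be captured as a small lookup when selecting a model size in your own scripts. The dictionary below is purely illustrative; the key names (emb_dim, n_layers, n_heads) are our own and not an official methylGPT configuration schema:

```python
# Illustrative only: hyperparameters copied from the table above.
# Key names (emb_dim, n_layers, n_heads) are not an official methylGPT config format.
METHYLGPT_CONFIGS = {
    "methylGPT-tiny":   {"emb_dim": 64,  "n_layers": 6, "n_heads": 4},  # ~3M parameters
    "methylGPT-small":  {"emb_dim": 128, "n_layers": 6, "n_heads": 4},  # ~7M parameters
    "methylGPT-normal": {"emb_dim": 256, "n_layers": 6, "n_heads": 4},  # ~15M parameters
}

# Default to methylGPT-normal unless compute is constrained, per the recommendation above.
config = METHYLGPT_CONFIGS["methylGPT-normal"]
print(config)
```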
Please see our example code in tutorials/finetuning_age_prediction.
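For orientation only: fine-tuning for age prediction amounts to attaching a small regression head to the pretrained model's per-sample embeddings and training with a regression loss. The sketch below is plain PyTorch with placeholder tensors standing in for methylGPT embeddings; it is not the tutorial's actual API, which lives in tutorials/finetuning_age_prediction.

```python
import torch
import torch.nn as nn

class AgeRegressionHead(nn.Module):
    """Maps a pooled methylation embedding to a single predicted age (illustrative sketch)."""

    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.head(embedding).squeeze(-1)

# Placeholder tensors standing in for pretrained methylGPT sample embeddings and ages.
embeddings = torch.randn(8, 256)   # batch of 8 sample embeddings (emb_dim=256)
ages = torch.rand(8) * 80 + 10     # fake chronological ages in years

model = AgeRegressionHead(emb_dim=256)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(5):              # a few toy optimization steps
    optimizer.zero_grad()
    loss = loss_fn(model(embeddings), ages)
    loss.backward()
    optimizer.step()
    print(f"step {step}: MSE = {loss.item():.2f}")
```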
- Upload the pretrained model checkpoint
- Publish to PyPI
- Provide the pretraining code with generative attention masking
- More tutorial examples for disease prediction
- Publish to the Hugging Face model hub
We greatly welcome contributions to methylGPT. Please submit a pull request if you have any ideas or bug fixes. We also welcome reports of any issues you encounter while using methylGPT.
MethylGPT's backend architecture is largely based on scGPT, developed by the Wang Lab. As such, our project inherits and follows similar dependencies and architectural patterns. We acknowledge and thank the scGPT team for their foundational work.
We sincerely thank the authors of the following open-source projects:
@article{ying2024methylgpt,
title={MethylGPT: a foundation model for the DNA methylome},
author={Ying, Kejun and Song, Jinyeop and Cui, Haotian and Zhang, Yikun and Li, Siyuan and Chen, Xingyu and Liu, Hanna and Eames, Alec and McCartney, Daniel L and Marioni, Riccardo E and others},
journal={bioRxiv},
pages={2024--10},
year={2024},
publisher={Cold Spring Harbor Laboratory}
}