NVIDIA BioNeMo Framework 2.0
New Features:
- ESM2 implementation
- State-of-the-art training performance and equivalent accuracy to the reference implementation
- 650M and 3B scale checkpoints are available, mirroring the reference model
- Flexible fine-tuning examples that can be copied and modified to accomplish a wide variety of downstream tasks
- First version of our NeMo v2 based reference implementation, which re-imagines BioNeMo as a repository of Megatron models, dataloaders, and training recipes that use NeMo v2 for training loops.
- Modular design and permissive Apache 2 OSS licenses enable the import and use of our framework in proprietary applications.
- NeMo2 training abstractions allow the user to focus on the model implementation while the training strategy handles distribution and model parallelism.
- Documentation and documentation build system for BioNeMo 2.
Known Issues:
- PEFT support is not yet fully functional.
- A partial implementation of Geneformer is present; use it at your own risk. It will be optimized and officially released in the future.
- The command line interface is currently based on one-off training recipes and scripts. We are working on a configuration-based approach that will be released in the future.
- The fine-tuning workflow is implemented for BERT-based architectures and could be adapted for others, but it requires you to inherit from the biobert base model config. In the short term, you can follow similar patterns to partially load weights from an old checkpoint into a new model. In the future we will provide a more direct API that is easier to follow.
- Slow memory leak occurs during ESM-2 pretraining, which can cause OOM during long pretraining runs. Training with a microbatch size of 48 on 40 A100s raised an out-of-memory error after 5,800 training steps.
  - Possible workarounds include calling `gc.collect(); torch.cuda.empty_cache()` every ~1,000 steps, which appears to reclaim the consumed memory; or training with a lower microbatch size and periodically restarting training from a saved checkpoint.
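The first workaround for the ESM-2 memory leak can be sketched as a small helper that is called once per training step. This is a minimal illustration, not part of the BioNeMo API: the function name `reclaim_memory_if_due` and its `interval` parameter are hypothetical, and where you call it from (a custom training loop or a callback hook) depends on your setup. Only `gc.collect()` and `torch.cuda.empty_cache()` come from the note above.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def reclaim_memory_if_due(step: int, interval: int = 1000) -> bool:
    """Hypothetical helper: run the workaround every `interval` steps.

    Returns True when the cleanup ran on this step, False otherwise.
    """
    if step <= 0 or step % interval != 0:
        return False
    gc.collect()  # collect unreachable Python objects first
    if torch is not None and torch.cuda.is_available():
        # Release cached, currently unused GPU memory held by PyTorch's allocator.
        torch.cuda.empty_cache()
    return True
```

Calling `reclaim_memory_if_due(global_step)` at the end of each training step triggers the cleanup at steps 1,000, 2,000, and so on. The interval of 1,000 follows the note above; a different value may suit other batch sizes or hardware.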
External Partner Contributions
We would like to thank the following organizations for their insightful discussions guiding the development of the BioNeMo Framework and their valuable contributions to the codebase. We are grateful for your collaboration.
Changes
- Add GitHub workflow by @ohadmo in #9
- Move v2 commits over. by @jstjohn in #8
- Jstjohn/fix geneformer multinode by @jstjohn in #17
- places the ptl artifacts ignore lines to the root directory only. by @skothenhill-nv in #21
- ESM2 implementation by @farhadrgh in #28
- Update dependency tags to match PR #36, and try to fix test failure by @jstjohn in #39
- Test checkpoint IO loss is close to expected. by @jstjohn in #37
- Change to gelu default from relu which is what we actually used before by @jstjohn in #20
- Make artifact downloads more robust by @pstjohn in #41
- Add devcontainer config for bionemo2 by @pstjohn in #5
- Add license check to pre-commit hook by @ohadmo in #22
- Use github runners to run pre-commit hooks by @pstjohn in #42
- Add back the removed bionemo-core sub-package by @malcolmgreaves in #25
- trivial commit to bionemo2 by @broland-hat in #19
- Add mamba as a dependency in the dockerfile by @pstjohn in #44
- Add future TE support and mixed precision support to biobert test by @jstjohn in #43
- Add trufflehog as a github action check by @pstjohn in #45
- Adds CONTRIBUTING, CODE-REVIEW guides and pull request template by @malcolmgreaves in #10
- Use precision lowest value instead of -torch.inf by @farhadrgh in #35
- Add NeMo and Megatron-LM as git submodules by @pstjohn in #52
- Add a CLI option to restore training from a nemo1 checkpoint by @jstjohn in #54
- Add some additional ruff checks, ignoring existing violations by @pstjohn in #56
- Reorganize bionemo-contrib into namespace packages by @malcolmgreaves in #51
- Update devcontainer for new package structure by @pstjohn in #62
- Tell pytest to ignore 3rdparty/{NeMo,MegatronLM} by @malcolmgreaves in #61
- Clean up src vs test mirroring rule violations. by @jstjohn in #66
- fixing devcontainer target by @pstjohn in #64
- adding merge_group to existing actions by @pstjohn in #71
- Split out the lightning example tutorial by @jstjohn in #67
- Reconfigure the pre-commit workflow by @pstjohn in #63
- convert root_directory to a field with default_factory by @pstjohn in #58
- Checkpointing example with Geneformer by @skothenhill-nv in #24
- Updates to devcontainer by @skothenhill-nv in #77
- Adding license, and contributing guidelines from #72 and #65 by @jstjohn in #74
- adding some additional docstrings by @pstjohn in #81
- Pin ptl to <2.4.0 to fix nemo bug by @pstjohn in #86
- Add documentation build system for BioNeMo v2 by @pstjohn in #40
- Add BERT-style masking function by @pstjohn in #55
- fix post-create command by @pstjohn in #88
- Pbinder/move scdl by @polinabinder1 in #76
- Add ESM2 Dataset and Datamodule by @pstjohn in #78
- Upgrade nemo and megatron, and fix configs to reflect the change by @jstjohn in #92
- Bump 3rdparty/Megatron-LM from `104d864` to `cf0f9b2` by @dependabot in #96
- ESM2 Golden Value Testing by @farhadrgh in #85
- fixing version issue by @polinabinder1 in #90
- adding github action for docs deployment by @pstjohn in #98
- Jared/v2 main/nvidia styles by @jwilber in #101
- Handle special tokens in the bert masking function by @pstjohn in #99
- add search highlight + code copy capabilities by @jwilber in #102
- add internal link for devcontainer cache by @pstjohn in #105
- Fix Geneformer huggingface links by @ohadmo in #106
- Fixing security scan vulnerabilities by @ohadmo in #104
- add jupyter notebook support in documentation by @pstjohn in #109
- Adding Dataloading Test cases and documentation by @polinabinder1 in #107
- Bump 3rdparty/NeMo from `e6c0e72` to `ff7c614` by @dependabot in #103
- Pbinder/readme modify by @polinabinder1 in #115
- Promote nltk version to address GHSA-cgvx-9447 by @ohadmo in #114
- moving test data around by @polinabinder1 in #118
- Bump 3rdparty/Megatron-LM from `cf0f9b2` to `ef85bc9` by @dependabot in #124
- Establish CODEOWNERS for bionemo2 by @malcolmgreaves in #121
- chown /usr/local's dist-packages to allow editing them in the devcontainer by @pstjohn in #111
- Stop and Go harness and tests for geneformer and GPT. by @skothenhill-nv in #116
- Bump NeMo/Mcore by @skothenhill-nv in #127
- Complete ESM2 pretraining by @sichu2023 in #112
- LightningDataModule for webdataset by @DejunL in #100
- Add module for loading test data. by @pstjohn in #120
- Jwilber/load nb from subpackages by @jwilber in #128
- Nested weight munging fine-tuning/continue training example and test for example model and geneformer. by @jstjohn in #97
- Make example notebook runnable in bionemo-scdl by @jwilber in #130
- Revert "Jwilber/load nb from subpackages" by @pstjohn in #140
- package resource files with installed package by @pstjohn in #137
- move CI scripts to central location by @pstjohn in #131
- Fine-tuning CLI example for geneformer by @jstjohn in #139
- bugfix: mkdocs-gen-files expects paths relative to the script's location by @jwilber in #141
- Add back contributing & code review guidelines by @malcolmgreaves in #142
- setuptools sub-package local vs. publish by @malcolmgreaves in #133
- Add some additional submodule commands to README by @pstjohn in #147
- bump Megatron by @pstjohn in #148
- fix post-create command by @pstjohn in #152
- epoch-level shuffling in ESM2 dataset by @pstjohn in #150
- Bump 3rdparty/NeMo from `ff7c614` to `8f0d0c7` by @dependabot in #145
- Bump 3rdparty/NeMo from `18d81b1` to `0f8a531` by @dependabot in #156
- edits to make tests more amenable to being run against an installed package by @pstjohn in #154
- update branch name bionemo2 by @dorotat-nv in #160
- Nvidia security policy document by @malcolmgreaves in #163
- Geneformer PEFT by @gwarmstrong in #155
- Provide single team email address for authors in Python package metadata by @malcolmgreaves in #167
- Add bionemo-geometric: A component library for PyTorch Geometric Models & Data by @malcolmgreaves in #110
- Add documentation covering megatron and code structure rationale by @jstjohn in #153
- add dependabot file by @pstjohn in #161
- add tokenization test by @pstjohn in #169
- Bump 3rdparty/Megatron-LM from `b6887d3` to `0bda578` by @dependabot in #171
- refactor doc structure and look by @jwilber in #143
- lowercase file name so mkdocs picks up correctly by @jwilber in #173
- Bump 3rdparty/NeMo from `0f8a531` to `a7d1896` by @dependabot in #172
- Add a tested function to see if model parallel is enabled by @jstjohn in #175
- Add Dorota + Peter as owners for ci scripts by @malcolmgreaves in #166
- use importlib resources for files by @pstjohn in #178
- Add option to restore HF masking strategy by @sichu2023 in #177
- multi-epoch dataset resamplers by @pstjohn in #174
- Bump 3rdparty/NeMo from `a7d1896` to `9ed0d6c` by @dependabot in #184
- Bump 3rdparty/Megatron-LM from `0bda578` to `08e80b0` by @dependabot in #183
- ESM2 finetuning by @farhadrgh in #136
- Fix esm2 pp/tp by @sichu2023 in #189
- add nemo-run as a git submodule by @pstjohn in #186
- Updates to Getting Started docs by @tshimko-nv in #179
- ESM2 Finetune bug fix and update by @farhadrgh in #197
- ESM2 stop and go test by @farhadrgh in #198
- streamline python packaging with uv by @pstjohn in #135
- Add perplexity logging by @sichu2023 in #144
- Make ruff check pre-commit hook follow what CI does by @malcolmgreaves in #201
- refactor lightning module by @malcolmgreaves in #123
- Revert "refactor lightning module" by @pstjohn in #217
- add dev tools to devcontainer build by @pstjohn in #210
- Migrate ESM2 to transformer engine by @sichu2023 in #199
- ESM2 LoRA by @gwarmstrong in #218
- ESM2 Fine-tune datamodule - epoch sampler by @farhadrgh in #202
- updating uniprot dataset card by @pstjohn in #200
- install geometric dependencies before invalidating caches with source copy by @pstjohn in #224
- adding better descriptions to bionemo-llm and bionemo-testing by @pstjohn in #222
- Install test deps in release image, fix scdl example_notebook by @pstjohn in #221
- add ci/scripts from ci repo - bionemo2 by @dorotat-nv in #214
- Make multi-line RUN statements fail fast by @malcolmgreaves in #225
- [FEA] size-aware batching: a package for creating mini-batch in a memory consumption-aware manner by @DejunL in #168
- try out gh page url to resolve 404 error by @jwilber in #233
- Wandb integration by @olachinkei in #205
- Jwilber/fix docs dir creation by @jwilber in #227
- Resolve NaNs in ESM2 token-level fine-tuning loss by @farhadrgh in #236
- add docs test and remove duplicate pytest call by @pstjohn in #231
- Add check bug fix label workflow by @yzhang123 in #243
- some fixes to test builds by @pstjohn in #246
- Jwilber/dark mode code color by @jwilber in #252
- Add new paths for nemo2 checkpoints and update docs to use them by @jstjohn in #241
- Megatron dataset compatibility checks by @jstjohn in #230
- add megatron datasets background, restructure background folder by @pstjohn in #237
- Add check bug fix label workflow by @yzhang123 in #250
- ESM2 Finetuning README by @farhadrgh in #240
- remove confest and make glob more specific by @pstjohn in #256
- add ngc url artefacts bionemo2 by @dorotat-nv in #254
- Add cc-by-4 attribution to cellxgene datacard by @jstjohn in #255
- fix infer_global_batch_size by @sichu2023 in #261
- Refactor lightning module by @malcolmgreaves in #235
- [cye/esm2-peft-tutorial] Add tutorial for ESM2 fine-tuning (training and inference), and PEFT training (but not inference). by @cspades in #263
- Replace Launcher Script with Justfile & Standalone Scripts + Instructions for External by @malcolmgreaves in #239
- ESM2 Model Card by @farhadrgh in #234
- Add-back `BioBertLightningModule` to fix model load bug by @malcolmgreaves in #268
- Nemo 2 Model Checkpoint Load Test by @malcolmgreaves in #270
- Allow different model parallelism in pretrain/fine-tune or pretrain1/pretrain2 checkpoints. by @jstjohn in #276
- Fixing the CLI for NGC paths that dump to stdout by @jstjohn in #271
- Adds geneformer overview by @skothenhill-nv in #279
- Updates README by @skothenhill-nv in #282
- Add pretraining documentation by @sichu2023 in #283
- Update Getting Started documentation to reflect BioNeMo2 workflow by @tshimko-nv in #208
- Updated README documentation for bionemo-{fw,core} by @malcolmgreaves in #285
- Fix bionemo-size-aware-batching, standardize pyproject.toml's & dependencies by @malcolmgreaves in #284
- move ESM2 dataset's odd rng call to use random_utils by @pstjohn in #280
- rename bionemo-fw-ea to bionemo-framework by @yzhang123 in #292
- Fix ESM2 doc by @sichu2023 in #291
- Fix address in docs by @farhadrgh in #297
- remove extra row in datasets by @pstjohn in #295
- Update README.md with marketing by @ktretina in #289
- Fix DATA_DIR in esm2 pretraining by @pstjohn in #298
- Drop dependency to internal docs by @farhadrgh in #303
- support nsys profiling on ESM2, add downstream improvements to hit P0 perf by @sichu2023 in #300
- Fix tach modules & unpin dev version by @malcolmgreaves in #299
- Add geneformer bionemo1 disclaimer by @jstjohn in #278
- Fix variable length ESM2 pretraining by @sichu2023 in #306
- Pstjohn/release v2.0/releasenotes memleak by @pstjohn in #329
- Add initial configuration for mike (version management for docs) by @tshimko-nv in #330
- Final October docs edits by @tshimko-nv in #331
- Update initialization in response to VDR by @tshimko-nv in #334
- Improve ESM2 pretraining tutorial from VDR feedback by @tshimko-nv in #336
- Update container location and tag for 2.0 release by @tshimko-nv in #337
- Remove links in Overview Docs by @tshimko-nv in #338
- FineTuning tutorial update [VDR] by @tshimko-nv in #342
- Update ESM2 model card with benchmarks by @pstjohn in #341
- Remove broken Release Notes links from v2.0 docs build by @tshimko-nv in #343
- Fix broken docs links on mike build by @tshimko-nv in #344
- Fix all license headers to Apache by @trvachov in #347
- Update VERSION - release branch v2.0 by @dorotat-nv in #354
Full Changelog: https://github.com/NVIDIA/bionemo-framework/commits/v2.0
Documentation and Field Support
Additional support and significant documentation overhauls performed by: