NVIDIA BioNeMo Framework 2.1
New Features:
- ESM2 Implementation
- Updated the ESM-2 Model Card with detailed performance benchmarks comparing BioNeMo2 training against vanilla PyTorch.
- Added ESM-2 inference endpoint for evaluating pre-trained models
- Size-Aware Batching
- Added SizeAwareBatchSampler, a PyTorch data sampler that batches elements of varying sizes while ensuring that the total size of each batch does not exceed a specified maximum.
- Added BucketBatchSampler, another PyTorch data sampler that groups elements of varying sizes into predefined bucket ranges and creates batches from the elements of each bucket, so that each batch contains elements of homogeneous size.
- CLI Support
- Added a pydantic interface for pretraining jobs that parses JSON configuration files and enables passing customized Model and DataModule classes.
- Implemented pydantic configuration for Geneformer and ESM2 pretraining and finetuning.
- Added 'recipes' for generating validated JSON files to be used with the pydantic interface.
- Added installable scripts for the pydantic configurations and recipes above: bionemo-esm2-recipe, bionemo-esm2-train, bionemo-geneformer-recipe, bionemo-geneformer-train.
- Geneformer support in BioNeMo2:
- Tested pre-training scripts and fine-tuning example scripts that can be used as a starting point for users to create custom derivative models.
- Geneformer 10M and 106M checkpoints ported from BioNeMo v1 into BioNeMo v2 available and included in documentation.
- Added inference scripts
- Documentation
- Added a cell type classification example notebook that covers converting anndata into our internal format, running inference on that data with a Geneformer checkpoint, and making use of the inference results.
- Updated Getting Started guide, ESM-2 tutorials
- Added Frequently Asked Questions (FAQ) page
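As a rough illustration of the size-aware batching idea listed above, the sketch below greedily groups element indices so that the summed size of each batch stays within a maximum. It is a minimal pure-Python sketch, not the SizeAwareBatchSampler API: the function name `size_aware_batches`, its parameters, and the choice to raise an error on oversized elements are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List


def size_aware_batches(
    indices: Iterable[int],
    sizeof: Callable[[int], int],
    max_total_size: int,
) -> Iterator[List[int]]:
    """Greedily yield batches whose summed element size is <= max_total_size.

    Hypothetical helper for illustration only; not part of BioNeMo.
    """
    batch: List[int] = []
    batch_size = 0
    for idx in indices:
        size = sizeof(idx)
        # Assumption: an element that cannot fit in any batch is an error.
        if size > max_total_size:
            raise ValueError(f"element {idx} has size {size} > {max_total_size}")
        # Close the current batch if adding this element would overflow it.
        if batch and batch_size + size > max_total_size:
            yield batch
            batch, batch_size = [], 0
        batch.append(idx)
        batch_size += size
    # Emit the final, possibly partial, batch.
    if batch:
        yield batch


# Example: sequence lengths stand in for per-element memory cost.
lengths = [5, 3, 4, 8, 2, 6]
batches = list(
    size_aware_batches(range(len(lengths)), lambda i: lengths[i], max_total_size=10)
)
print(batches)  # [[0, 1], [2], [3, 4], [5]]
```

Because the grouping is greedy and order-preserving, batches can end up with different numbers of elements; grouping elements into size buckets first, as BucketBatchSampler is described as doing, is one way to make batch contents more homogeneous.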
Changes
- Final October docs edits by @tshimko-nv in #331
- Update container location and tag for 2.0 release by @tshimko-nv in #337
- Remove broken Release Notes links from v2.0 docs build by @tshimko-nv in #343
- Tell pytest to ignore 3rdparty/{NeMo,MegatronLM} by @malcolmgreaves in #61
- Add back the removed bionemo-core sub-package by @malcolmgreaves in #25
- Fix bionemo-size-aware-batching, standardize pyproject.toml's & dependencies by @malcolmgreaves in #284
- Add check bug fix label workflow by @yzhang123 in #250
- Adds geneformer overview by @skothenhill-nv in #279
- Add ESM2 Dataset and Datamodule by @pstjohn in #78
- Test checkpoint IO loss is close to expected. by @jstjohn in #37
- fix post-create command by @pstjohn in #152
- Drop dependency to internal docs by @farhadrgh in #303
- Add initial configuration for mike (version management for docs) by @tshimko-nv in #330
- Update ESM2 model card with benchmarks by @pstjohn in #341
- Geneformer PEFT by @gwarmstrong in #155
- Update initialization in response to VDR by @tshimko-nv in #334
- Add GitHub workflow by @ohadmo in #9
- Reorganize bionemo-contrib into namespace packages by @malcolmgreaves in #51
- Improve ESM2 pretraining tutorial from VDR feedback by @tshimko-nv in #336
- install geometric dependencies before invalidating caches with source copy by @pstjohn in #224
- ESM2 LoRA by @gwarmstrong in #218
- chown /usr/local's dist-packages to allow editing them in the devcontainer by @pstjohn in #111
- add search highlight + code copy capabilities by @jwilber in #102
- ESM2 implementation by @farhadrgh in #28
- Fix broken docs links on mike build by @tshimko-nv in #344
- Updates to Getting Started docs by @tshimko-nv in #179
- fix post-create command by @pstjohn in #88
- refactor doc structure and look by @jwilber in #143
- Make ruff check pre-commit hook follow what CI does by @malcolmgreaves in #201
- Add bionemo-geometric: A component library for PyTorch Geometric Models & Data by @malcolmgreaves in #110
- [FEA] size-aware batching: a package for creating mini-batch in a memory consumption-aware manner by @DejunL in #168
- ESM2 Finetune bug fix and update by @farhadrgh in #197
- add dev tools to devcontainer build by @pstjohn in #210
- places the ptl artifacts ignore lines to the root directory only. by @skothenhill-nv in #21
- Jared/v2 main/nvidia styles by @jwilber in #101
- rename bionemo-fw-ea to bionemo-framework by @yzhang123 in #292
- Add BERT-style masking function by @pstjohn in #55
- Add perplexity logging by @sichu2023 in #144
- support nsys profiling on ESM2, add downstream improvements to hit P0 perf by @sichu2023 in #300
- trivial commit to bionemo2 by @broland-hat in #19
- Add geneformer bionemo1 disclaimer by @jstjohn in #278
- Split out the lightning example tutorial by @jstjohn in #67
- Move v2 commits over. by @jstjohn in #8
- Add documentation covering megatron and code structure rationale by @jstjohn in #153
- try out gh page url to resolve 404 error by @jwilber in #233
- lowercase file name so mkdocs picks up correctly by @jwilber in #173
- use importlib resources for files by @pstjohn in #178
- add nemo-run as a git submodule by @pstjohn in #186
- Add module for loading test data. by @pstjohn in #120
- LightningDataModule for webdataset by @DejunL in #100
- Update dependency tags to match PR #36, and try to fix test failure by @jstjohn in #39
- Change to gelu default from relu which is what we actually used before by @jstjohn in #20
- Jwilber/load nb from subpackages by @jwilber in #128
- Use github runners to run pre-commit hooks by @pstjohn in #42
- Bump 3rdparty/NeMo from ff7c614 to 8f0d0c7 by @dependabot in #145
- Add a tested function to see if model parallel is enabled by @jstjohn in #175
- Handle special tokens in the bert masking function by @pstjohn in #99
- Fix all license headers to Apache by @trvachov in #347
- add dependabot file by @pstjohn in #161
- Checkpointing example with Geneformer by @skothenhill-nv in #24
- epoch-level shuffling in ESM2 dataset by @pstjohn in #150
- Bump 3rdparty/Megatron-LM from 0bda578 to 08e80b0 by @dependabot in #183
- move CI scripts to central location by @pstjohn in #131
- setuptools sub-package local vs. publish by @malcolmgreaves in #133
- Nested weight munging fine-tuning/continue training example and test for example model and geneformer. by @jstjohn in #97
- ESM2 Golden Value Testing by @farhadrgh in #85
- Add pretraining documentation by @sichu2023 in #283
- Wandb integration by @olachinkei in #205
- Fix address in docs by @farhadrgh in #297
- update branch name bionemo2 by @dorotat-nv in #160
- Updated README documentation for bionemo-{fw,core} by @malcolmgreaves in #285
- Bump NeMo/Mcore by @skothenhill-nv in #127
- Fix variable length ESM2 pretraining by @sichu2023 in #306
- Establish CODEOWNERS for bionemo2 by @malcolmgreaves in #121
- Revert "Jwilber/load nb from subpackages" by @pstjohn in #140
- add ci/scripts from ci repo - bionemo2 by @dorotat-nv in #214
- adding merge_group to existing actions by @pstjohn in #71
- Add license check to pre-commit hook by @ohadmo in #22
- ESM2 Fine-tune datamodule - epoch sampler by @farhadrgh in #202
- Updates README by @skothenhill-nv in #282
- Add some additional submodule commands to README by @pstjohn in #147
- Add documentation build system for BioNeMo v2 by @pstjohn in #40
- Add future TE support and mixed precision support to biobert test by @jstjohn in #43
- add internal link for devcontainer cache by @pstjohn in #105
- Bump 3rdparty/NeMo from 0f8a531 to a7d1896 by @dependabot in #172
- remove extra row in datasets by @pstjohn in #295
- Add a CLI option to restore training from a nemo1 checkpoint by @jstjohn in #54
- Fixing security scan vulnerabilities by @ohadmo in #104
- Update devcontainer for new package structure by @pstjohn in #62
- add megatron datasets background, restructure background folder by @pstjohn in #237
- Pbinder/readme modify by @polinabinder1 in #115
- Adding Dataloading Test cases and documentation by @polinabinder1 in #107
- Bump 3rdparty/Megatron-LM from cf0f9b2 to ef85bc9 by @dependabot in #124
- Replace Launcher Script with Justfile & Standalone Scripts + Instructions for External by @malcolmgreaves in #239
- Add some additional ruff checks, ignoring existing violations by @pstjohn in #56
- ESM2 Model Card by @farhadrgh in #234
- FineTuning tutorial update [VDR] by @tshimko-nv in #342
- remove conftest and make glob more specific by @pstjohn in #256
- Pbinder/move scdl by @polinabinder1 in #76
- fix infer_global_batch_size by @sichu2023 in #261
- Adding license, and contributing guidelines from #72 and #65 by @jstjohn in #74
- Nemo 2 Model Checkpoint Load Test by @malcolmgreaves in #270
- Reconfigure the pre-commit workflow by @pstjohn in #63
- ESM2 Finetuning README by @farhadrgh in #240
- add tokenization test by @pstjohn in #169
- Add option to restore HF masking strategy by @sichu2023 in #177
- Update README.md with marketing by @ktretina in #289
- Make artifact downloads more robust by @pstjohn in #41
- Updates to devcontainer by @skothenhill-nv in #77
- Jwilber/fix docs dir creation by @jwilber in #227
- refactor lightning module by @malcolmgreaves in #123
- add jupyter notebook support in documentation by @pstjohn in #109
- Remove links in Overview Docs by @tshimko-nv in #338
- edits to make tests more amenable to being run against an installed package by @pstjohn in #154
- adding github action for docs deployment by @pstjohn in #98
- Bump 3rdparty/NeMo from 18d81b1 to 0f8a531 by @dependabot in #156
- Add mamba as a dependency in the dockerfile by @pstjohn in #44
- Fix DATA_DIR in esm2 pretraining by @pstjohn in #298
- Add trufflehog as a github action check by @pstjohn in #45
- package resource files with installed package by @pstjohn in #137
- Promote nltk version to address GHSA-cgvx-9447 by @ohadmo in #114
- bugfix: mkdocs-gen-files expects paths relative to the script's location by @jwilber in #141
- multi-epoch dataset resamplers by @pstjohn in #174
- add docs test and remove duplicate pytest call by @pstjohn in #231
- moving test data around by @polinabinder1 in #118
- [cye/esm2-peft-tutorial] Add tutorial for ESM2 fine-tuning (training and inference), and PEFT training (but not inference). by @cspades in #263
- Fix Geneformer huggingface links by @ohadmo in #106
- adding some additional docstrings by @pstjohn in #81
- Bump 3rdparty/NeMo from e6c0e72 to ff7c614 by @dependabot in #103
- updating uniprot dataset card by @pstjohn in #200
- Complete ESM2 pretraining by @sichu2023 in #112
- Migrate ESM2 to transformer engine by @sichu2023 in #199
- Add NeMo and Megatron-LM as git submodules by @pstjohn in #52
- fixing devcontainer target by @pstjohn in #64
- Add check bug fix label workflow by @yzhang123 in #243
- move ESM2 dataset's odd rng call to use random_utils by @pstjohn in #280
- convert root_directory to a field with default_factory by @pstjohn in #58
- Add devcontainer config for bionemo2 by @pstjohn in #5
- Adds CONTRIBUTING, CODE-REVIEW guides and pull request template by @malcolmgreaves in #10
- Add-back BioBertLightningModule to fix model load bug by @malcolmgreaves in #268
- Update Getting Started documentation to reflect BioNeMo2 workflow by @tshimko-nv in #208
- Update VERSION - release branch v2.0 by @dorotat-nv in #354
- Allow different model parallelism in pretrain/fine-tune or pretrain1/pretrain2 checkpoints. by @jstjohn in #276
- Fine-tuning CLI example for geneformer by @jstjohn in #139
- Fix esm2 pp/tp by @sichu2023 in #189
- Add Dorota + Peter as owners for ci scripts by @malcolmgreaves in #166
- Jwilber/dark mode code color by @jwilber in #252
- Add cc-by-4 attribution to cellxgene datacard by @jstjohn in #255
- adding better descriptions to bionemo-llm and bionemo-testing by @pstjohn in #222
- Pin ptl to <2.4.0 to fix nemo bug by @pstjohn in #86
- Add new paths for nemo2 checkpoints and update docs to use them by @jstjohn in #241
- Bump 3rdparty/Megatron-LM from 104d864 to cf0f9b2 by @dependabot in #96
- ESM2 stop and go test by @farhadrgh in #198
- Provide single team email address for authors in Python package metadata by @malcolmgreaves in #167
- Bump 3rdparty/NeMo from a7d1896 to 9ed0d6c by @dependabot in #184
- Install test deps in release image, fix scdl example_notebook by @pstjohn in #221
- Refactor lightning module by @malcolmgreaves in #235
- Clean up src vs test mirroring rule violations. by @jstjohn in #66
- Fix tach modules & unpin dev version by @malcolmgreaves in #299
- Upgrade nemo and megatron, and fix configs to reflect the change by @jstjohn in #92
- Fix ESM2 doc by @sichu2023 in #291
- Make example notebook runnable in bionemo-scdl by @jwilber in #130
- Use precision lowest value instead of -torch.inf by @farhadrgh in #35
- Add back contributing & code review guidelines by @malcolmgreaves in #142
- Jstjohn/fix geneformer multinode by @jstjohn in #17
- Resolve NaNs in ESM2 token-level fine-tuning loss by @farhadrgh in #236
- Bump 3rdparty/Megatron-LM from b6887d3 to 0bda578 by @dependabot in #171
- ESM2 finetuning by @farhadrgh in #136
- streamline python packaging with uv by @pstjohn in #135
- Stop and Go harness and tests for geneformer and GPT. by @skothenhill-nv in #116
- Revert "refactor lightning module" by @pstjohn in #217
- Megatron dataset compatibility checks by @jstjohn in #230
- add ngc url artefacts bionemo2 by @dorotat-nv in #254
- Make multi-line RUN statements fail fast by @malcolmgreaves in #225
- Fixing the CLI for NGC paths that dump to stdout by @jstjohn in #271
- Nvidia security policy document by @malcolmgreaves in #163
- bump Megatron by @pstjohn in #148
- Pstjohn/release v2.0/releasenotes memleak by @pstjohn in #329
- some fixes to test builds by @pstjohn in #246
- fixing version issue by @polinabinder1 in #90
New Contributors
- @skothenhill-nv made their first contribution in #21
- @farhadrgh made their first contribution in #28
- @polinabinder1 made their first contribution in #76
- @dependabot made their first contribution in #96
- @jwilber made their first contribution in #101
- @DejunL made their first contribution in #100
- @gwarmstrong made their first contribution in #155
- @olachinkei made their first contribution in #205
- @cspades made their first contribution in #263
- @ktretina made their first contribution in #289
- @guoqing-zhou made their first contribution in #220
- @savitha-eng made their first contribution in #339
Full Changelog: https://github.com/NVIDIA/bionemo-framework/commits/v2.1