- TensorBoard / wandb support
- A background generator is used to prefetch batches in a separate thread (a minimal sketch of the wrapping pattern follows this list)
  - On Windows the background generator is not supported, so if an error occurs, set `use_background_generator` to false in the config
- Training state and network checkpoint saving, loading
  - The training state includes not only the network weights but also the optimizer state, step, and epoch.
  - A checkpoint includes only the network weights. This can be used for inference.
- Hydra and OmegaConf are supported
- Distributed learning using Distributed Data Parallel is supported
- Config with yaml file / easy dot-style access to config
- Code lint / CI
- Code testing with pytest
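As referenced above, a minimal sketch of the usual background-generator wrapping pattern (the `DataLoaderX` name is hypothetical; the template's actual wiring may differ):

```python
from prefetch_generator import BackgroundGenerator
from torch.utils.data import DataLoader


class DataLoaderX(DataLoader):  # hypothetical name, for illustration only
    """DataLoader whose iterator prefetches batches in a background thread."""

    def __iter__(self):
        # BackgroundGenerator keeps fetching the next batch while the
        # training step consumes the current one.
        return BackgroundGenerator(super().__iter__())
```

A `use_background_generator` flag in the config would then simply select between this wrapper and the plain `DataLoader`.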
- `assets` dir: icon image of Pytorch Project Template. You can remove this directory.
- `config` dir: directory for config files
- `dataloader` dir: dataloader and dataset code goes here. Also, put datasets in the `meta` dir.
- `model` dir:
  - `model.py` is for wrapping the network architecture.
  - `model_arch.py` is for coding the network architecture.
- `tests` dir: directory for `pytest` test code. You can check your network's tensor flow by adapting `tests/model/net_arch_test.py`: just copy & paste the `Net_arch.forward` method into `net_arch_test.py` and add `assert` statements to check the tensors (a sketch follows this list).
- `utils` dir:
  - `utils.py` is for utilities; random seed setting, dot-access hyperparameters, getting the commit hash, etc. live here.
  - `writer.py` is for writing logs to tensorboard / wandb.
- `trainer.py` file: this sets up and iterates over epochs.
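For example, a shape test along the lines described above might look like this (constructor arguments and dimensions are assumptions; adapt them to your network):

```python
import torch

from model.model_arch import Net_arch


def test_net_arch():
    net = Net_arch()  # hypothetical: pass your config/hyperparameters here
    x = torch.rand(4, 64)  # hypothetical input: batch of 4, feature dim 64
    y = net(x)  # or inline the body of Net_arch.forward and assert each step
    assert y.shape == (4, 10)  # hypothetical output shape
```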
── configs <- Hydra configuration files
├── default.yaml <- entry point of hydra
├── datamodule <- Datamodule configs folder (add your data config under it)
| └── default.yaml <- Datamodule default yaml
├── hydra/job_logging <- Hydra configs
| └── custom.yaml <- customize hydra config here
├── model <- Model configs (add your model config under it)
| └── default.yaml <- Model default yaml
└── general <- general config for training
└── default.yaml <- General default yaml (e.g. logger)
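Given this layout, the `default.yaml` entry point plausibly composes the groups through a Hydra defaults list; a minimal sketch, not the template's verbatim file:

```yaml
defaults:
  - datamodule: default                 # configs/datamodule/default.yaml
  - model: default                      # configs/model/default.yaml
  - general: default                    # configs/general/default.yaml
  - override hydra/job_logging: custom  # 'override' is required on Hydra >= 1.1
```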
- Support for multiple dataloaders and multiple models in one project
  - Simply write another `model.network.Network` class and `dataloader.dataloader.Dataset_` class
  - Then add two more yaml files for the new network and dataloader, one under `config/model` and one under `config/datamodule` (a sketch follows this list)
  - These yaml files can be quite similar to the `default.yaml` under `config/model` and `config/datamodule`
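For instance, a hypothetical `config/model/my_network.yaml` could differ from the default only in its `_target_` (file and class names below are assumptions):

```yaml
# config/model/my_network.yaml (hypothetical)
_target_: model.network.MyNetwork  # the new network class to instantiate
# extra options for the new network would go here
```

The new config could then be selected with a Hydra group override, e.g. `python trainer.py model=my_network`.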
- python3 (3.6, 3.7, 3.8 are tested)
- Write the PyTorch version you want to use in `requirements.txt` (https://pytorch.org/get-started/)
- `pip install -r requirements.txt`
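For example, `requirements.txt` could pin a specific release (the version below is purely illustrative):

```
torch==1.8.1
```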
- Configs are written in yaml files
  - You can choose configs at `config/default.yaml`. Custom configs are under `config/job/`
- `general` field (sketched in yaml below)
  - `name` is the name of the training run.
  - `working_dir` is the root directory for saving checkpoints and logs.
  - `device` is the device mode for running your model. You can choose `cpu` or `cuda`.
  - `random_seed` is for setting the python, numpy, and pytorch random seeds.
  - `dist` is for configuring Distributed Data Parallel.
    - `gpus` is the number of GPUs you want to use with DDP (the `gpus` value is used as `world_size` in DDP). DDP is not used when `gpus` is 0; all GPUs are used when `gpus` is -1.
    - `timeout` is the timeout in seconds for process interaction in DDP. When this is set to `~`, the default timeout (1800 seconds) is applied in `gloo` mode and the timeout is turned off in `nccl` mode.
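A yaml sketch of the `general` field (values are illustrative, not the template's defaults):

```yaml
general:
  name: my_experiment    # name of this training run (illustrative)
  working_dir: ./output  # root dir for checkpoints and logs (illustrative)
  device: cuda           # cpu or cuda
  random_seed: 42        # seeds python, numpy, and pytorch
  dist:
    gpus: -1             # 0: no DDP, -1: all GPUs; used as world_size in DDP
    timeout: ~           # ~: 1800 s default in gloo mode, disabled in nccl mode
```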
- `data` field (sketched in yaml below)
  - Configs for the dataloader.
  - Change `_target_` to the dataloader (or data generator) you want to use.
  - `train_dir` / `test_dir` are globbed with `file_format` for the dataloader.
  - If `divide_dataset_per_gpu` is true, the original dataset is divided into sub-datasets, one per GPU. This means the size of the original dataset should be a multiple of the number of GPUs in use. If this option is false, the dataset is not divided, but the epoch count goes up in multiples of the number of GPUs.
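A yaml sketch of the `data` field (paths and glob pattern are illustrative):

```yaml
data:
  _target_: dataloader.dataloader.Dataset_  # dataloader/dataset class to instantiate
  train_dir: dataloader/meta/train          # globbed with file_format (illustrative)
  test_dir: dataloader/meta/test            # globbed with file_format (illustrative)
  file_format: "*.png"                      # glob pattern (illustrative)
  divide_dataset_per_gpu: true              # split the dataset across GPUs
```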
- `model` field (sketched in yaml below)
  - Configs for the network architecture and options for the model.
  - Change `_target_` to the model network you want to use.
  - You can add configs in yaml format to configure your network.
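A yaml sketch of the `model` field (the extra keys are illustrative options a network might consume):

```yaml
model:
  _target_: model.network.Network  # network class to instantiate
  hidden_dim: 128                  # illustrative option passed to your network
  num_layers: 4                    # illustrative option passed to your network
```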
- `log` field (sketched in yaml below)
  - Configs for logging, including tensorboard / wandb logging.
  - `summary_interval` and `checkpoint_interval` are the intervals, in steps and epochs respectively, between training log writes and checkpoint saves.
  - Checkpoints and logs are saved under `working_dir/chkpt_dir` and `working_dir/trainer.log`. Tensorboard logs are saved under `working_dir/outputs/tensorboard`.
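A yaml sketch of the `log` field (interval values are illustrative; `use_wandb` also appears in the example commands further below):

```yaml
log:
  use_wandb: false         # enable wandb logging
  summary_interval: 20     # steps between training log writes (illustrative)
  checkpoint_interval: 10  # epochs between checkpoint saves (illustrative)
```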
- `load` field (sketched in yaml below)
  - Loading from the wandb server is supported.
  - `wandb_load_path` is the `Run path` shown in the overview of a run. If you don't want to use wandb loading, this field should be `~`.
  - `network_chkpt_path` is the path to the network checkpoint file. If using wandb loading, this field should be the checkpoint file name of the wandb run.
  - `resume_state_path` is the path to the training state file. If using wandb loading, this field should be the training state file name of the wandb run.
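A yaml sketch of the `load` field with wandb loading disabled:

```yaml
load:
  wandb_load_path: ~     # "Run path" of a wandb run, or ~ to disable wandb loading
  network_chkpt_path: ~  # path to a network checkpoint file (or wandb file name)
  resume_state_path: ~   # path to a training state file (or wandb file name)
```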
- `pip install -r requirements-dev.txt` to install development dependencies (this requires python 3.6 or above because of black)
- `pre-commit install` to add pre-commit to the git hooks
- `conda env create -f environment.yml` to create a conda venv from environment.yml, though some of the dependencies (e.g. cuda) may conflict with your environment
```shell
python trainer.py device=cuda
python trainer.py log.use_wandb=true device=cpu
python trainer.py device=cuda dist.gpus=-1 train.num_epoch=100
```
- Implement progress bar in multi-GPU training
- Multi-GPU training with `hydra.utils.instantiate()`