
computational resource requirements #7

Open
xubin04 opened this issue Nov 9, 2024 · 2 comments

Comments


xubin04 commented Nov 9, 2024

I am very interested in running experiments based on this paper. I noticed that you specified using A6000 graphics cards for training. Could you please tell me how many you used? Could you also share the computational resource requirements for the related experiments?

PatrickESA (Contributor)

Hi @xubin04,

Each model we trained fits on a single NVIDIA RTX A6000 (with batch_size: 16). We used multiple GPUs only because we trained every model on the five folds, for all baselines and ablations. To speed things up we ran experiments in parallel on roughly 4-6 GPUs, depending on availability. However, this is not a requirement for model development, and all experiments can be run sequentially on a single device. Hope this helps!
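For anyone wondering what this looks like in practice, here is a minimal sketch of running the five folds either sequentially on one GPU or in parallel across several. The train.py entry point and its --fold flag are placeholders for illustration, not the repository's actual interface; check the README for the real commands.

```python
# Hypothetical sketch: "train.py" and "--fold" are placeholder names,
# not this repository's actual entry point or arguments.
import os
import subprocess

FOLDS = range(5)

# Option 1: all five folds sequentially on a single GPU.
for fold in FOLDS:
    subprocess.run(["python", "train.py", f"--fold={fold}"], check=True)

# Option 2: spread folds across GPUs (e.g. one fold per GPU) to finish sooner.
procs = []
for fold in FOLDS:
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(fold)}  # pin each run to one GPU
    procs.append(subprocess.Popen(["python", "train.py", f"--fold={fold}"], env=env))
for p in procs:
    p.wait()
```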

Multihuntr (Owner) commented Nov 19, 2024

Hi @xubin04. Thanks for your interest. I would like to add some detail to what Patrick said. All models can be trained on a single machine, provided that machine has enough GPU VRAM and main RAM.

GPU VRAM:

  • UTAE and all the other CNN U-Net backbones with a batch size of 16 require ~10 GB.
  • MaxViT with a batch size of 16 requires >40 GB (see the VRAM-check sketch below).
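As a quick way to verify these numbers on your own card, here is a minimal PyTorch sketch; model and batch are placeholders for whichever backbone and size-16 batch you are running, not code from this repository.

```python
# Minimal sketch for measuring peak GPU memory of one training step with PyTorch.
# `model` and `batch` are placeholders; substitute your own backbone and inputs.
import torch

torch.cuda.reset_peak_memory_stats()

out = model(batch)      # forward pass
loss = out.mean()       # stand-in loss; gradients count toward the peak too
loss.backward()

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM: {peak_gib:.1f} GiB")  # ~10 GiB for the CNN U-Nets, >40 GiB for MaxViT
```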

Main RAM: We provide two flags for keeping the dataset in RAM to make it run faster: cache_context_in_ram and cache_local_in_ram.

  • Both False: the process requires <2 GB of RAM.
  • Both True (default): requires ~24 GB. The speedup is 4x on my machine. Recommended to use num_workers=1 (the default). A generic sketch of this caching pattern follows the list.
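To make concrete what these flags trade off, below is a generic sketch of the lazy "cache in RAM" pattern. It is an illustration only; the class name, argument name, and read logic are assumptions, not the repository's actual dataset code.

```python
# Illustrative only: lazily cache raster reads in RAM when the flag is on.
import rasterio
from torch.utils.data import Dataset

class CachedRasterDataset(Dataset):
    def __init__(self, paths, cache_in_ram=True):
        self.paths = paths              # list of raster file paths
        self.cache_in_ram = cache_in_ram
        self._cache = {}                # filled on the fly as items are first read

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        if self.cache_in_ram and idx in self._cache:
            return self._cache[idx]     # fast path: already in RAM
        with rasterio.open(self.paths[idx]) as src:
            arr = src.read()            # slow path: decode from disk
        if self.cache_in_ram:
            self._cache[idx] = arr      # keep it around for later epochs
        return arr
```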

Warnings:

  1. Don't set cache_context_in_ram and cache_local_in_ram to different values. I wanted to respond sooner with concrete numbers, but I discovered that there is a memory leak somewhere in rasterio. I don't fully understand the conditions, but enabling both caching flags actually uses less memory than enabling only one of them, because it avoids the memory leak.
  2. With the caching flags set to True, RAM usage scales with num_workers, because caching is done on the fly in each worker. This was chosen so that training can begin immediately, without loading everything into RAM at the start (see the worker sketch below).
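To illustrate warning 2: each DataLoader worker is a separate process holding its own copy of the dataset object, so any cache built on the fly is duplicated once per worker. The sketch below reuses the hypothetical CachedRasterDataset from the previous snippet; `paths` is a placeholder list of raster files, and none of this is repository-specific.

```python
# Sketch of why RAM scales with num_workers: every worker process has an
# independent copy of the dataset, so caches are never shared between them.
import os
from torch.utils.data import DataLoader, get_worker_info

def show_worker(_worker_id):
    info = get_worker_info()
    # `info.dataset` is this worker's private copy; its cache grows separately.
    print(f"worker {info.id} (pid {os.getpid()}) has dataset copy {id(info.dataset)}")

dataset = CachedRasterDataset(paths, cache_in_ram=True)  # from the sketch above
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=1,       # with caching on, total RAM ~ num_workers x cache size
    worker_init_fn=show_worker,
)
```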
