Learn how to run inference with Cosmos Diffusion-based World Foundation Models (WFMs) using the NVIDIA NeMo Framework for your custom Physical AI tasks by following this guide.
The NeMo Framework supports the following Cosmos Diffusion models. Review the available models and their compute requirements for post-training and inference to determine the best model for your use case.
Model Name | Model Status | Compute Requirements for Inference | Multi-GPU Support |
---|---|---|---|
Cosmos-1.0-Diffusion-7B-Text2World | Supported | 1 NVIDIA GPU* | Supported |
Cosmos-1.0-Diffusion-14B-Text2World | Supported | 1 NVIDIA GPU* | Supported |
Cosmos-1.0-Diffusion-7B-Video2World | Supported | 1 NVIDIA GPU* | Supported |
Cosmos-1.0-Diffusion-14B-Video2World | Supported | 1 NVIDIA GPU* | Supported |
* H100-80GB or A100-80GB GPUs are recommended.
Cosmos Diffusion-based WFMs can also be post-trained for a variety of Physical AI tasks and used for inference. Review the following table for a list of available Physical AI post-training tasks:
Post-training Task | Inference Support Status |
---|---|
General post-training | Supported |
Instruction control | Coming Soon |
Action control | Coming Soon |
Camera control | Coming Soon |
Multi-view generation | Coming Soon |
Multi-view generation with vehicle trajectory control | Coming Soon |
- System Configuration
- NVIDIA GPU and driver: Ensure you have access to the minimum compute required to run the model(s), as listed in the model support matrix (an optional check is shown after this list).
- Containerization Platform: We recommend using Docker with NVIDIA Container Runtime (alternatively, you may use NVIDIA enroot).
- Get your Hugging Face User Access Token, which is required to obtain the Cosmos models for training and inference.
- Get your Weights and Biases API Key for logging and tracking.
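For the optional check mentioned above, commands along these lines can confirm that the driver and the NVIDIA Container Runtime see your GPUs before you proceed. The CUDA image tag is only an illustrative example, not a requirement of this workflow:

# Verify the driver and GPUs visible on the host.
nvidia-smi

# Verify that Docker can reach the GPUs through the NVIDIA Container Runtime.
# The CUDA image tag below is illustrative; any recent CUDA base image works.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi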
Clone the Cosmos repository:
git clone git@github.com:NVIDIA/Cosmos.git
The NeMo Framework container supports post-training and inference for Cosmos Diffusion models.
Run the following command to download and start the container:
docker run --ipc=host -it --gpus=all \
-v $PATH_TO_COSMOS_REPO:/workspace/Cosmos \
nvcr.io/nvidia/nemo:cosmos.1.0.1 bash
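The mount source $PATH_TO_COSMOS_REPO should point at the clone from the previous step; it is not set anywhere else in this guide, so export it first. The path below is only an example:

# Point PATH_TO_COSMOS_REPO at your local clone of the Cosmos repository.
export PATH_TO_COSMOS_REPO=$(pwd)/Cosmos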
To help you get started, we've provided a download script to get the Cosmos Diffusion Text2World and Video2World checkpoints from Hugging Face. These checkpoints are in the NeMo distributed checkpoint format required to run post-training and inference with NeMo Framework.
- Set the following environment variables:

# You must set HF_HOME before running this script.
export HF_TOKEN="<your/HF/access/token>"
export HF_HOME="<path/to/store/checkpoints>"
- Run the following command to download the models:

cd /workspace/Cosmos
python cosmos1/models/diffusion/nemo/download_diffusion_nemo.py
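As an optional sanity check, you can list the Hugging Face cache after the script finishes; the exact directory layout depends on the download script and may differ from this sketch:

# List the downloaded checkpoints in the Hugging Face cache (layout may vary).
ls "$HF_HOME"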
Running inference with Cosmos Diffusion Text2World models lets you generate a video conditioned on a text prompt. With the Video2World models, you can generate a video conditioned on a text prompt as well as on an image or video. Note that when supplying an image or video for conditioning, the following requirements must be met (an optional frame-count check is shown after this list):
- Video: The video must be less than 9 frames long.
- Image: The image must be in PNG or JPEG format and have one of the following extensions: .png, .jpg, or .jpeg.
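For the optional frame-count check referenced above, something along these lines works, assuming ffprobe (part of ffmpeg) is installed; it is not required by the workflow itself:

# Count the frames of a candidate conditioning video.
ffprobe -v error -count_frames -select_streams v:0 \
  -show_entries stream=nb_read_frames -of csv=p=0 "<path/to/conditioning/video>"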
Our inference script enables accelerated world generation with context parallelism, which splits the diffusion process across multiple GPUs and provides near-linear scaling efficiency. The diffusion pipeline also lets you set a variety of hyperparameters, including the random seed, classifier-free guidance scale, negative prompt, video resolution, and video FPS.
General post-training is essentially a continuation of pre-training. To perform inference with models that have been post-trained with general post-training, set the subject_name parameter to the subject the model was post-trained on. The prompt and conditioned_image_or_video_path parameters then provide the setting and describe the events in the generated world. The final prompt will be "A video of sks {subject_name}. {prompt}". You can also use inference/general.py or inference/video2world.py to perform inference on the base models, since the model architectures are the same as the general post-trained models.
We also provide the option to upsample the prompt and make it more detailed, which can improve the quality of the generated world. Note that for Video2World generation, the LLM currently looks only at your text prompt when upsampling; it does not consider your input image or video. We will add text + image processing for prompt upsampling in the near future.
Complete the following steps to generate a new output video of a robot cooking.
- Set the following environment variables:

# HuggingFace cache to save T5 text encoder, video tokenizer, prompt upsampler, and guardrails weights.
export HF_TOKEN="<your/HF/access/token>"
export HF_HOME="<path/to/store/checkpoints>"

# Number of GPU devices available for inference. Supports up to 8 GPUs for accelerated inference.
export NUM_DEVICES=1
export CUDA_VISIBLE_DEVICES=$(seq -s, 0 $((NUM_DEVICES - 1)))

# Prompt describing the world scene and actions taken by the subject (if provided).
export PROMPT="The teal robot is cooking food in a kitchen. Steam rises from a simmering pot as the robot chops vegetables on a worn wooden cutting board. Copper pans hang from an overhead rack, catching glints of afternoon light, while a well-loved cast iron skillet sits on the stovetop next to scattered measuring spoons and a half-empty bottle of olive oil."
- Run the following command:

NVTE_FUSED_ATTN=0 \
torchrun --nproc_per_node=$NUM_DEVICES cosmos1/models/diffusion/nemo/inference/general.py \
    --model Cosmos-1.0-Diffusion-7B-Text2World \
    --cp_size $NUM_DEVICES \
    --num_devices $NUM_DEVICES \
    --video_save_path "Cosmos-1.0-Diffusion-7B-Text2World.mp4" \
    --guidance 7 \
    --seed 1 \
    --prompt "$PROMPT" \
    --enable_prompt_upsampler
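To scale this run across more GPUs, only the device variables need to change, since the command derives --cp_size and --num_devices from NUM_DEVICES. For example, assuming eight GPUs are visible on the node:

# Scale the same run to 8 GPUs (assumes 8 GPUs are available).
export NUM_DEVICES=8
export CUDA_VISIBLE_DEVICES=$(seq -s, 0 $((NUM_DEVICES - 1)))
# Re-run the torchrun command above unchanged; --cp_size and --num_devices
# both track NUM_DEVICES, keeping the context parallel size equal to the GPU count.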
Complete the following steps to generate a new output video conditioned on an input video and a text prompt using the Video2World models.
- Set the following environment variables:

# HuggingFace cache to save T5 text encoder, video tokenizer, prompt upsampler, and guardrails weights.
export HF_TOKEN="<your/HF/access/token>"
export HF_HOME="<path/to/store/checkpoints>"

# Number of GPU devices available for inference. Supports up to 8 GPUs for accelerated inference.
export NUM_DEVICES=1
export CUDA_VISIBLE_DEVICES=$(seq -s, 0 $((NUM_DEVICES - 1)))

# Prompt describing the world scene and actions taken by the subject (if provided).
export PROMPT="<Supply a prompt here>"
export CONDITIONED_IMAGE_OR_VIDEO="<Path to conditioned image or video>"
- Run the following command:

NVTE_FUSED_ATTN=0 \
torchrun --nproc_per_node=$NUM_DEVICES cosmos1/models/diffusion/nemo/inference/video2world.py \
    --model Cosmos-1.0-Diffusion-7B-Video2World \
    --cp_size $NUM_DEVICES \
    --num_devices $NUM_DEVICES \
    --video_save_path "Cosmos-1.0-Diffusion-7B-Video2World.mp4" \
    --guidance 7 \
    --seed 1 \
    --prompt "$PROMPT" \
    --conditioned_image_or_video_path "$CONDITIONED_IMAGE_OR_VIDEO" \
    --num_input_frames 9 \
    --enable_prompt_upsampler
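If your source clip is longer than the conditioning requirement allows, you can produce a trimmed copy first. This is an optional, illustrative step that assumes ffmpeg is installed; the file names are placeholders and the frame count should follow the requirement stated earlier:

# Trim a longer clip down to a short conditioning input.
# Adjust -frames:v to satisfy the frame-count requirement described above.
ffmpeg -i input_long.mp4 -frames:v 8 -c:v libx264 -pix_fmt yuv420p conditioning_clip.mp4
export CONDITIONED_IMAGE_OR_VIDEO="$(pwd)/conditioning_clip.mp4"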
Create a post-trained model first by using the instructions here. Then complete the following steps to generate a new output video from this model.
- Set the following environment variables:

# HuggingFace cache to save T5 text encoder, video tokenizer, prompt upsampler, and guardrails weights.
export HF_TOKEN="<your/HF/access/token>"
export HF_HOME="<path/to/store/checkpoints>"

# Inference with post-trained model. Find the post-trained model under nemo_experiments. Example path:
export NEMO_CHECKPOINT=nemo_experiments/cosmos_diffusion_7b_text2world_finetune/default/2024-12-17_01-28-03/checkpoints/epoch=39-step=199/weights

# Number of GPU devices available for inference. Supports up to 8 GPUs for accelerated inference.
export NUM_DEVICES=1
export CUDA_VISIBLE_DEVICES=$(seq -s, 0 $((NUM_DEVICES - 1)))

# Prompt describing the world scene and actions taken by the subject (if provided).
export PROMPT="The teal robot is cooking food in a kitchen. Steam rises from a simmering pot as the robot chops vegetables on a worn wooden cutting board. Copper pans hang from an overhead rack, catching glints of afternoon light, while a well-loved cast iron skillet sits on the stovetop next to scattered measuring spoons and a half-empty bottle of olive oil."
- Run the following command:

NVTE_FUSED_ATTN=0 \
torchrun --nproc_per_node=$NUM_DEVICES cosmos1/models/diffusion/nemo/inference/general.py \
    --model Cosmos-1.0-Diffusion-7B-Text2World \
    --nemo_checkpoint "$NEMO_CHECKPOINT" \
    --cp_size $NUM_DEVICES \
    --num_devices $NUM_DEVICES \
    --video_save_path "Cosmos-1.0-Diffusion-7B-Text2World.mp4" \
    --guidance 7 \
    --seed 1 \
    --prompt "$PROMPT" \
    --subject_name "teal robot" \
    --enable_prompt_upsampler
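Before launching, it can be worth confirming that the checkpoint path resolves inside the container, since the timestamped directory under nemo_experiments differs for every training run. A minimal, optional check:

# Verify the post-trained checkpoint directory exists (the exact path differs per run).
if [ -d "$NEMO_CHECKPOINT" ]; then
  echo "Checkpoint found: $NEMO_CHECKPOINT"
else
  echo "NEMO_CHECKPOINT does not point to a directory; update the path." >&2
fi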
The following output is an example video generated from the post-trained model using general.py:
text2world_example_after_finetune.mp4
The following table details the parameters that can be modified for accelerated inference with NeMo. You can adjust these parameters to optimize performance based on your specific requirements. The model inference hyperparameters listed below have the same functionality as in Cosmos Diffusion Common Parameters.
Parameter | Description | Default |
---|---|---|
--model | Name of the Cosmos Text2World Diffusion model to use for inference. | Cosmos-1.0-Diffusion-7B-Text2World |
--prompt | Prompt which the sampled video is conditioned on. Tries to generate what is mentioned in the prompt. | None (user must provide) |
--negative_prompt | Negative prompt for improved quality. | "The video captures a series of frames showing ugly scenes..." |
--subject_name | Name of the subject the model was post-trained on. Can be left empty for base model inference. | None |
--guidance | A control mechanism that determines how closely the model follows the specified conditions (prompt) during generation. We recommend starting with a guidance of 7 and increasing it later if necessary. | 7 |
--sampler | Sampling method used for generation. Only the RES sampler from this paper is supported. | RES |
--video_save_path | Location to save generated videos. | Cosmos-1.0-Diffusion-7B-Text2World.mp4 |
--fps | Frames per second of the generated video. Cosmos Diffusion models generate videos at 24 FPS by default. | 24 |
--height | Height of the generated video. Set to 704 pixels by default, the largest supported video height for Cosmos Diffusion. | 704 |
--width | Width of the generated video. Set to 1280 pixels by default, the largest supported video width for Cosmos Diffusion. | 1280 |
--seed | Random seed for generating the initial noise sample. Changing it produces a different video for the same prompt; keep it fixed for deterministic generations. | 1 |
--num_devices | [1–8] Number of GPUs to use in parallel for inference. | 8 |
--cp_size | [1–8] Number of context parallel ranks to spawn for parallelized inference. Must be equal to num_devices. | 8 |
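As a usage example for the flags above, the earlier Text2World command can pin the frame rate, resolution, and seed explicitly; the values shown simply restate the table defaults apart from the arbitrary seed:

NVTE_FUSED_ATTN=0 \
torchrun --nproc_per_node=$NUM_DEVICES cosmos1/models/diffusion/nemo/inference/general.py \
    --model Cosmos-1.0-Diffusion-7B-Text2World \
    --cp_size $NUM_DEVICES \
    --num_devices $NUM_DEVICES \
    --video_save_path "Cosmos-1.0-Diffusion-7B-Text2World.mp4" \
    --guidance 7 \
    --fps 24 \
    --height 704 \
    --width 1280 \
    --seed 42 \
    --prompt "$PROMPT"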
- Set the following environment variables:

# HuggingFace cache to save T5 text encoder, video tokenizer, prompt upsampler, and guardrails weights.
export HF_TOKEN="<your/HF/access/token>"
export HF_HOME="<path/to/store/checkpoints>"

# Inference with post-trained model. Find the post-trained model under nemo_experiments. Example path:
export NEMO_CHECKPOINT=nemo_experiments/cosmos_diffusion_7b_video2world_finetune/default/2025-02-03_11-57-33/checkpoints/epoch=39-step=199/weights

# Number of GPU devices available for inference. Supports up to 8 GPUs for accelerated inference.
export NUM_DEVICES=1
export CUDA_VISIBLE_DEVICES=$(seq -s, 0 $((NUM_DEVICES - 1)))

export PROMPT="<Supply a prompt here>"
export CONDITIONED_IMAGE_OR_VIDEO="<Path to conditioned image or video>"
- Run the following command:

NVTE_FUSED_ATTN=0 \
torchrun --nproc_per_node=$NUM_DEVICES cosmos1/models/diffusion/nemo/inference/video2world.py \
    --model Cosmos-1.0-Diffusion-7B-Video2World \
    --nemo_checkpoint "$NEMO_CHECKPOINT" \
    --cp_size $NUM_DEVICES \
    --num_devices $NUM_DEVICES \
    --video_save_path "Cosmos-1.0-Diffusion-7B-Video2World.mp4" \
    --guidance 7 \
    --seed 1 \
    --prompt "$PROMPT" \
    --conditioned_image_or_video_path "$CONDITIONED_IMAGE_OR_VIDEO" \
    --subject_name "teal robot" \
    --enable_prompt_upsampler
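After the run completes, you can optionally inspect the generated clip to confirm its resolution and frame rate match the defaults listed in the table that follows. This assumes ffprobe is available; it is not otherwise required:

# Report width, height, and frame rate of the generated video.
ffprobe -v error -select_streams v:0 \
  -show_entries stream=width,height,r_frame_rate \
  -of default=noprint_wrappers=1 "Cosmos-1.0-Diffusion-7B-Video2World.mp4"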
The following table details the parameters that can be modified for accelerated inference with NeMo. You can adjust these parameters to optimize performance based on your specific requirements. The model inference hyperparameters listed below have the same functionality as in Cosmos Diffusion Common Parameters.
Parameter | Description | Default |
---|---|---|
--model | Name of the Cosmos Video2World Diffusion model to use for inference. | Cosmos-1.0-Diffusion-7B-Video2World |
--prompt | Prompt which the sampled video is conditioned on. Tries to generate what is mentioned in the prompt. | None (user must provide) |
--conditioned_image_or_video_path | Input image or video used for conditioning generations. | None (user must provide) |
--negative_prompt | Negative prompt for improved quality. | "The video captures a series of frames showing ugly scenes..." |
--subject_name | Name of the subject the model was post-trained on. Can be left empty for base model inference. | None |
--guidance | A control mechanism that determines how closely the model follows the specified conditions (prompt) during generation. We recommend starting with a guidance of 7 and increasing it later if necessary. | 7 |
--sampler | Sampling method used for generation. Only the RES sampler from this paper is supported. | RES |
--video_save_path | Location to save generated videos. | Cosmos-1.0-Diffusion-7B-Video2World.mp4 |
--fps | Frames per second of the generated video. Cosmos Diffusion models generate videos at 24 FPS by default. | 24 |
--height | Height of the generated video. Set to 704 pixels by default, the largest supported video height for Cosmos Diffusion. | 704 |
--width | Width of the generated video. Set to 1280 pixels by default, the largest supported video width for Cosmos Diffusion. | 1280 |
--seed | Random seed for generating the initial noise sample. Changing it produces a different video for the same prompt; keep it fixed for deterministic generations. | 1 |
--num_devices | [1–8] Number of GPUs to use in parallel for inference. | 8 |
--cp_size | [1–8] Number of context parallel ranks to spawn for parallelized inference. Must be equal to num_devices. | 8 |
Generated videos are saved at the location specified by the --video_save_path parameter.
Tip: For faster inference, you can remove the --enable_prompt_upsampler parameter, but this may degrade the quality of the generated result.
Disclaimer: The post-training example in this documentation is a demonstration of general post-training and not a guaranteed recipe for success. Post-training outcomes depend heavily on the quality and diversity of the dataset. To achieve good results, ensure your dataset is clean, well-structured, diverse, and properly labeled. Poorly prepared data can lead to issues like overfitting, bias, or poor performance. Carefully curate your dataset to reflect the desired use case for reliable results.