A simple solution for benchmarking vLLM, SGLang, and TensorRT-LLM on Modal with guidellm. ⏱️
First, install the dependencies:
pip install -r requirements.txt
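If this is your first time using Modal, you will also need to authenticate with your Modal account (this assumes you already have one):
modal setup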
To run a single benchmark, you can use the run-benchmark
command, which will save your results to a local file.
For example, to run a synchronous-rate benchmark with vLLM:
MODEL=Qwen/Qwen2.5-Coder-7B-Instruct
OUTPUT_PATH=results.json
modal run -w $OUTPUT_PATH cli.py::run_benchmark --model $MODEL --llm-server-type vllm
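The results file can be inspected with standard tooling once the run finishes; for example, to pretty-print the saved results and see what was recorded (assuming the output is JSON, as the .json extension suggests):
# Pretty-print the saved results
python -m json.tool results.json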
Or, to run a fixed-rate multi-GPU benchmark with SGLang:
GPU_COUNT=4
MODEL=meta-llama/Llama-3.3-70B-Instruct
REQUESTS_PER_SECOND=5
modal run -w $OUTPUT_PATH cli.py::run_benchmark --gpu "H100:$GPU_COUNT" --model $MODEL --llm-server-type sglang --rate-type constant --rate $REQUESTS_PER_SECOND --llm-server-config "{\"extra_args\": [\"--tp-size\", \"$GPU_COUNT\"]}"
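The escaped quotes in --llm-server-config can be hard to read. One way to keep the command legible, sketched below with the same flags as above, is to build the JSON in a shell variable first:
# Build the server config JSON once, then pass it without inline escaping
LLM_SERVER_CONFIG=$(printf '{"extra_args": ["--tp-size", "%s"]}' "$GPU_COUNT")
modal run -w $OUTPUT_PATH cli.py::run_benchmark --gpu "H100:$GPU_COUNT" --model $MODEL --llm-server-type sglang --rate-type constant --rate $REQUESTS_PER_SECOND --llm-server-config "$LLM_SERVER_CONFIG"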
Or, to run a throughput test with TensorRT-LLM:
modal run -w $OUTPUT_PATH cli.py::run_benchmark --model $MODEL --llm-server-type tensorrt-llm --rate-type throughput
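To compare all three engines on the same workload, you can run the same throughput test once per server type and write each result to its own file. A minimal shell sketch (the loop and output file names are just illustrative; the flags are the same ones used above):
for SERVER in vllm sglang tensorrt-llm; do
  # One throughput run per inference server, each with its own results file
  modal run -w "results-$SERVER.json" cli.py::run_benchmark --model $MODEL --llm-server-type "$SERVER" --rate-type throughput
done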
To run multiple benchmarks at once, first deploy the project:
modal deploy -m stopwatch
Then, use the run-benchmark-suite command, along with a benchmark suite configuration file, to call the deployed functions remotely:
python cli.py run-benchmark-suite configs/data-distributions.yaml
Once the suite has finished, you will be prompted to open a link to a Datasette UI with your results.
To profile vLLM with the PyTorch profiler, use the following command:
python cli.py run-profiler --model meta-llama/Llama-3.1-8B-Instruct --num-requests 10
Once profiling is done, you will be prompted to download the generated trace and reveal it in Finder. Keep in mind that generated traces can get very large, so it is recommended to send only a few requests while profiling. Traces can then be visualized at https://ui.perfetto.dev.
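If a trace is still too large to upload comfortably, compressing it usually helps, and the Perfetto UI can generally open gzip-compressed traces directly (the file name below is just a placeholder for whatever trace you downloaded):
# Keep the original trace and write a compressed .gz copy next to it
gzip -k my-trace.json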
Stopwatch is available under the MIT license. See the LICENSE file for more details.