benchmarks
#1204
Hi,

I have seen your benchmarks at https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html and tried to reproduce them inside WSL2. With Qwen2.5-1.5B-Instruct and Transformers I get a similar speed, but with vLLM I only get around 56 tok/s, and through the vLLM API (API v0 or v1, inside WSL) I get a far worse result of about 5 tok/s.

Could you please detail the setup?
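For reference, a minimal sketch of one way to measure decode throughput with vLLM's offline API; the model name, prompt, and max_tokens are illustrative placeholders, and the official benchmark page uses its own script, so numbers will not match it exactly.

```python
# Minimal throughput sketch using vLLM's offline API (illustrative values).
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
# ignore_eos=True forces a fixed-length decode so tok/s is comparable.
params = SamplingParams(temperature=0.0, max_tokens=512, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(["Write a short essay about large language models."], params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```

A single prompt measures latency-bound decode speed; batching many prompts would instead measure aggregate throughput, which is a different number.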
-

What's your setup, e.g. GPU? Regarding "vllm api -api v0 or v1": I'm not sure what this means. As for the benchmark setup, it's listed at the beginning of the page you referenced; not much can be added, except that the OS is a Linux distro.
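If "v0 or v1" refers to vLLM's V0/V1 engine, recent vLLM releases switch between the two via the VLLM_USE_V1 environment variable; a sketch under that assumption:

```python
# Assumption: "v0 or v1" means vLLM's V0/V1 engine, which recent vLLM
# releases select through the VLLM_USE_V1 environment variable.
import os

os.environ["VLLM_USE_V1"] = "0"  # "0" = V0 engine, "1" = V1 engine

from vllm import LLM  # set the variable before vllm is imported

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
```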
-
I have an RTX 4060 GPU and an Intel Iris Xe iGPU, with a Core i5-13500H CPU. I am not sure, but I think inference runs on the CPU; I don't really know how to verify that. My vLLM code is: "vllm api -api v0 or v1".
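One generic way to check whether the GPU is visible at all from inside WSL2 (a sketch, not specific to this setup):

```python
# Quick check that CUDA is visible to PyTorch inside WSL2.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```

Watching nvidia-smi in a second terminal while a generation runs is another quick check: if GPU utilization and memory stay near zero, inference is on the CPU. As far as I know, vLLM's default CUDA build refuses to start without a visible GPU, whereas Transformers silently runs on CPU unless the model is moved with .to("cuda") or a device_map.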