[Worker]Lazy import torch_npu #184

Open · wants to merge 1 commit into base: main

Conversation

@Potabk (Contributor) commented Feb 27, 2025

What this PR does / why we need it?

To avoid unnecessary delays, we only import torch_npu when profiling is enabled.
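
In effect, the module-level torch_npu import moves into the profiling branch of the worker. A minimal sketch of that pattern follows; it is not the exact diff in this PR, and the enable_profiling flag and bare torch_npu.profiler.profile() call are simplifying assumptions (the real worker takes its profiler settings from the vLLM config):

class NPUWorker:
    """Illustrative sketch only; the real worker lives in vllm_ascend.worker."""

    def __init__(self, enable_profiling: bool = False):
        self.profiler = None
        if enable_profiling:
            # torch_npu is imported lazily, only on the profiling path,
            # so plain inference runs skip the import cost entirely.
            import torch_npu
            # Assumption: torch_npu.profiler mirrors torch.profiler's API.
            self.profiler = torch_npu.profiler.profile()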

Does this PR introduce any user-facing change?

How was this patch tested?

This is my test script:

# SPDX-License-Identifier: Apache-2.0

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

and the result is as follows:

INFO 02-27 01:45:12 [__init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 02-27 01:45:12 [__init__.py:32] name=ascend, value=vllm_ascend:register
INFO 02-27 01:45:12 [__init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 02-27 01:45:12 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 02-27 01:45:12 [__init__.py:44] plugin ascend loaded.
INFO 02-27 01:45:12 [__init__.py:198] Platform plugin ascend is activated
INFO 02-27 01:45:27 [config.py:569] This model supports multiple tasks: {'reward', 'generate', 'embed', 'classify', 'score'}. Defaulting to 'generate'.
INFO 02-27 01:45:27 [llm_engine.py:235] Initializing a V0 LLM engine (v0.7.3.dev245+gcd1f843f) with config: model='/root/wl/cache/modelscope/models/facebook/opt-125m', speculative_config=None, tokenizer='/root/wl/cache/modelscope/models/facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/wl/cache/modelscope/models/facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
INFO 02-27 01:45:28 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 02-27 01:45:28 [utils.py:2298] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.NPUWorker object at 0xfffd04c2aa40>
INFO 02-27 01:45:36 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.09it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.08it/s]

INFO 02-27 01:45:39 [executor_base.py:111] # npu blocks: 98088, # CPU blocks: 7281
INFO 02-27 01:45:39 [executor_base.py:116] Maximum concurrency for 2048 tokens per request: 766.31x
INFO 02-27 01:45:39 [llm_engine.py:441] init engine (profile, create kv cache, warmup model) took 2.03 seconds
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 15.49it/s, est. speed input: 100.76 toks/s, output: 248.00 toks/s]
Prompt: 'Hello, my name is', Generated text: ' is an acronym, the name of the name of the name is a phrase used'
Prompt: 'The president of the United States is', Generated text: ' is the subject of interest to be realized. The subject of interest to be realized'
Prompt: 'The capital of France is', Generated text: ' the star of the night is the name of the night is the name of the'
Prompt: 'The future of AI is', Generated text: ' is the same as a good example of any of the above mentioned.\n\n'

Signed-off-by: wangli <wangli858794774@gmail.com>
@wangxiyuan (Collaborator)

This relates to worker initialization. @noemotiovon @MengqingCao, not sure if this will break the Ray case?

@noemotiovon (Contributor)

I don't think so, but just to be sure, let me test it.

@noemotiovon (Contributor)

I've tested it, and it doesn't affect the Ray backend.
