From c71743c57e8e232049e1215992b20c1a8aea15b2 Mon Sep 17 00:00:00 2001 From: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Date: Wed, 12 Feb 2025 16:55:33 +0800 Subject: [PATCH] add vllm-ascend usage doc & fix doc format Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> --- docs/index.md | 13 +- docs/installation.md | 32 ++-- docs/quick_start.md | 19 ++- docs/usage/feature_support.md | 34 ++-- docs/usage/running_vllm_with_ascend.md | 208 ++++++++++++++++++++++++- docs/usage/supported_models.md | 44 +++--- 6 files changed, 281 insertions(+), 69 deletions(-) diff --git a/docs/index.md b/docs/index.md index 860501b3..d013e6eb 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,15 +1,16 @@ # Ascend plugin for vLLM + vLLM Ascend plugin (vllm-ascend) is a community maintained hardware plugin for running vLLM on the Ascend NPU. -This plugin is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM. +This plugin is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM. By using vLLM Ascend plugin, popular open-source models, including Transformer-like, Mixture-of-Expert, Embedding, Multi-modal LLMs can run seamlessly on the Ascend NPU. ## Contents -- [Quick Start](./quick_start.md) -- [Installation](./installation.md) +- [Quick Start](./quick_start.md) +- [Installation](./installation.md) - Usage - - [Running vLLM with Ascend](./usage/running_vllm_with_ascend.md) - - [Feature Support](./usage/feature_support.md) - - [Supported Models](./usage/supported_models.md) + - [Running vLLM with Ascend](./usage/running_vllm_with_ascend.md) + - [Feature Support](./usage/feature_support.md) + - [Supported Models](./usage/supported_models.md) diff --git a/docs/installation.md b/docs/installation.md index d2646d52..f7f1ef8f 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -1,20 +1,21 @@ # Installation -### 1. Dependencies -| Requirement | Supported version | Recommended version | Note | -| ------------ | ------- | ----------- | ----------- | -| Python | >= 3.9 | [3.10](https://www.python.org/downloads/) | Required for vllm | -| CANN | >= 8.0.RC2 | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1) | Required for vllm-ascend and torch-npu | -| torch-npu | >= 2.4.0 | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1) | Required for vllm-ascend | -| torch | >= 2.4.0 | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1) | Required for torch-npu and vllm required | +## 1. Dependencies -### 2. 
Prepare Ascend NPU environment

+| Requirement | Supported version | Recommended version                                                                                   | Note                                   |
+| ----------- | ----------------- | ------------------------------------------------------------------------------------------------------ | -------------------------------------- |
+| Python      | >= 3.9            | [3.10](https://www.python.org/downloads/)                                                               | Required for vllm                      |
+| CANN        | >= 8.0.RC2        | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1)    | Required for vllm-ascend and torch-npu |
+| torch-npu   | >= 2.4.0          | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1)                  | Required for vllm-ascend               |
+| torch       | >= 2.4.0          | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1)                                          | Required for torch-npu and vllm        |
+
+## 2. Prepare Ascend NPU environment

Below is a quick note to install recommended version software:

-#### Containerized installation
+### Containerized installation

-You can use the [container image](https://hub.docker.com/r/ascendai/cann) directly with one line command: 
+You can use the [container image](https://hub.docker.com/r/ascendai/cann) directly with a one-line command:

```bash
docker run \
@@ -33,13 +34,13 @@ docker run \

You do not need to install `torch` and `torch_npu` manually, they will be automatically installed as `vllm-ascend` dependencies.

-#### Manual installation
+### Manual installation

-Or follow the instructions provided in the [Ascend Installation Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) to set up the environment. 
+Or follow the instructions provided in the [Ascend Installation Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) to set up the environment.

-### 3. Building
+## 3. Building

-#### Build Python package from source
+### Build Python package from source

```bash
git clone https://github.com/vllm-project/vllm-ascend.git
@@ -47,7 +48,8 @@ cd vllm-ascend
pip install -e .
```

-#### Build container image from source
+### Build container image from source
+
```bash
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
diff --git a/docs/quick_start.md b/docs/quick_start.md
index 548eb5ac..44c5cc82 100644
--- a/docs/quick_start.md
+++ b/docs/quick_start.md
@@ -1,17 +1,20 @@
# Quick Start

## Prerequisites
+
### Support Devices
+
- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)

### Dependencies
-| Requirement | Supported version | Recommended version | Note |
-|-------------|-------------------| ----------- |------------------------------------------|
-| vLLM | main | main | Required for vllm-ascend |
-| Python | >= 3.9 | [3.10](https://www.python.org/downloads/) | Required for vllm |
-| CANN | >= 8.0.RC2 | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1) | Required for vllm-ascend and torch-npu |
-| torch-npu | >= 2.4.0 | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1) | Required for vllm-ascend |
-| torch | >= 2.4.0 | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1) | Required for torch-npu and vllm |
-Find more about how to setup your environment in [here](docs/environment.md).
\ No newline at end of file
+| Requirement | Supported version | Recommended version                                                                                   | Note                                   |
+| ----------- | ----------------- | ------------------------------------------------------------------------------------------------------ | -------------------------------------- |
+| vLLM        | main              | main                                                                                                     | Required for vllm-ascend               |
+| Python      | >= 3.9            | [3.10](https://www.python.org/downloads/)                                                                | Required for vllm                      |
+| CANN        | >= 8.0.RC2        | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1)    | Required for vllm-ascend and torch-npu |
+| torch-npu   | >= 2.4.0          | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1)                  | Required for vllm-ascend               |
+| torch       | >= 2.4.0          | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1)                                          | Required for torch-npu and vllm        |
+
+Find out more about how to set up your environment [here](docs/environment.md).
diff --git a/docs/usage/feature_support.md b/docs/usage/feature_support.md
index b13bbb2d..bc23e394 100644
--- a/docs/usage/feature_support.md
+++ b/docs/usage/feature_support.md
@@ -1,19 +1,19 @@
# Feature Support
-| Feature | Supported | Note |
-|---------|-----------|------|
-| Chunked Prefill | ✗ | Plan in 2025 Q1 |
-| Automatic Prefix Caching | ✅ | Improve performance in 2025 Q1 |
-| LoRA | ✗ | Plan in 2025 Q1 |
-| Prompt adapter | ✅ ||
-| Speculative decoding | ✅ | Improve accuracy in 2025 Q1|
-| Pooling | ✗ | Plan in 2025 Q1 |
-| Enc-dec | ✗ | Plan in 2025 Q1 |
-| Multi Modality | ✅ (LLaVA/Qwen2-vl/Qwen2-audio/internVL)| Add more model support in 2025 Q1 |
-| LogProbs | ✅ ||
-| Prompt logProbs | ✅ ||
-| Async output | ✅ ||
-| Multi step scheduler | ✅ ||
-| Best of | ✅ ||
-| Beam search | ✅ ||
-| Guided Decoding | ✗ | Plan in 2025 Q1 |
+
+| Feature                  | Supported                                | Note                              |
+| ------------------------ | ---------------------------------------- | --------------------------------- |
+| Chunked Prefill          | ✗                                        | Plan in 2025 Q1                   |
+| Automatic Prefix Caching | ✅                                        | Improve performance in 2025 Q1    |
+| LoRA                     | ✗                                        | Plan in 2025 Q1                   |
+| Prompt adapter           | ✅                                        |                                   |
+| Speculative decoding     | ✅                                        | Improve accuracy in 2025 Q1       |
+| Pooling                  | ✗                                        | Plan in 2025 Q1                   |
+| Enc-dec                  | ✗                                        | Plan in 2025 Q1                   |
+| Multi Modality           | ✅ (LLaVA/Qwen2-VL/Qwen2-Audio/InternVL) | Add more model support in 2025 Q1 |
+| LogProbs                 | ✅                                        |                                   |
+| Prompt logProbs          | ✅                                        |                                   |
+| Async output             | ✅                                        |                                   |
+| Multi step scheduler     | ✅                                        |                                   |
+| Best of                  | ✅                                        |                                   |
+| Beam search              | ✅                                        |                                   |
+| Guided Decoding          | ✗                                        | Plan in 2025 Q1                   |
diff --git a/docs/usage/running_vllm_with_ascend.md b/docs/usage/running_vllm_with_ascend.md
index 03de8dd5..50f4d92a 100644
--- a/docs/usage/running_vllm_with_ascend.md
+++ b/docs/usage/running_vllm_with_ascend.md
@@ -1 +1,207 @@
-# Running vLLM with Ascend
\ No newline at end of file
+# Running vLLM with Ascend
+
+## Preparation
+
+### Check CANN Environment
+
+Check your CANN environment:
+
+```bash
+cd /usr/local/Ascend/ascend-toolkit/latest/<arch>-linux  # <arch>: aarch64 or x86_64
+cat ascend_toolkit_install.info
+```
+
+The CANN version should be >= `8.0.RC2`, for example:
+
+```bash
+package_name=Ascend-cann-toolkit
+version=8.0.RC3
+```
+
+### Check NPU Device
+
+Check your available NPU chips:
+
+```bash
+npu-smi info
+```
+
+### Download Model
+
+Install `modelscope`:
+
+```bash
+pip install modelscope
+```
+
+Download the model with the ModelScope Python SDK:
+
+```python
+# /root/models/model_download.py
+from modelscope import snapshot_download
+
+model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct', cache_dir='/root/models')
+```
+
+Start the download:
+
+```bash
+python /root/models/model_download.py
+```
+
+To make vLLM load models from ModelScope instead of the HuggingFace Hub, set the following environment variable:
+
+```bash
+export VLLM_USE_MODELSCOPE=True
+```
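+
+If you would rather pull the weights from the HuggingFace Hub, a minimal sketch using the `huggingface_hub` Python SDK is shown below. This is not part of the steps above: it assumes `huggingface_hub` has been installed with pip, and the script path and target directory are only examples.
+
+```python
+# /root/models/hf_model_download.py -- example helper, mirroring the ModelScope script above
+from huggingface_hub import snapshot_download
+
+# Download the weights into a local directory that can later be passed to LLM(model=...)
+model_dir = snapshot_download(
+    repo_id="Qwen/Qwen2.5-7B-Instruct",
+    local_dir="/root/models/Qwen/Qwen2.5-7B-Instruct",
+)
+```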
+
+## Offline Inference
+
+### Install vllm and vllm-ascend
+
+Install `vllm` and `vllm-ascend` directly with `pip`:
+
+```bash
+pip install vllm vllm-ascend
+```
+
+### Offline Inference on a Single NPU
+
+Run the following script to execute offline inference on a single NPU:
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
+
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+> [!TIP]
+> You can use your local models for offline inference by replacing the value of `model` in `LLM()` with the model path, e.g. `/root/models/Qwen/Qwen2.5-7B-Instruct`.
+
+> [!NOTE]
+>
+> - `temperature`: Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random.
+> - `top_p`: Float that controls the cumulative probability of the top tokens to consider.
+
+You can find more information about the sampling parameters [here](https://docs.vllm.ai/en/stable/api/inference_params.html#sampling-params).
+
+If the script runs successfully, you will see output like the following:
+
+```bash
+Processed prompts: 100%|███████████████████████| 4/4 [00:00<00:00, 4.10it/s, est. speed input: 22.56 toks/s, output: 65.62 toks/s]
+Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
+Prompt: 'The president of the United States is', Generated text: ' Statesman A, and the vice president is Statesman B. If they are'
+Prompt: 'The capital of France is', Generated text: ' the city of Paris. This is a fact that can be found in any geography'
+Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
+```
+
+## Online Serving
+
+### Run Docker Container
+
+Build your Docker image using `vllm-ascend/Dockerfile`:
+
+```bash
+docker build -t vllm-ascend:1.0 .
+```
+
+> [!NOTE]
+> `.` is the directory that contains your Dockerfile.
+
+Launch your container:
+
+```bash
+docker run \
+    --name vllm-ascend \
+    --device /dev/davinci0 \
+    --device /dev/davinci_manager \
+    --device /dev/devmm_svm \
+    --device /dev/hisi_hdc \
+    -v /usr/local/dcmi:/usr/local/dcmi \
+    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+    -v /etc/ascend_install.info:/etc/ascend_install.info \
+    -v /root/models:/root/models \
+    -it vllm-ascend:1.0 bash
+```
+
+> [!TIP]
+> To use your local model, mount your model directory into the container, e.g. `-v /root/models:/root/models`.
+
+> [!NOTE]
+> You can replace `davinci0` with any of `davinci0` ~ `davinci7` to use a different NPU. Find more info about your devices with `npu-smi info`.
+
+### Online Serving on a Single NPU
+
+vLLM can be deployed as a server that implements the OpenAI API protocol, which allows it to be used as a drop-in replacement for applications using the OpenAI API. By default, the server starts at `http://localhost:8000`; you can specify a different address with the `--host` and `--port` arguments.
+
+Run the following command to start the vLLM server on a single NPU:
+
+```bash
+vllm serve Qwen/Qwen2.5-7B-Instruct
+```
+
+Once your server is started, you can query the model with input prompts:
+
+```bash
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen2.5-7B-Instruct",
+        "prompt": "San Francisco is a",
+        "max_tokens": 7,
+        "temperature": 0
+    }'
+```
+
+> [!TIP]
+> You can use your local models when launching the vLLM server or querying the model by replacing the value of `model` with the model path, e.g. `/root/models/Qwen/Qwen2.5-7B-Instruct`.
+
+If you query the server successfully, you will see output like the following:
+
+```bash
+...
+```
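+
+Since the server speaks the OpenAI API, you can also query it with the official `openai` Python client instead of `curl`. The snippet below is only a sketch: it assumes the `openai` package (v1 or later) is installed and that the server is running at the default address; the API key is a placeholder because the local server does not check it.
+
+```python
+from openai import OpenAI
+
+# Point the client at the local vLLM server; the key is unused but required by the client.
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+completion = client.completions.create(
+    model="Qwen/Qwen2.5-7B-Instruct",
+    prompt="San Francisco is a",
+    max_tokens=7,
+    temperature=0,
+)
+print(completion.choices[0].text)
+```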
+
+## Distributed Inference
+
+vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. To run multi-NPU inference with the `LLM` class, set the `tensor_parallel_size` argument to the number of NPUs you want to use.
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=4)
+
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+If you run this script successfully, you will see output like the following:
+
+```bash
+...
+```
diff --git a/docs/usage/supported_models.md b/docs/usage/supported_models.md
index edf3df6c..49e0d650 100644
--- a/docs/usage/supported_models.md
+++ b/docs/usage/supported_models.md
@@ -1,24 +1,24 @@
# Supported Models
-| Model | Supported | Note |
-|---------|-----------|------|
-| Qwen 2.5 | ✅ ||
-| Mistral | | Need test |
-| DeepSeek v2.5 | |Need test |
-| LLama3.1/3.2 | ✅ ||
-| Gemma-2 | |Need test|
-| baichuan | |Need test|
-| minicpm | |Need test|
-| internlm | ✅ ||
-| ChatGLM | ✅ ||
-| InternVL 2.5 | ✅ ||
-| Qwen2-VL | ✅ ||
-| GLM-4v | |Need test|
-| Molomo | ✅ ||
-| LLaVA 1.5 | ✅ ||
-| Mllama | |Need test|
-| LLaVA-Next | |Need test|
-| LLaVA-Next-Video | |Need test|
-| Phi-3-Vison/Phi-3.5-Vison | |Need test|
-| Ultravox | |Need test|
-| Qwen2-Audio | ✅ ||
+
+| Model                       | Supported | Note      |
+| --------------------------- | --------- | --------- |
+| Qwen 2.5                    | ✅         |           |
+| Mistral                     |           | Need test |
+| DeepSeek v2.5               |           | Need test |
+| Llama 3.1/3.2               | ✅         |           |
+| Gemma-2                     |           | Need test |
+| Baichuan                    |           | Need test |
+| MiniCPM                     |           | Need test |
+| InternLM                    | ✅         |           |
+| ChatGLM                     | ✅         |           |
+| InternVL 2.5                | ✅         |           |
+| Qwen2-VL                    | ✅         |           |
+| GLM-4V                      |           | Need test |
+| Molmo                       | ✅         |           |
+| LLaVA 1.5                   | ✅         |           |
+| Mllama                      |           | Need test |
+| LLaVA-Next                  |           | Need test |
+| LLaVA-Next-Video            |           | Need test |
+| Phi-3-Vision/Phi-3.5-Vision |           | Need test |
+| Ultravox                    |           | Need test |
+| Qwen2-Audio                 | ✅         |           |