From c71743c57e8e232049e1215992b20c1a8aea15b2 Mon Sep 17 00:00:00 2001
From: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Date: Wed, 12 Feb 2025 16:55:33 +0800
Subject: [PATCH] add vllm-ascend usage doc & fix doc format
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
---
docs/index.md | 13 +-
docs/installation.md | 32 ++--
docs/quick_start.md | 19 ++-
docs/usage/feature_support.md | 34 ++--
docs/usage/running_vllm_with_ascend.md | 208 ++++++++++++++++++++++++-
docs/usage/supported_models.md | 44 +++---
6 files changed, 281 insertions(+), 69 deletions(-)
diff --git a/docs/index.md b/docs/index.md
index 860501b3..d013e6eb 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,15 +1,16 @@
# Ascend plugin for vLLM
+
vLLM Ascend plugin (vllm-ascend) is a community maintained hardware plugin for running vLLM on the Ascend NPU.
-This plugin is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM.
+This plugin is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM.
By using vLLM Ascend plugin, popular open-source models, including Transformer-like, Mixture-of-Expert, Embedding, Multi-modal LLMs can run seamlessly on the Ascend NPU.
## Contents
-- [Quick Start](./quick_start.md)
-- [Installation](./installation.md)
+- [Quick Start](./quick_start.md)
+- [Installation](./installation.md)
- Usage
- - [Running vLLM with Ascend](./usage/running_vllm_with_ascend.md)
- - [Feature Support](./usage/feature_support.md)
- - [Supported Models](./usage/supported_models.md)
+ - [Running vLLM with Ascend](./usage/running_vllm_with_ascend.md)
+ - [Feature Support](./usage/feature_support.md)
+ - [Supported Models](./usage/supported_models.md)
diff --git a/docs/installation.md b/docs/installation.md
index d2646d52..f7f1ef8f 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -1,20 +1,21 @@
# Installation
-### 1. Dependencies
-| Requirement | Supported version | Recommended version | Note |
-| ------------ | ------- | ----------- | ----------- |
-| Python | >= 3.9 | [3.10](https://www.python.org/downloads/) | Required for vllm |
-| CANN | >= 8.0.RC2 | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1) | Required for vllm-ascend and torch-npu |
-| torch-npu | >= 2.4.0 | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1) | Required for vllm-ascend |
-| torch | >= 2.4.0 | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1) | Required for torch-npu and vllm required |
+## 1. Dependencies
-### 2. Prepare Ascend NPU environment
+| Requirement | Supported version | Recommended version | Note |
+| ----------- | ----------------- | ---------------------------------------------------------------------------------------------------- | ---------------------------------------- |
+| Python | >= 3.9 | [3.10](https://www.python.org/downloads/) | Required for vllm |
+| CANN | >= 8.0.RC2 | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1) | Required for vllm-ascend and torch-npu |
+| torch-npu | >= 2.4.0 | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1) | Required for vllm-ascend |
+| torch       | >= 2.4.0          | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1)                                        | Required for torch-npu and vllm           |
+
+## 2. Prepare Ascend NPU environment
Below is a quick note to install recommended version software:
-#### Containerized installation
+### Containerized installation
-You can use the [container image](https://hub.docker.com/r/ascendai/cann) directly with one line command:
+You can use the [container image](https://hub.docker.com/r/ascendai/cann) directly with a one-line command:
```bash
docker run \
@@ -33,13 +34,13 @@ docker run \
You do not need to install `torch` and `torch_npu` manually, they will be automatically installed as `vllm-ascend` dependencies.
-#### Manual installation
+### Manual installation
-Or follow the instructions provided in the [Ascend Installation Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) to set up the environment.
+Alternatively, follow the instructions provided in the [Ascend Installation Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) to set up the environment.
-### 3. Building
+## 3. Building
-#### Build Python package from source
+### Build Python package from source
```bash
git clone https://github.com/vllm-project/vllm-ascend.git
@@ -47,7 +48,8 @@ cd vllm-ascend
pip install -e .
```
-#### Build container image from source
+### Build container image from source
+
```bash
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
diff --git a/docs/quick_start.md b/docs/quick_start.md
index 548eb5ac..44c5cc82 100644
--- a/docs/quick_start.md
+++ b/docs/quick_start.md
@@ -1,17 +1,20 @@
# Quick Start
## Prerequisites
+
### Support Devices
+
- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)
### Dependencies
-| Requirement | Supported version | Recommended version | Note |
-|-------------|-------------------| ----------- |------------------------------------------|
-| vLLM | main | main | Required for vllm-ascend |
-| Python | >= 3.9 | [3.10](https://www.python.org/downloads/) | Required for vllm |
-| CANN | >= 8.0.RC2 | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1) | Required for vllm-ascend and torch-npu |
-| torch-npu | >= 2.4.0 | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1) | Required for vllm-ascend |
-| torch | >= 2.4.0 | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1) | Required for torch-npu and vllm |
-Find more about how to setup your environment in [here](docs/environment.md).
\ No newline at end of file
+| Requirement | Supported version | Recommended version | Note |
+| ----------- | ----------------- | ---------------------------------------------------------------------------------------------------- | -------------------------------------- |
+| vLLM | main | main | Required for vllm-ascend |
+| Python | >= 3.9 | [3.10](https://www.python.org/downloads/) | Required for vllm |
+| CANN | >= 8.0.RC2 | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1) | Required for vllm-ascend and torch-npu |
+| torch-npu | >= 2.4.0 | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1) | Required for vllm-ascend |
+| torch | >= 2.4.0 | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1) | Required for torch-npu and vllm |
+
+Find more about how to set up your environment [here](docs/environment.md).
diff --git a/docs/usage/feature_support.md b/docs/usage/feature_support.md
index b13bbb2d..bc23e394 100644
--- a/docs/usage/feature_support.md
+++ b/docs/usage/feature_support.md
@@ -1,19 +1,19 @@
# Feature Support
-| Feature | Supported | Note |
-|---------|-----------|------|
-| Chunked Prefill | ✗ | Plan in 2025 Q1 |
-| Automatic Prefix Caching | ✅ | Improve performance in 2025 Q1 |
-| LoRA | ✗ | Plan in 2025 Q1 |
-| Prompt adapter | ✅ ||
-| Speculative decoding | ✅ | Improve accuracy in 2025 Q1|
-| Pooling | ✗ | Plan in 2025 Q1 |
-| Enc-dec | ✗ | Plan in 2025 Q1 |
-| Multi Modality | ✅ (LLaVA/Qwen2-vl/Qwen2-audio/internVL)| Add more model support in 2025 Q1 |
-| LogProbs | ✅ ||
-| Prompt logProbs | ✅ ||
-| Async output | ✅ ||
-| Multi step scheduler | ✅ ||
-| Best of | ✅ ||
-| Beam search | ✅ ||
-| Guided Decoding | ✗ | Plan in 2025 Q1 |
+| Feature | Supported | Note |
+| ------------------------ | --------------------------------------- | --------------------------------- |
+| Chunked Prefill | ✗ | Plan in 2025 Q1 |
+| Automatic Prefix Caching | ✅ | Improve performance in 2025 Q1 |
+| LoRA | ✗ | Plan in 2025 Q1 |
+| Prompt adapter | ✅ | |
+| Speculative decoding | ✅ | Improve accuracy in 2025 Q1 |
+| Pooling | ✗ | Plan in 2025 Q1 |
+| Enc-dec | ✗ | Plan in 2025 Q1 |
+| Multi Modality           | ✅ (LLaVA/Qwen2-VL/Qwen2-Audio/InternVL) | Add more model support in 2025 Q1 |
+| LogProbs | ✅ | |
+| Prompt logProbs | ✅ | |
+| Async output | ✅ | |
+| Multi step scheduler | ✅ | |
+| Best of | ✅ | |
+| Beam search | ✅ | |
+| Guided Decoding | ✗ | Plan in 2025 Q1 |
diff --git a/docs/usage/running_vllm_with_ascend.md b/docs/usage/running_vllm_with_ascend.md
index 03de8dd5..50f4d92a 100644
--- a/docs/usage/running_vllm_with_ascend.md
+++ b/docs/usage/running_vllm_with_ascend.md
@@ -1 +1,207 @@
-# Running vLLM with Ascend
\ No newline at end of file
+# Running vLLM with Ascend
+
+## Preparation
+
+### Check CANN Environment
+
+Check your CANN environment:
+
+```bash
+cd /usr/local/Ascend/ascend-toolkit/latest/<arch>-linux  # <arch>: aarch64 or x86_64
+cat ascend_toolkit_install.info
+```
+
+The CANN version should be >= `8.0.RC2`, for example:
+
+```bash
+package_name=Ascend-cann-toolkit
+version=8.0.RC3
+```
+
+### Check NPU Device
+
+Check your available NPU chips:
+
+```bash
+npu-smi info
+```
+
+### Download Model
+
+Install modelscope:
+
+```bash
+pip install modelscope
+```
+
+Download the model with the ModelScope Python SDK:
+
+```python
+# /root/models/model_download.py
+from modelscope import snapshot_download
+
+model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct', cache_dir='/root/models')
+```
+
+Start downloading:
+
+```bash
+python /root/models/model_download.py
+```
+
+To load models from ModelScope instead of the Hugging Face Hub, set the following environment variable:
+
+```bash
+export VLLM_USE_MODELSCOPE=True
+```
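+
+If you prefer to set this from Python instead (for example, at the top of the offline-inference script shown later in this guide), here is a minimal sketch; the variable just needs to be set before vLLM reads it:
+
+```python
+import os
+
+# Equivalent to `export VLLM_USE_MODELSCOPE=True`; set it before vLLM reads it.
+os.environ["VLLM_USE_MODELSCOPE"] = "True"
+
+from vllm import LLM
+
+# The model ID is now resolved via ModelScope rather than the Hugging Face Hub.
+llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
+```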
+
+## Offline Inference
+
+### Install vllm and vllm-ascend
+
+Install vllm and vllm-ascend directly with pip:
+
+```bash
+pip install vllm vllm-ascend
+```
+
+### Offline Inference on a Single NPU
+
+Run the following script to execute offline inference on a single NPU:
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+ "Hello, my name is",
+ "The president of the United States is",
+ "The capital of France is",
+ "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
+
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+ prompt = output.prompt
+ generated_text = output.outputs[0].text
+ print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+> [!TIP]
+> You can use your local models for offline inference by replacing the value of `model` in `LLM()` with `path/to/model`, e.g. `/root/models/Qwen/Qwen2.5-7B-Instruct`.
+
+> [!NOTE]
+>
+> - `temperature`: Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random.
+> - `top_p`: Float that controls the cumulative probability of the top tokens to consider.
+
+You can find more information about the sampling parameters [here](https://docs.vllm.ai/en/stable/api/inference_params.html#sampling-params).
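+
+For illustration, here is a minimal sketch of a `SamplingParams` object with a few more commonly used options; the values below are arbitrary and only meant as a starting point:
+
+```python
+from vllm import SamplingParams
+
+sampling_params = SamplingParams(
+    temperature=0.8,  # higher values -> more random output
+    top_p=0.95,       # nucleus sampling: keep tokens within this cumulative probability
+    top_k=40,         # consider only the 40 most likely tokens at each step
+    max_tokens=128,   # maximum number of tokens to generate per prompt
+    stop=["\n\n"],    # stop generation when this string is produced
+)
+```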
+
+If the script runs successfully, you should see output similar to the following:
+
+```bash
+Processed prompts: 100%|███████████████████████| 4/4 [00:00<00:00, 4.10it/s, est. speed input: 22.56 toks/s, output: 65.62 toks/s]
+Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
+Prompt: 'The president of the United States is', Generated text: ' Statesman A, and the vice president is Statesman B. If they are'
+Prompt: 'The capital of France is', Generated text: ' the city of Paris. This is a fact that can be found in any geography'
+Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
+```
+
+## Online Serving
+
+### Run Docker Container
+
+Build your Docker image using the `Dockerfile` in the `vllm-ascend` repository (run the command from the repository root):
+
+```bash
+docker build -t vllm-ascend:1.0 .
+```
+
+> [!NOTE]
+> `.` is the build context, i.e. the directory that contains your Dockerfile.
+
+Launch your container:
+
+```bash
+docker run \
+ --name vllm-ascend \
+ --device /dev/davinci0 \
+ --device /dev/davinci_manager \
+ --device /dev/devmm_svm \
+ --device /dev/hisi_hdc \
+ -v /usr/local/dcmi:/usr/local/dcmi \
+ -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+ -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+ -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+ -v /etc/ascend_install.info:/etc/ascend_install.info \
+ -v /root/models:/root/models \
+    -it vllm-ascend:1.0 bash
+```
+
+> [!TIP]
+> To use a local model, mount your model directory into the container, e.g. `-v /root/models:/root/models`.
+
+> [!NOTE]
+> You can replace `davinci0` with any of `davinci0` ~ `davinci7` to select a different NPU. Find more information about your devices with `npu-smi info`.
+
+### Online Serving on a Single NPU
+
+vLLM can be deployed as a server that implements the OpenAI API protocol, allowing it to serve as a drop-in replacement for applications that use the OpenAI API. By default, the server listens at `http://localhost:8000`; you can change the address with the `--host` and `--port` arguments.
+
+Run the following command to start the vLLM server on a single NPU:
+
+```bash
+vllm serve Qwen/Qwen2.5-7B-Instruct
+```
+
+Once your server is started, you can query the model with input prompts:
+
+```bash
+curl http://localhost:8000/v1/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "Qwen/Qwen2.5-7B-Instruct",
+ "prompt": "San Francisco is a",
+ "max_tokens": 7,
+ "temperature": 0
+ }'
+```
+
+> [!TIP]
+> You can use your local models when launching the vLLM server or querying the model by replacing the value of `model` with `path/to/model`, e.g. `/root/models/Qwen/Qwen2.5-7B-Instruct`.
+
+If the query succeeds, you should see a response similar to the following:
+
+```bash
+...
+```
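+
+Because the server implements the OpenAI API protocol, you can also query it with the official OpenAI Python client. A minimal sketch, assuming the `openai` package is installed (`pip install openai`):
+
+```python
+from openai import OpenAI
+
+# The local vLLM server does not check the API key, but the client requires one.
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+completion = client.completions.create(
+    model="Qwen/Qwen2.5-7B-Instruct",
+    prompt="San Francisco is a",
+    max_tokens=7,
+    temperature=0,
+)
+print(completion.choices[0].text)
+```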
+
+## Distributed Inference
+
+vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. To run multi-NPU inference with the `LLM` class, set the `tensor_parallel_size` argument to the number of NPUs you want to use. For example, to run inference with 4 NPUs:
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+ "Hello, my name is",
+ "The president of the United States is",
+ "The capital of France is",
+ "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=4)
+
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+ prompt = output.prompt
+ generated_text = output.outputs[0].text
+ print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+If the script runs successfully, you should see output similar to the following:
+
+```bash
+...
+```
diff --git a/docs/usage/supported_models.md b/docs/usage/supported_models.md
index edf3df6c..49e0d650 100644
--- a/docs/usage/supported_models.md
+++ b/docs/usage/supported_models.md
@@ -1,24 +1,24 @@
# Supported Models
-| Model | Supported | Note |
-|---------|-----------|------|
-| Qwen 2.5 | ✅ ||
-| Mistral | | Need test |
-| DeepSeek v2.5 | |Need test |
-| LLama3.1/3.2 | ✅ ||
-| Gemma-2 | |Need test|
-| baichuan | |Need test|
-| minicpm | |Need test|
-| internlm | ✅ ||
-| ChatGLM | ✅ ||
-| InternVL 2.5 | ✅ ||
-| Qwen2-VL | ✅ ||
-| GLM-4v | |Need test|
-| Molomo | ✅ ||
-| LLaVA 1.5 | ✅ ||
-| Mllama | |Need test|
-| LLaVA-Next | |Need test|
-| LLaVA-Next-Video | |Need test|
-| Phi-3-Vison/Phi-3.5-Vison | |Need test|
-| Ultravox | |Need test|
-| Qwen2-Audio | ✅ ||
+| Model                       | Supported | Note      |
+| --------------------------- | --------- | --------- |
+| Qwen 2.5                    | ✅        |           |
+| Mistral                     |           | Need test |
+| DeepSeek v2.5               |           | Need test |
+| Llama 3.1/3.2               | ✅        |           |
+| Gemma-2                     |           | Need test |
+| Baichuan                    |           | Need test |
+| MiniCPM                     |           | Need test |
+| InternLM                    | ✅        |           |
+| ChatGLM                     | ✅        |           |
+| InternVL 2.5                | ✅        |           |
+| Qwen2-VL                    | ✅        |           |
+| GLM-4V                      |           | Need test |
+| Molmo                       | ✅        |           |
+| LLaVA 1.5                   | ✅        |           |
+| Mllama                      |           | Need test |
+| LLaVA-Next                  |           | Need test |
+| LLaVA-Next-Video            |           | Need test |
+| Phi-3-Vision/Phi-3.5-Vision |           | Need test |
+| Ultravox                    |           | Need test |
+| Qwen2-Audio                 | ✅        |           |