From c71743c57e8e232049e1215992b20c1a8aea15b2 Mon Sep 17 00:00:00 2001 From: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Date: Wed, 12 Feb 2025 16:55:33 +0800 Subject: [PATCH] add vllm-ascend usage doc & fix doc format Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> --- docs/index.md | 13 +- docs/installation.md | 32 ++-- docs/quick_start.md | 19 ++- docs/usage/feature_support.md | 34 ++-- docs/usage/running_vllm_with_ascend.md | 208 ++++++++++++++++++++++++- docs/usage/supported_models.md | 44 +++--- 6 files changed, 281 insertions(+), 69 deletions(-) diff --git a/docs/index.md b/docs/index.md index 860501b3..d013e6eb 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,15 +1,16 @@ # Ascend plugin for vLLM + vLLM Ascend plugin (vllm-ascend) is a community maintained hardware plugin for running vLLM on the Ascend NPU. -This plugin is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM. +This plugin is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM. By using vLLM Ascend plugin, popular open-source models, including Transformer-like, Mixture-of-Expert, Embedding, Multi-modal LLMs can run seamlessly on the Ascend NPU. ## Contents -- [Quick Start](./quick_start.md) -- [Installation](./installation.md) +- [Quick Start](./quick_start.md) +- [Installation](./installation.md) - Usage - - [Running vLLM with Ascend](./usage/running_vllm_with_ascend.md) - - [Feature Support](./usage/feature_support.md) - - [Supported Models](./usage/supported_models.md) + - [Running vLLM with Ascend](./usage/running_vllm_with_ascend.md) + - [Feature Support](./usage/feature_support.md) + - [Supported Models](./usage/supported_models.md) diff --git a/docs/installation.md b/docs/installation.md index d2646d52..f7f1ef8f 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -1,20 +1,21 @@ # Installation -### 1. Dependencies -| Requirement | Supported version | Recommended version | Note | -| ------------ | ------- | ----------- | ----------- | -| Python | >= 3.9 | [3.10](https://www.python.org/downloads/) | Required for vllm | -| CANN | >= 8.0.RC2 | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1) | Required for vllm-ascend and torch-npu | -| torch-npu | >= 2.4.0 | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1) | Required for vllm-ascend | -| torch | >= 2.4.0 | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1) | Required for torch-npu and vllm required | +## 1. Dependencies -### 2. 
Prepare Ascend NPU environment

+| Requirement | Supported version | Recommended version                                                                                   | Note                                   |
+| ----------- | ----------------- | ------------------------------------------------------------------------------------------------------ | -------------------------------------- |
+| Python      | >= 3.9            | [3.10](https://www.python.org/downloads/)                                                               | Required for vllm                      |
+| CANN        | >= 8.0.RC2        | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1)    | Required for vllm-ascend and torch-npu |
+| torch-npu   | >= 2.4.0          | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1)                  | Required for vllm-ascend               |
+| torch       | >= 2.4.0          | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1)                                          | Required for torch-npu and vllm        |
+
+## 2. Prepare Ascend NPU environment

Below is a quick note to install recommended version software:

-#### Containerized installation
+### Containerized installation

-You can use the [container image](https://hub.docker.com/r/ascendai/cann) directly with one line command: 
+You can use the [container image](https://hub.docker.com/r/ascendai/cann) directly with a one-line command:

```bash
docker run \
@@ -33,13 +34,13 @@ docker run \

You do not need to install `torch` and `torch_npu` manually, they will be automatically installed as `vllm-ascend` dependencies.

-#### Manual installation
+### Manual installation

-Or follow the instructions provided in the [Ascend Installation Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) to set up the environment. 
+Or follow the instructions provided in the [Ascend Installation Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) to set up the environment.

-### 3. Building
+## 3. Building

-#### Build Python package from source
+### Build Python package from source

```bash
git clone https://github.com/vllm-project/vllm-ascend.git
@@ -47,7 +48,8 @@ cd vllm-ascend
pip install -e .
```

-#### Build container image from source
+### Build container image from source
+
```bash
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
diff --git a/docs/quick_start.md b/docs/quick_start.md
index 548eb5ac..44c5cc82 100644
--- a/docs/quick_start.md
+++ b/docs/quick_start.md
@@ -1,17 +1,20 @@
# Quick Start

## Prerequisites
+
### Support Devices
+
- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)

### Dependencies
-| Requirement | Supported version | Recommended version | Note |
-|-------------|-------------------| ----------- |------------------------------------------|
-| vLLM | main | main | Required for vllm-ascend |
-| Python | >= 3.9 | [3.10](https://www.python.org/downloads/) | Required for vllm |
-| CANN | >= 8.0.RC2 | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1) | Required for vllm-ascend and torch-npu |
-| torch-npu | >= 2.4.0 | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1) | Required for vllm-ascend |
-| torch | >= 2.4.0 | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1) | Required for torch-npu and vllm |
-Find more about how to setup your environment in [here](docs/environment.md).
\ No newline at end of file
+| Requirement | Supported version | Recommended version                                                                                   | Note                                   |
+| ----------- | ----------------- | ------------------------------------------------------------------------------------------------------ | -------------------------------------- |
+| vLLM        | main              | main                                                                                                     | Required for vllm-ascend               |
+| Python      | >= 3.9            | [3.10](https://www.python.org/downloads/)                                                                | Required for vllm                      |
+| CANN        | >= 8.0.RC2        | [8.0.RC3](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.0.beta1)    | Required for vllm-ascend and torch-npu |
+| torch-npu   | >= 2.4.0          | [2.5.1rc1](https://gitee.com/ascend/pytorch/releases/tag/v6.0.0.alpha001-pytorch2.5.1)                  | Required for vllm-ascend               |
+| torch       | >= 2.4.0          | [2.5.1](https://github.com/pytorch/pytorch/releases/tag/v2.5.1)                                          | Required for torch-npu and vllm        |
+
+Find out more about how to set up your environment [here](docs/environment.md).
diff --git a/docs/usage/feature_support.md b/docs/usage/feature_support.md
index b13bbb2d..bc23e394 100644
--- a/docs/usage/feature_support.md
+++ b/docs/usage/feature_support.md
@@ -1,19 +1,19 @@
# Feature Support
-| Feature | Supported | Note |
-|---------|-----------|------|
-| Chunked Prefill | ✗ | Plan in 2025 Q1 |
-| Automatic Prefix Caching | ✅ | Improve performance in 2025 Q1 |
-| LoRA | ✗ | Plan in 2025 Q1 |
-| Prompt adapter | ✅ ||
-| Speculative decoding | ✅ | Improve accuracy in 2025 Q1|
-| Pooling | ✗ | Plan in 2025 Q1 |
-| Enc-dec | ✗ | Plan in 2025 Q1 |
-| Multi Modality | ✅ (LLaVA/Qwen2-vl/Qwen2-audio/internVL)| Add more model support in 2025 Q1 |
-| LogProbs | ✅ ||
-| Prompt logProbs | ✅ ||
-| Async output | ✅ ||
-| Multi step scheduler | ✅ ||
-| Best of | ✅ ||
-| Beam search | ✅ ||
-| Guided Decoding | ✗ | Plan in 2025 Q1 |
+
+| Feature                  | Supported                                | Note                              |
+| ------------------------ | ---------------------------------------- | --------------------------------- |
+| Chunked Prefill          | ✗                                        | Plan in 2025 Q1                   |
+| Automatic Prefix Caching | ✅                                        | Improve performance in 2025 Q1    |
+| LoRA                     | ✗                                        | Plan in 2025 Q1                   |
+| Prompt adapter           | ✅                                        |                                   |
+| Speculative decoding     | ✅                                        | Improve accuracy in 2025 Q1       |
+| Pooling                  | ✗                                        | Plan in 2025 Q1                   |
+| Enc-dec                  | ✗                                        | Plan in 2025 Q1                   |
+| Multi Modality           | ✅ (LLaVA/Qwen2-VL/Qwen2-Audio/InternVL) | Add more model support in 2025 Q1 |
+| LogProbs                 | ✅                                        |                                   |
+| Prompt logProbs          | ✅                                        |                                   |
+| Async output             | ✅                                        |                                   |
+| Multi step scheduler     | ✅                                        |                                   |
+| Best of                  | ✅                                        |                                   |
+| Beam search              | ✅                                        |                                   |
+| Guided Decoding          | ✗                                        | Plan in 2025 Q1                   |
diff --git a/docs/usage/running_vllm_with_ascend.md b/docs/usage/running_vllm_with_ascend.md
index 03de8dd5..50f4d92a 100644
--- a/docs/usage/running_vllm_with_ascend.md
+++ b/docs/usage/running_vllm_with_ascend.md
@@ -1 +1,207 @@
-# Running vLLM with Ascend
\ No newline at end of file
+# Running vLLM with Ascend
+
+## Preparation
+
+### Check CANN Environment
+
+Check your CANN environment:
+
+```bash
+cd /usr/local/Ascend/ascend-toolkit/latest/<arch>-linux  # <arch>: aarch64 or x86_64
+cat ascend_toolkit_install.info
+```
+
+The CANN version should be >= `8.0.RC2`, for example:
+
+```bash
+package_name=Ascend-cann-toolkit
+version=8.0.RC3
+```
+
+### Check NPU Device
+
+Check your available NPU chips:
+
+```bash
+npu-smi info
+```
+
+### Download Model
+
+Install `modelscope`:
+
+```bash
+pip install modelscope
+```
+
+Download the model with the ModelScope Python SDK:
+
+```python
+# /root/models/model_download.py
+from modelscope import snapshot_download
+
+model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct', cache_dir='/root/models')
+```
+
+Start the download:
+
+```bash
+python /root/models/model_download.py
+```
+
+To make vLLM load models from ModelScope instead of the HuggingFace Hub, set the following environment variable:
+
+```bash
+export VLLM_USE_MODELSCOPE=True
+```
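+
+If you would rather pull the weights from the HuggingFace Hub, a minimal sketch using the `huggingface_hub` Python SDK is shown below. This is not part of the steps above: it assumes `huggingface_hub` has been installed with pip, and the script path and target directory are only examples.
+
+```python
+# /root/models/hf_model_download.py -- example helper, mirroring the ModelScope script above
+from huggingface_hub import snapshot_download
+
+# Download the weights into a local directory that can later be passed to LLM(model=...)
+model_dir = snapshot_download(
+    repo_id="Qwen/Qwen2.5-7B-Instruct",
+    local_dir="/root/models/Qwen/Qwen2.5-7B-Instruct",
+)
+```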
+
+## Offline Inference
+
+### Install vllm and vllm-ascend
+
+Install `vllm` and `vllm-ascend` directly with `pip`:
+
+```bash
+pip install vllm vllm-ascend
+```
+
+### Offline Inference on a Single NPU
+
+Run the following script to execute offline inference on a single NPU:
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
+
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+> [!TIP]
+> You can use your local models for offline inference by replacing the value of `model` in `LLM()` with the model path, e.g. `/root/models/Qwen/Qwen2.5-7B-Instruct`.
+
+> [!NOTE]
+>
+> - `temperature`: Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random.
+> - `top_p`: Float that controls the cumulative probability of the top tokens to consider.
+
+You can find more information about the sampling parameters [here](https://docs.vllm.ai/en/stable/api/inference_params.html#sampling-params).
+
+If the script runs successfully, you will see output like the following:
+
+```bash
+Processed prompts: 100%|███████████████████████| 4/4 [00:00<00:00, 4.10it/s, est. speed input: 22.56 toks/s, output: 65.62 toks/s]
+Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
+Prompt: 'The president of the United States is', Generated text: ' Statesman A, and the vice president is Statesman B. If they are'
+Prompt: 'The capital of France is', Generated text: ' the city of Paris. This is a fact that can be found in any geography'
+Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
+```
+
+## Online Serving
+
+### Run Docker Container
+
+Build your Docker image using `vllm-ascend/Dockerfile`:
+
+```bash
+docker build -t vllm-ascend:1.0 .
+```
+
+> [!NOTE]
+> `.` is the directory that contains your Dockerfile.
+
+Launch your container:
+
+```bash
+docker run \
+    --name vllm-ascend \
+    --device /dev/davinci0 \
+    --device /dev/davinci_manager \
+    --device /dev/devmm_svm \
+    --device /dev/hisi_hdc \
+    -v /usr/local/dcmi:/usr/local/dcmi \
+    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+    -v /etc/ascend_install.info:/etc/ascend_install.info \
+    -v /root/models:/root/models \
+    -it vllm-ascend:1.0 bash
+```
+
+> [!TIP]
+> To use your local model, mount your model directory into the container, e.g. `-v /root/models:/root/models`.
+
+> [!NOTE]
+> You can replace `davinci0` with any of `davinci0` ~ `davinci7` to use a different NPU. Find more info about your devices with `npu-smi info`.
+
+### Online Serving on a Single NPU
+
+vLLM can be deployed as a server that implements the OpenAI API protocol, which allows it to be used as a drop-in replacement for applications using the OpenAI API. By default, the server starts at `http://localhost:8000`; you can specify a different address with the `--host` and `--port` arguments.
+
+Run the following command to start the vLLM server on a single NPU:
+
+```bash
+vllm serve Qwen/Qwen2.5-7B-Instruct
+```
+
+Once your server is started, you can query the model with input prompts:
+
+```bash
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen2.5-7B-Instruct",
+        "prompt": "San Francisco is a",
+        "max_tokens": 7,
+        "temperature": 0
+    }'
+```
+
+> [!TIP]
+> You can use your local models when launching the vLLM server or querying the model by replacing the value of `model` with the model path, e.g. `/root/models/Qwen/Qwen2.5-7B-Instruct`.
+
+If you query the server successfully, you will see output like the following:
+
+```bash
+...
+```
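+
+Since the server speaks the OpenAI API, you can also query it with the official `openai` Python client instead of `curl`. The snippet below is only a sketch: it assumes the `openai` package (v1 or later) is installed and that the server is running at the default address; the API key is a placeholder because the local server does not check it.
+
+```python
+from openai import OpenAI
+
+# Point the client at the local vLLM server; the key is unused but required by the client.
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+completion = client.completions.create(
+    model="Qwen/Qwen2.5-7B-Instruct",
+    prompt="San Francisco is a",
+    max_tokens=7,
+    temperature=0,
+)
+print(completion.choices[0].text)
+```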
+
+## Distributed Inference
+
+vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. To run multi-NPU inference with the `LLM` class, set the `tensor_parallel_size` argument to the number of NPUs you want to use.
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=4)
+
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+If you run this script successfully, you will see output like the following:
+
+```bash
+...
+```
diff --git a/docs/usage/supported_models.md b/docs/usage/supported_models.md
index edf3df6c..49e0d650 100644
--- a/docs/usage/supported_models.md
+++ b/docs/usage/supported_models.md
@@ -1,24 +1,24 @@
# Supported Models
-| Model | Supported | Note |
-|---------|-----------|------|
-| Qwen 2.5 | ✅ ||
-| Mistral | | Need test |
-| DeepSeek v2.5 | |Need test |
-| LLama3.1/3.2 | ✅ ||
-| Gemma-2 | |Need test|
-| baichuan | |Need test|
-| minicpm | |Need test|
-| internlm | ✅ ||
-| ChatGLM | ✅ ||
-| InternVL 2.5 | ✅ ||
-| Qwen2-VL | ✅ ||
-| GLM-4v | |Need test|
-| Molomo | ✅ ||
-| LLaVA 1.5 | ✅ ||
-| Mllama | |Need test|
-| LLaVA-Next | |Need test|
-| LLaVA-Next-Video | |Need test|
-| Phi-3-Vison/Phi-3.5-Vison | |Need test|
-| Ultravox | |Need test|
-| Qwen2-Audio | ✅ ||
+
+| Model                       | Supported | Note      |
+| --------------------------- | --------- | --------- |
+| Qwen 2.5                    | ✅         |           |
+| Mistral                     |           | Need test |
+| DeepSeek v2.5               |           | Need test |
+| Llama 3.1/3.2               | ✅         |           |
+| Gemma-2                     |           | Need test |
+| Baichuan                    |           | Need test |
+| MiniCPM                     |           | Need test |
+| InternLM                    | ✅         |           |
+| ChatGLM                     | ✅         |           |
+| InternVL 2.5                | ✅         |           |
+| Qwen2-VL                    | ✅         |           |
+| GLM-4V                      |           | Need test |
+| Molmo                       | ✅         |           |
+| LLaVA 1.5                   | ✅         |           |
+| Mllama                      |           | Need test |
+| LLaVA-Next                  |           | Need test |
+| LLaVA-Next-Video            |           | Need test |
+| Phi-3-Vision/Phi-3.5-Vision |           | Need test |
+| Ultravox                    |           | Need test |
+| Qwen2-Audio                 | ✅         |           |