
Commit 735da02

Update README.md
1 parent 44a4cca commit 735da02

1 file changed: +1 -121 lines

demo_trt_llm/README.md (+1 -121)

@@ -1,121 +1 @@

Removed:
# Run VILA demo on x86_64 machine

## Build TensorRT-LLM

The first step in building TensorRT-LLM is to fetch the sources:

```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
git lfs install

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout 66ef1df492f7bc9c8eeb01d7e14db01838e3f0bd
git submodule update --init --recursive
git lfs pull
```

Create the TensorRT-LLM Docker image (approximately 63 GB of disk space is required to build the image):

```bash
make -C docker release_build
```
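The steps that follow assume the container built above has been launched, although the launch command itself is not shown here. A minimal sketch, assuming the `release_run` target in the TensorRT-LLM `docker/Makefile` and the default image tag produced by `release_build` (both may differ in your checkout):

```bash
# Start the release container with GPU access via the repo's Makefile target
# (assumption: the checked-out TensorRT-LLM tree provides release_run).
make -C docker release_run

# Alternatively, start the built image directly; the tag below is illustrative
# and may not match what release_build produced on your machine.
docker run --gpus all -it --rm tensorrt_llm/release:latest
```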
After launching the Docker container, install the following dependencies:

```bash
pip install git+https://github.com/bfshi/scaling_on_scales.git
pip install git+https://github.com/huggingface/transformers@v4.36.2
```

## Build the TensorRT engine of the VILA model

### For VILA 1.0:

Please refer to the [documentation from TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal#llava-and-vila) to deploy the model.

### For VILA 1.5:

1. Setup

```bash
# clone VILA
git clone https://github.com/Efficient-Large-Model/VILA.git

# enter the demo folder
cd <VILA-repo>/demo_trt_llm

# apply a patch to /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py for VILA 1.5
sh apply_patch.sh

# download the VILA checkpoint
export MODEL_NAME="vila1.5-2.7b"
git clone https://huggingface.co/Efficient-Large-Model/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```

2. Build the TensorRT engine with `FP16` and run inference

Build the TensorRT engine for the LLaMA part of VILA from the HF checkpoint using `FP16`:

```bash
python convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --gemm_plugin float16 \
    --use_fused_mlp \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_multimodal_len 4096
```

3. Build TensorRT engines for the visual components

```bash
python build_visual_engine.py --model_path tmp/hf_models/${MODEL_NAME} --model_type vila --vila_path ../
```

4. Run the example script

```bash
python run.py \
    --max_new_tokens 100 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --image_file=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
    --input_text="<image>\n<image>\n Please elaborate what you see in the images?" \
    --run_profiling

# example output:
...
[Q] <image>\n<image>\n Please elaborate what you see in the images?
[04/30/2024-21:32:11] [TRT-LLM] [I]
[A] ['The first image shows a busy street scene with a car driving through a crosswalk. There are several people walking on the sidewalk, and a cyclist is also visible. The second image captures a beautiful sunset with the iconic Merlion statue spouting water into the water body in the foreground. The Merlion statue is a famous landmark in Singapore, and the water spout is a popular feature of the statue.']
...
```

5. (Optional) VILA can also be used with the other quantization options that LLaMA supports, such as SmoothQuant and INT4 AWQ. The instructions in the LLaMA README for enabling SmoothQuant and INT4 AWQ can be reused to generate quantized TRT engines for the LLM component of VILA; see the SmoothQuant sketch after the INT4 AWQ example below.

```bash
python quantization/quantize.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/int4_awq/1-gpu \
    --dtype float16 \
    --qformat int4_awq \
    --calib_size 32

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_awq/1-gpu \
    --output_dir trt_engines/${MODEL_NAME}/int4_awq/1-gpu \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_multimodal_len 4096

python run.py \
    --max_new_tokens 100 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/int4_awq/1-gpu \
    --image_file=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
    --input_text="<image>\n<image>\n Please elaborate what you see in the images?" \
    --run_profiling
```
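Step 5 mentions SmoothQuant but only shows the INT4 AWQ commands. A minimal sketch of the SmoothQuant route, assuming the `--smoothquant`, `--per_token`, and `--per_channel` options from the upstream TensorRT-LLM LLaMA example also apply to this demo's `convert_checkpoint.py`; the `sq` output directories are illustrative names:

```bash
# Convert the HF checkpoint with SmoothQuant (alpha = 0.5) applied to the LLM part.
# Assumption: these flags match the upstream TensorRT-LLM LLaMA converter and
# are not changed by the demo's patch.
python convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/sq/1-gpu \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel

# Build the engine from the SmoothQuant checkpoint with the same limits as the FP16 build.
trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/sq/1-gpu \
    --output_dir trt_engines/${MODEL_NAME}/sq/1-gpu \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_multimodal_len 4096
```

The resulting engine can then be used with `run.py` by pointing `--llm_engine_dir` at `trt_engines/${MODEL_NAME}/sq/1-gpu`.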
Added:

## Please refer to the [TensorRT-LLM example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal#llava-and-vila) for VILA deployment.
