QwQ-32B support (#18691)
Add out-of-the-box support for QwQ-32B
yieldthought authored Mar 6, 2025
1 parent 2568edc commit 89a38f6
Showing 3 changed files with 25 additions and 19 deletions.
README.md (3 changes: 2 additions & 1 deletion)
@@ -41,8 +41,9 @@
| [Qwen 2.5 72B (TP=8)](./models/demos/llama3) | 32 | [QuietBox](https://tenstorrent.com/hardware/tt-quietbox) | 333 | 14.5 | 20 | 464.0 | [v0.56.0-rc33](https://github.com/tenstorrent/tt-metal/tree/v0.56.0-rc33) | [9ac3783](https://github.com/tenstorrent/vllm/tree/9ac3783d5e3a4547f879f2cdadaab8571047a0a8) |
| [DeepSeek R1 Distill Llama 3.3 70B (TP=8)](./models/demos/llama3) | 32 | [QuietBox](https://tenstorrent.com/hardware/tt-quietbox) | 180 | 15.2 | 20 | 486.4 | [v0.56.0-rc33](https://github.com/tenstorrent/tt-metal/tree/v0.56.0-rc33) | [9ac3783](https://github.com/tenstorrent/vllm/tree/9ac3783d5e3a4547f879f2cdadaab8571047a0a8) |
| [Falcon 7B (DP=32)](./models/demos/tg/falcon7b) | 1024 | [Galaxy](https://tenstorrent.com/hardware/galaxy) | 223 | 4.8 | 26 | 4915.2 | [v0.56.0-rc6](https://github.com/tenstorrent/tt-metal/tree/v0.56.0-rc6) | |
+ | [QwQ 32B (TP=8)](./models/demos/llama3) | 32 | [QuietBox](https://tenstorrent.com/hardware/tt-quietbox) | 133 | 25.2 | | 464.0 | [main](https://github.com/tenstorrent/tt-metal/) | [9ac3783](https://github.com/tenstorrent/vllm/tree/9ac3783d5e3a4547f879f2cdadaab8571047a0a8) |

- > **Last Update:** March 5, 2025
+ > **Last Update:** March 6, 2025
>
> **Notes:**
>
models/demos/llama3/README.md (39 changes: 21 additions & 18 deletions)
@@ -1,15 +1,16 @@
# Llama-like Models

- This code can run Llama3 family of models and other similar models including Qwen2.5 and DeepSeek-R1-Distill variants.
+ This code can run Llama3 family of models and other similar models including QwQ, Qwen2.5 and DeepSeek-R1-Distill variants.

- The current version supports the following Llama3 models:
+ The current version is known to support the following Llama3 models:
- Llama3.2-1B
- Llama3.2-3B
- Llama3.1-8B
- Llama3.2-11B
- Llama3.1-70B (T3000 and TG-only)
- Qwen2.5-7B
- Qwen2.5-72B
+ - QwQ-32B
- DeepSeek R1 Distill Llama 3.3 70B (T3000 and TG-only)

All the above llama models (with the exception of 70B due to its large size) are compatible and tested on the following Tenstorrent hardware:
@@ -21,22 +22,6 @@ All the above llama models (with the exception of 70B due to its large size) are
Qwen-7B requires N300
Qwen-72B requires T3K

- **Max Context Lengths (text-only)**: All of the compatible model/device combinations support a max prefill context-length of 128k, with the exception of Llama3.1-8B and Llama3.2-11B on N150 which have a max of 64k (due to a lack of memory). To support these large max context-lengths, chunked prefill is performed with different max chunk sizes as shown in the table below.
-
- Max Prefill Chunk Sizes (text-only):
- | | N150 | N300 | T3K | TG |
- |--------------|---------------|---------------|----------------|-------------|
- | Llama3.2-1B | 128k tokens | 128k tokens | 128k tokens | 128k tokens |
- | Llama3.2-3B | 8k tokens | 128k tokens | 128k tokens | 128k tokens |
- | Llama3.1-8B | 4k tokens | 64k tokens | 128k tokens | 128k tokens |
- | Llama3.2-11B | 4k tokens | 64k tokens | 128k tokens | 128k tokens |
- | Llama3.1-70B | Not supported | Not supported | 32k tokens | 128k tokens |
- | DeepSeek-R1-Distill-Llama3.3-70B | Not supported | Not supported | 32k tokens | 128k tokens |
-
- - These max chunk sizes are specific to max context length 128k and are configured via `MAX_PREFILL_CHUNK_SIZES_DIV1024` in [model_config.py](https://github.com/tenstorrent/tt-metal/blob/main/models/demos/llama3/tt/model_config.py). If the max context length is set to a smaller value using the `max_seq_len` flag (see [Run the demo](#run-the-demo)), these chunk sizes can possibly be increased due to using a smaller KV cache.
-
- **Max Context Lengths (Llama3.2-11B multimodal)**: Llama3.2-11B multimodal is currently only supported on N300 and T3000. On N300, a max prefill context length of 8k is supported, while T3000 supports a max context length of 128k.
-
## How to Run

### Llama models: download the weights
@@ -175,3 +160,21 @@ pytest models/demos/llama3/demo/simple_text_demo.py -k "performance and batch-1"
### Expected performance and accuracy

See [PERF.md](PERF.md) for expected performance and accuracy across different configurations.

+ ### Implementation notes
+
+ **Chunked prefill (text-only)**: All of the compatible model/device combinations support a max prefill context-length of 128k, with the exception of Llama3.1-8B and Llama3.2-11B on N150 which have a max of 64k (due to a lack of memory). To support these large max context-lengths, chunked prefill is performed with different max chunk sizes as shown in the table below.
+
+ Max Prefill Chunk Sizes (text-only):
+ | | N150 | N300 | T3K | TG |
+ |--------------|---------------|---------------|----------------|-------------|
+ | Llama3.2-1B | 128k tokens | 128k tokens | 128k tokens | 128k tokens |
+ | Llama3.2-3B | 8k tokens | 128k tokens | 128k tokens | 128k tokens |
+ | Llama3.1-8B | 4k tokens | 64k tokens | 128k tokens | 128k tokens |
+ | Llama3.2-11B | 4k tokens | 64k tokens | 128k tokens | 128k tokens |
+ | Llama3.1-70B | Not supported | Not supported | 32k tokens | 128k tokens |
+ | DeepSeek-R1-Distill-Llama3.3-70B | Not supported | Not supported | 32k tokens | 128k tokens |
+
+ - These max chunk sizes are specific to max context length 128k and are configured via `MAX_PREFILL_CHUNK_SIZES_DIV1024` in [model_config.py](https://github.com/tenstorrent/tt-metal/blob/main/models/demos/llama3/tt/model_config.py). If the max context length is set to a smaller value using the `max_seq_len` flag (see [Run the demo](#run-the-demo)), these chunk sizes can possibly be increased due to using a smaller KV cache.
+
+ **Chunked prefill (Llama3.2-11B multimodal)**: Llama3.2-11B multimodal is currently only supported on N300 and T3000. On N300, a max prefill context length of 8k is supported, while T3000 supports a max context length of 128k.
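As a concrete illustration of the chunked-prefill scheme described in the section added above: a long prompt is prefilled in passes no larger than the per-device max chunk size. Below is a minimal sketch of that splitting; `split_into_prefill_chunks` is a hypothetical helper for illustration, not tt-metal's actual API.

```python
# Illustrative sketch only: shows how a max chunk size bounds the token
# range processed per prefill pass. Not part of tt-metal.
def split_into_prefill_chunks(prompt_len: int, max_chunk_tokens: int) -> list[tuple[int, int]]:
    """Return (start, end) token ranges that cover the prompt chunk by chunk."""
    return [
        (start, min(start + max_chunk_tokens, prompt_len))
        for start in range(0, prompt_len, max_chunk_tokens)
    ]

# Example: per the table above, Llama3.1-8B on N300 allows 64k-token chunks,
# so a 128k-token prompt is prefilled in two passes.
print(split_into_prefill_chunks(128 * 1024, 64 * 1024))
# -> [(0, 65536), (65536, 131072)]
```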
models/demos/llama3/tt/model_config.py (2 changes: 2 additions & 0 deletions)
@@ -205,6 +205,7 @@ def __init__(
"Qwen2.5-7B": {"N150": 4, "N300": 64, "T3K": 128, "TG": 128},
"Qwen2.5-72B": {"N150": None, "N300": None, "T3K": 32, "TG": 128},
"Phi-3.5-mini-instruct": {"N150": 128, "N300": 128, "T3K": 128, "TG": 128},
"QwQ-32B": {"N150": None, "N300": None, "T3K": 64, "TG": 128},
}
try:
max_prefill_chunk_size_div1024 = MAX_PREFILL_CHUNK_SIZES_DIV1024[self.base_model_name][self.device_name]
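The hunk is cut off after the `try:` lookup. As a rough, self-contained sketch of the pattern (not the repo's exact code; the error handling below is invented for illustration):

```python
# Values are stored divided by 1024, so 64 means a 64k-token max chunk;
# None marks a model/device combination that is not supported.
MAX_PREFILL_CHUNK_SIZES_DIV1024 = {
    "QwQ-32B": {"N150": None, "N300": None, "T3K": 64, "TG": 128},
}

base_model_name, device_name = "QwQ-32B", "T3K"
try:
    chunk_size_div1024 = MAX_PREFILL_CHUNK_SIZES_DIV1024[base_model_name][device_name]
except KeyError as e:
    raise ValueError(f"No max prefill chunk size known for {base_model_name} on {device_name}") from e

print(chunk_size_div1024 * 1024)  # 65536 tokens per chunk on T3K
```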
@@ -1034,6 +1035,7 @@ def _set_params_from_dict(self, params):
default_padded_cores = {
"Qwen2.5-72B": 32,
"Qwen2.5-7B": 16,
"QwQ-32B": 16,
}.get(self.base_model_name, 0)

# Override MLP padding cores from env var
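For the second hunk: `default_padded_cores` uses a dict `.get` with a fallback of 0 for models not listed, and the trailing comment indicates an environment-variable override follows (truncated in this diff). A minimal sketch of that pattern, using `TT_MLP_PAD_CORES` as a made-up variable name since the real one is not visible here:

```python
import os

# Sketch only: reproduces the .get-with-default pattern from the hunk above.
# "TT_MLP_PAD_CORES" is a hypothetical env var name; the real override code
# is cut off in this diff.
base_model_name = "QwQ-32B"

default_padded_cores = {
    "Qwen2.5-72B": 32,
    "Qwen2.5-7B": 16,
    "QwQ-32B": 16,
}.get(base_model_name, 0)  # models not listed fall back to 0 (no padding)

padded_cores = int(os.environ.get("TT_MLP_PAD_CORES", default_padded_cores))
print(padded_cores)  # 16 for QwQ-32B unless overridden
```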
