Replies: 3 comments 1 reply
-
At least part of the reason is that text generation is inherently memory-bound. The absolute minimum latency for generating a single token is the time it takes to stream the entire language model from VRAM to registers, and in this respect (VRAM bandwidth) the MI100 is only about 20% faster than the 3090. When generating in large batches or feeding long sequences through the model (which is essentially the same thing), you can start to take advantage of GEMM algorithms that leverage the multi-tiered memory architecture of GPUs by increasing the ratio of computation to global memory access. But token-by-token inference is ultimately a long series of memory-bound GEMV operations, and just not very compute-heavy at the end of the day.

Of course, even so you should still be getting about 20% more tokens per second on the MI100. The second issue is that ExLlama isn't written with AMD devices in mind. I don't own any, and while HIPifying the code seems to work for the most part, I can't actually test it myself, let alone optimize for a range of AMD GPUs.
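For a rough sense of what memory-bound means in practice, here is a back-of-the-envelope sketch. The model size and bandwidth figures are approximate spec-sheet values assumed for illustration, not measurements:

```python
# Rough ceiling on token-by-token generation speed when the bottleneck is
# streaming every weight from VRAM once per generated token.
# All numbers below are approximate and for illustration only.

def max_tokens_per_second(model_bytes: float, vram_bandwidth_bytes_per_s: float) -> float:
    """Upper bound: one full pass over the weights per generated token."""
    return vram_bandwidth_bytes_per_s / model_bytes

# Example: a 33B model quantized to ~4 bits is roughly 17 GB of weights.
model_bytes = 17e9

# Approximate spec-sheet memory bandwidths (assumed, not measured):
gpus = {
    "RTX 3090 (~0.94 TB/s)": 0.94e12,
    "MI100 (~1.2 TB/s)": 1.2e12,
}

for name, bw in gpus.items():
    print(f"{name}: at most ~{max_tokens_per_second(model_bytes, bw):.0f} tokens/s")
```

The ratio of the two ceilings is just the ratio of the memory bandwidths, which is why compute specs barely enter into single-token latency.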
-
Maybe you could try MLC LLM. I haven't tried it yet, as I don't have a lot of time these days. They claim to be as fast as or faster than ExLlama on NVIDIA GPUs, and they also claim equivalent speed using ROCm on AMD GPUs. They support Vulkan as well, but it seems to be slower than CUDA/ROCm. The potential or current problems are that they don't support multi-GPU, they use a different quantization format, and I couldn't find perplexity results for it. They also lack integrations: not a lot of models are directly available in their format, and popular UIs like ooba are not yet compatible with it.
-
MLC LLM gave me 70 tok/s on a 13B model in my testing, but sadly it currently can't split the larger models across GPUs.
-
I have two machines set up to test AIs. One has a pair of MI100s and the other has a 3090 and a P40. Looking at the specs, the 3090 is 2x the speed of the MI100 for FP32 operations, but the MI100 is 4.5x the speed of the 3090 for FP16 operations. My testing shows the MI100 to always be about half the speed of the 3090. I was under the impression that the math done for LLMs is generally FP16; is this correct? And if so, what might be the reason my MI100 setup is so slow? FYI, I am testing with a 33B-parameter LLM to keep to just one card for now. Thanks for the help.
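For reference, putting the ratios quoted in this thread side by side as a quick sanity check (these are only the numbers mentioned above, taken as assumptions rather than re-measured):

```python
# Ratios quoted in this thread (MI100 relative to 3090); assumptions, not re-measured.
fp16_compute_ratio = 4.5   # the poster's reading of the FP16 spec sheets
bandwidth_ratio    = 1.2   # "~20% faster" VRAM bandwidth, from the first reply
observed_ratio     = 0.5   # the poster's measurement: MI100 at about half the 3090's speed

print(f"Expected if FP16-compute-bound: ~{fp16_compute_ratio}x")
print(f"Expected if bandwidth-bound:    ~{bandwidth_ratio}x")
print(f"Observed:                       ~{observed_ratio}x")

# The observed ratio falling below even the bandwidth-bound estimate points at
# software (kernels tuned for NVIDIA, HIPified but not optimized for AMD) rather
# than a raw hardware limit, per the first reply above.
```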