sgmv_cutlass calculate wrong output #11
Comments
Hmm... That's interesting... BTW, thanks for providing this script. Super helpful for reproducing the bug! We'll take a look at this. In the meanwhile, you can use punica.ops.sgmv. The following works:

```python
import torch
import punica.ops

bs = 4
h1 = 4096
h2 = 32
num_layers = 1
dtype = torch.float16
device = torch.device("cuda:0")

problem_sizes = [2, 2]
w = [
    torch.ones((num_layers, h2, h1), dtype=dtype, device=device)
    for _ in range(len(problem_sizes))
]
w_ptr = torch.tensor([t.data_ptr() for t in w],
                     dtype=torch.int64,
                     device=device)
s = torch.cumsum(
    torch.tensor([0] + problem_sizes, device=device),
    dim=0,
    dtype=torch.int32)
x = torch.ones((s[-1], h1), dtype=dtype, device=device)
y = torch.zeros((s[-1], h2), dtype=dtype, device=device)

punica.ops.sgmv(y, x, w_ptr, s, layer_idx=0)
print(y)
```
|
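Since x and the weights in the snippet above are all ones, every entry of y should come out as h1 = 4096 after the shrink; a minimal sanity check for that:

```python
# With all-ones x and w, each output element should equal h1 = 4096.
expected = torch.full_like(y, h1)
assert torch.allclose(y, expected), y
```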
Thanks for your reply! |
@harryhan618 Modern GPUs support transpose at the fragment level. We will support row-major for the shrink kernel in the next release. |
@abcdabcd987 @yzh119

```python
import torch
import punica.ops

bs = 1
h1 = 1024
h2 = 64
num_layers = 32
dtype = torch.float16
device = torch.device("cuda:0")

problem_sizes = [1]
w = [
    torch.randn((num_layers, h2, h1), dtype=dtype, device=device)
    for _ in range(len(problem_sizes))
]
w_ptr = torch.tensor([t.data_ptr() for t in w],
                     dtype=torch.int64,
                     device=device)
s = torch.cumsum(
    torch.tensor([0] + problem_sizes, device=device),
    dim=0,
    dtype=torch.int32)
x = torch.ones((s[-1], h1), dtype=dtype, device=device)
y = torch.zeros((s[-1], h2), dtype=dtype, device=device)

# punica.ops.sgmv_cutlass(y, x, w_ptr, s, layer_idx=0)
punica.ops.sgmv(y, x, w_ptr, s, layer_idx=0)
print(y)
```

Output:
|
NVM, I found this is related to shared memory. PR: #20 |
Hi, any updates on why the cutlass group gemm calculates wrong results? |
I just added a few test cases: 0c7cf81. Cutlass only has this problem for shrink. Since we are deprecating the cutlass shrink, we probably won't fix this. Before our custom expand lands, you can use |
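For reference, a minimal correctness check along the same lines can be written against plain PyTorch. The sketch below assumes, as the examples in this thread suggest, that sgmv writes x[s[i]:s[i+1]] @ w[i][layer_idx].T into the corresponding rows of a zero-initialized y for each segment i; the tolerances are arbitrary fp16-friendly choices.

```python
import torch
import punica.ops

def reference_sgmv(x, w_list, s, layer_idx):
    # Plain PyTorch reference: per-segment matmul with the (h2, h1) weight slice,
    # accumulated in float32 for numerical stability.
    h2 = w_list[0].shape[1]
    y = torch.zeros((x.shape[0], h2), dtype=torch.float32, device=x.device)
    for i in range(len(w_list)):
        lo, hi = int(s[i]), int(s[i + 1])
        y[lo:hi] = x[lo:hi].float() @ w_list[i][layer_idx].float().T
    return y.to(x.dtype)

def check_sgmv(h1, h2, problem_sizes, num_layers=1, layer_idx=0):
    dtype = torch.float16
    device = torch.device("cuda:0")
    w = [
        torch.randn((num_layers, h2, h1), dtype=dtype, device=device)
        for _ in range(len(problem_sizes))
    ]
    w_ptr = torch.tensor([t.data_ptr() for t in w], dtype=torch.int64, device=device)
    s = torch.cumsum(
        torch.tensor([0] + problem_sizes, device=device), dim=0, dtype=torch.int32)
    x = torch.randn((s[-1], h1), dtype=dtype, device=device)
    y = torch.zeros((s[-1], h2), dtype=dtype, device=device)
    punica.ops.sgmv(y, x, w_ptr, s, layer_idx=layer_idx)
    torch.testing.assert_close(
        y, reference_sgmv(x, w, s, layer_idx), rtol=1e-2, atol=1e-2)

check_sgmv(h1=4096, h2=32, problem_sizes=[2, 2])
check_sgmv(h1=1024, h2=64, problem_sizes=[1], num_layers=32)
```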
Hi lequn, I think I found the bug in cutlass_shrink. Please first see CUTLASS example 24 (group GEMM). The second parameter for … In your code, for shrink, you wrote 4, which I think is the bug. For expand, you wrote 8, which is correct (presumably because a 128-bit vectorized access holds 128 / 16 = 8 fp16 elements). By the way, to make the code compile correctly, I had to change the thread block shape and warp shape to … So I'm also wondering how to choose these shapes, since that's the key difference between shrink and expand. I'm looking forward to seeing your insight! Thank you! |
This even fixes the performance gap between the cutlass-based expand/shrink kernels: in my tests, shrink from 6144 to 128 and expand from 128 to 6144 now have very similar latency. |
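For anyone who wants to reproduce that comparison, here is a rough timing sketch. It assumes the sgmv_cutlass call signature used earlier in this thread; the batch size, warmup count, and iteration count are arbitrary choices.

```python
import torch
import punica.ops

def bench_sgmv_cutlass(h_in, h_out, batch=32, warmup=10, iters=100):
    dtype = torch.float16
    device = torch.device("cuda:0")
    num_layers = 1
    problem_sizes = [batch]
    w = [torch.randn((num_layers, h_out, h_in), dtype=dtype, device=device)]
    w_ptr = torch.tensor([t.data_ptr() for t in w], dtype=torch.int64, device=device)
    s = torch.cumsum(
        torch.tensor([0] + problem_sizes, device=device), dim=0, dtype=torch.int32)
    x = torch.randn((s[-1], h_in), dtype=dtype, device=device)
    y = torch.zeros((s[-1], h_out), dtype=dtype, device=device)

    # Warm up so compilation / cache effects do not skew the measurement.
    for _ in range(warmup):
        punica.ops.sgmv_cutlass(y, x, w_ptr, s, layer_idx=0)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        punica.ops.sgmv_cutlass(y, x, w_ptr, s, layer_idx=0)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average latency in ms

print("shrink 6144 -> 128:", bench_sgmv_cutlass(6144, 128), "ms")
print("expand  128 -> 6144:", bench_sgmv_cutlass(128, 6144), "ms")
```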
I'm running the following code and the answer comes out wrong. I initialize x and w to be all ones, so every output y value should be h1 = 4096. But my output is not: half of the output is 4096 and the other half is 2528. Weird!
My observation is that the wrong answer happens when h2>=32 for shrink.
The following code is adapted from benchmarks/bench_sgmv_cutlass.py.
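The script itself did not survive here. A plausible reconstruction, based on the description above and on the workaround snippet earlier in the thread (the exact num_layers and problem_sizes are guesses), is:

```python
import torch
import punica.ops

h1 = 4096
h2 = 32           # wrong answers were observed for h2 >= 32
num_layers = 1
dtype = torch.float16
device = torch.device("cuda:0")

problem_sizes = [2, 2]
w = [
    torch.ones((num_layers, h2, h1), dtype=dtype, device=device)
    for _ in range(len(problem_sizes))
]
w_ptr = torch.tensor([t.data_ptr() for t in w], dtype=torch.int64, device=device)
s = torch.cumsum(
    torch.tensor([0] + problem_sizes, device=device), dim=0, dtype=torch.int32)
x = torch.ones((s[-1], h1), dtype=dtype, device=device)
y = torch.zeros((s[-1], h2), dtype=dtype, device=device)

punica.ops.sgmv_cutlass(y, x, w_ptr, s, layer_idx=0)
print(y)  # expected: every entry equal to h1 = 4096
```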