Commit
[ScaleMM] Add a shape dependent max_swizzle size (#137681)
# Summary

I started to explore the performance of _scaled_mm against a Triton-based persistent TMA kernel for RowWise scaling. There are more details here: drisspg/transformer_nuggets#36

It clearly showed that there was some room for improvement on larger problem sizes compared to Triton's performance. Note that the Triton kernel only has a 128x128x128 tile shape, whereas scaled_mm has a 64x128x128 tile shape that we use for smaller problem sizes, which may explain some of the perf delta at smaller shapes.

This led to seeing if we can improve our Triton codegen lowering for _scaled_mm (I think we should still do this: #137517). In the meantime, @Chillee suggested I make sure swizzling is set for the large matmul shapes.

This PR makes sure that we increase the max_swizzle_size for the large matmuls.

## Performance

Note: red means the Triton-based TMA kernel beats _scaled_mm; blue means _scaled_mm is faster.

On nightly with Triton at (2ef33c6c4c3):

You can see that as M, K, N increase there is a clear win for the Triton persistent TMA kernel.

After this PR:

For example, with this change (power-limited GPU), M=16384 K=16384 N=16384:

TFlops before: `985.49`
TFlops after: `1304.69`

Pull Request resolved: #137681
Approved by: https://github.com/eqy
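To put the two TFlops figures in context, here is a minimal arithmetic sketch. It assumes the usual 2·M·K·N FLOP count for a dense matmul; the function name `matmul_flops` is hypothetical and not part of the PR, and the implied per-call times are derived from the reported throughput, not measured.

```python
# Back-of-the-envelope check of the numbers quoted in the PR description.
# Assumption: a dense (M x K) @ (K x N) matmul costs 2 * M * K * N FLOPs.

def matmul_flops(m: int, k: int, n: int) -> int:
    """Hypothetical helper: multiply-add count for a dense matmul."""
    return 2 * m * k * n

M = K = N = 16384
tflops_before = 985.49   # reported in the PR, power-limited GPU
tflops_after = 1304.69   # reported in the PR, same shape

total_flops = matmul_flops(M, K, N)

# Implied time per matmul in milliseconds (derived, not measured).
time_before_ms = total_flops / (tflops_before * 1e12) * 1e3
time_after_ms = total_flops / (tflops_after * 1e12) * 1e3

speedup = tflops_after / tflops_before

print(f"FLOPs per matmul: {total_flops:.3e}")
print(f"Implied time before: {time_before_ms:.2f} ms")
print(f"Implied time after:  {time_after_ms:.2f} ms")
print(f"Speedup: {speedup:.3f}x")
```

Under these assumptions, the change corresponds to roughly a one-third throughput improvement at this shape.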