The first step is to try a hardcoded "minimal" interleaved all-gather to confirm whether it is functional. If it is not, proceed to implement a version that supports multi-tile-high tensors.
Getting Perf with Interleaved
Interleaved layout, by definition, keeps only one tile contiguous in memory at a time.
If the producer sends each tile (fabric multicast) directly to the destination, fabric utilization will be very low:
- The biggest packet would be 2 KB, which would achieve only 10 GB/s bidirectional at best
We need to coalesce tiles into larger packets on the sender side and scatter them on the receiver side.
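As a rough illustration of the coalesce-then-scatter idea, here is a minimal host-side Python sketch. It is not the kernel implementation. The tile size (2048 bytes, i.e. one 32x32 bf16 tile), the coalescing factor, the bank count, and the round-robin tile-to-bank mapping are all assumptions chosen for the example, and `coalesce` and `scatter` are hypothetical helper names.

```python
TILE_BYTES = 2048       # assumption: one 32x32 bf16 tile
TILES_PER_PACKET = 4    # hypothetical coalescing factor
NUM_BANKS = 8           # hypothetical number of memory banks


def coalesce(tiles, tiles_per_packet=TILES_PER_PACKET):
    """Sender side: group consecutive (tile_id, payload) pairs into packets."""
    packets = []
    for start in range(0, len(tiles), tiles_per_packet):
        chunk = tiles[start:start + tiles_per_packet]
        ids = [tile_id for tile_id, _ in chunk]
        payload = b"".join(data for _, data in chunk)
        packets.append((ids, payload))
    return packets


def scatter(packets, num_banks=NUM_BANKS, tile_bytes=TILE_BYTES):
    """Receiver side: split each packet into tiles and write every tile to
    its own bank slot (assumed round-robin interleaving by tile id)."""
    banks = [dict() for _ in range(num_banks)]
    for ids, payload in packets:
        for i, tile_id in enumerate(ids):
            tile = payload[i * tile_bytes:(i + 1) * tile_bytes]
            banks[tile_id % num_banks][tile_id // num_banks] = tile
    return banks


# Ten dummy tiles, each filled with its own id byte.
tiles = [(t, bytes([t]) * TILE_BYTES) for t in range(10)]
packets = coalesce(tiles)
banks = scatter(packets)
```

With these assumed numbers, a full packet carries 8 KB instead of 2 KB, and the receiver still places each tile individually, so the interleaved layout of the output tensor is preserved.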
Basic Dataflow
The diagram below shows the basic data flow from one chip to its neighbour. The same structure is mirrored in the reverse direction:
Some Configs:
Generally speaking, these include all the decode shapes but with a much larger height for the tensor (e.g. 2k, 4k, 8k, 16k)
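
| Model / system | Input shape | Output shape | Devices | Dtype | Dim | Layout |
|---|---|---|---|---|---|---|
| Llama 8B, N300 | [128, 2048] | [128, 4096] | 2 | bf16 | 3 | Interleaved, Tile |
| T3K Falcon 40, Prefill | [2048, 1024] | [2048, 8192] | 8 | DF | 3 | Interleaved, Tile |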
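
The shapes in the two configs above can be sanity-checked with a short Python sketch. It assumes that the gather runs along the last (width) dimension, that the number listed after the shapes is the device count, and that tiles are 32x32. The helper names are hypothetical and are not part of any library.

```python
TILE_H = TILE_W = 32  # assumption: 32x32 tiles


def all_gather_output_shape(input_shape, num_devices, dim=-1):
    """Shape after gathering `num_devices` equal shards along `dim`."""
    out = list(input_shape)
    out[dim] *= num_devices
    return out


def tiles_per_device(input_shape):
    """Number of tiles in one device's shard (uses the last two dims)."""
    height, width = input_shape[-2], input_shape[-1]
    return (height // TILE_H) * (width // TILE_W)


configs = [
    # (name, input shape, expected output shape, device count)
    ("Llama 8B, N300", [128, 2048], [128, 4096], 2),
    ("T3K Falcon 40, Prefill", [2048, 1024], [2048, 8192], 8),
]

for name, in_shape, out_shape, devices in configs:
    assert all_gather_output_shape(in_shape, devices) == out_shape
    print(name, "tiles per device:", tiles_per_device(in_shape))
```

Under these assumptions both listed output shapes are consistent with a width-dimension gather, and each shard is several tiles high (4 and 64 tile rows respectively), which is the multi-tile-high case this issue targets.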
SeanNijjar changed the title from "Port Optimized Legacy All-Gather (Interleaved, Tile, Multi-Tile High Tensor) To All-Gather Asyncv" to "Port Optimized Legacy All-Gather (Width Dim, Interleaved, Tile, Multi-Tile High Tensor) To All-Gather Asyncv" on Mar 6, 2025