Port Optimized Legacy All-Gather (Width Dim, Interleaved, Tile, Multi-Tile High Tensor) To All-Gather Asyncv #18730

Open
SeanNijjar opened this issue Mar 6, 2025 · 0 comments
Labels
llm_t3000 P1 perf for issues tracking performance problems/improvements

SeanNijjar commented Mar 6, 2025

The first step is to try the hardcoded "minimal" interleaved all-gather and confirm whether it is functional. If it is not, proceed to implement a version that supports multi-tile-high tensors.

Getting Perf with Interleaved

  • An interleaved tensor, by definition, has only one tile contiguous in memory at a time
    • If the producer sends each tile (fabric multicast) straight to the destination, fabric utilization will be very low:
      - The biggest packet would be 2 KB (one tile), which would achieve only 10 GB/s bidirectional at best
    • Tiles need to be coalesced into packets on the sender side and scattered on the receiver side
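
The coalesce-and-scatter idea above can be sketched in a few lines of Python. This is only an illustration of the data movement, not the kernel design: the 2048-byte tile size assumes a 32x32 bf16 tile, and `TILES_PER_PACKET`, `Packet`, `coalesce`, and `scatter` are hypothetical names and values that do not come from this issue or from the tt-metal codebase.

```python
from dataclasses import dataclass

TILE_BYTES = 2048        # assumption: one 32x32 bf16 tile
TILES_PER_PACKET = 4     # hypothetical packet capacity, not taken from the issue


@dataclass
class Packet:
    first_tile_id: int   # destination page index of the first tile in the payload
    payload: bytes       # several tiles packed back to back


def coalesce(tiles, first_tile_id=0, tiles_per_packet=TILES_PER_PACKET):
    """Sender side: pack consecutive tile pages into fewer, larger packets."""
    packets = []
    for start in range(0, len(tiles), tiles_per_packet):
        chunk = tiles[start:start + tiles_per_packet]
        packets.append(Packet(first_tile_id + start, b"".join(chunk)))
    return packets


def scatter(packet, dest_pages, tile_bytes=TILE_BYTES):
    """Receiver side: split a packet and write each tile to its own page."""
    num_tiles = len(packet.payload) // tile_bytes
    for i in range(num_tiles):
        begin = i * tile_bytes
        dest_pages[packet.first_tile_id + i] = packet.payload[begin:begin + tile_bytes]
```

With these assumed values, ten tiles travel as three packets (8 KB, 8 KB, 4 KB) instead of ten 2 KB packets, and the receiver restores one tile per page.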

Basic Dataflow

Diagram below shows basic data flow from one chip to its neighbour. This same structure is mirrored in the reverse direction:

Image

Some Configs:

Generally speaking, these include all the decode shapes but with a much larger height for the tensor (e.g. 2k, 4k, 8k, 16k)
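| Model / system | Input shape | Output shape | Devices | Dtype | Gather dim | Layout |
| Llama 8B, N300 | [128, 2048] | [128, 4096] | 2 | bf16 | 3 | Interleaved, Tile |
| T3K Falcon 40, Prefill | [2048, 1024] | [2048, 8192] | 8 | DF | 3 | Interleaved, Tile |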
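
To make the width-dim, multi-tile-high case concrete, the sketch below computes which output tile pages each device's input tiles land on. It assumes 32x32 tiles, row-major tile (page) ordering, and shapes given as [height, width]; `output_tile_ids` is a hypothetical helper for illustration, not an existing API.

```python
TILE = 32  # assumption: 32x32 tiles


def output_tile_ids(in_shape, num_devices, device):
    """Return, per input tile row, the range of output tile ids written by `device`.

    Assumes a width-dim all-gather with row-major tile ordering in the output.
    """
    h_tiles = in_shape[0] // TILE
    w_tiles = in_shape[1] // TILE
    out_w_tiles = w_tiles * num_devices
    rows = []
    for r in range(h_tiles):
        # Each tile row of the input maps to one contiguous run in the output row.
        start = r * out_w_tiles + device * w_tiles
        rows.append(range(start, start + w_tiles))
    return rows
```

Under these assumptions, device 1 in the Llama 8B shape ([128, 2048] on 2 devices) writes output tiles 64 to 127 for its first tile row and 192 to 255 for its second. Each tile row is a separate run in the output, so a tensor that is several tiles high cannot be treated as one contiguous block.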

@SeanNijjar SeanNijjar added P1 perf for issues tracking performance problems/improvements llm_t3000 labels Mar 6, 2025
@SeanNijjar SeanNijjar changed the title Port Optimized Legacy All-Gather (Interleaved, Tile, Multi-Tile High Tensor) To All-Gather Asyncv Port Optimized Legacy All-Gather (Width Dim, Interleaved, Tile, Multi-Tile High Tensor) To All-Gather Asyncv Mar 6, 2025