T3K Tensor Parallel Models (CCL) Port to Fabric Master Issue #18709

Open · 14 tasks
SeanNijjar opened this issue Mar 6, 2025 · 0 comments

Problem Description

All existing T3K model tests that exploit tensor parallelism use legacy CCL ops. These legacy ops use the Erisc Data Mover (EDM), which is launched per operation invocation. This leads to several issues, some of which need resolving now:

  • Slower dispatch time because of dispatch to ethernet
  • Implicit cross-chip barriers on every CCL op launch and teardown
  • TT-Distributed branch merge requires ops to stop launching to ethernet
    • Ethernet cores are not targetable by tt-distributed mesh programs
    • TT Distributed does not support subdevice on Ethernet cores

In particular, TT-Distributed has a very large change ready to merge, but it is blocked by the above.

This work involves porting various CCL op configurations to fabric with performance at parity with or better than the legacy ops. Additionally, the models themselves must be updated to work with fabric and the async CCLs (which involves some work to set up global semaphores, persistent buffers, etc.).
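
To make the model-side changes concrete, here is a minimal host-side sketch of the pattern the async CCLs imply: create the global semaphore and a persistent output buffer once at model setup, then reuse them on every forward pass instead of letting each op launch allocate and synchronize implicitly. The helper names and signatures used here (`create_global_semaphore`, `allocate_tensor_on_device`, `all_gather_async`, and their arguments) are assumptions for illustration, not the confirmed ttnn API.

```python
import ttnn


def setup_async_ccl_resources(mesh_device, ccl_worker_cores, output_spec):
    """One-time setup of the resources the async CCLs rely on.

    NOTE: the helper names and signatures below are assumptions for this
    sketch; consult the actual ttnn API when porting a model.
    """
    # Global semaphore used by the fabric/async CCL workers to coordinate
    # across chips -- created once at setup instead of per op launch.
    ccl_semaphore = ttnn.create_global_semaphore(mesh_device, ccl_worker_cores, 0)

    # Persistent output buffer reused on every invocation, so the op does not
    # allocate and tear down its destination each call.
    persistent_output = ttnn.allocate_tensor_on_device(output_spec, mesh_device)
    return ccl_semaphore, persistent_output


def tensor_parallel_forward(x, ccl_semaphore, persistent_output):
    """Per-iteration path: swap the legacy CCL call for the async variant."""
    # Legacy (EDM-based, relaunched on ethernet every call):
    #   x = ttnn.all_gather(x, dim=3)
    # Async, fabric-based (hypothetical signature for illustration):
    x = ttnn.experimental.all_gather_async(
        x,
        dim=3,
        multi_device_global_semaphore=ccl_semaphore,
        persistent_output_tensor=persistent_output,
        topology=ttnn.Topology.Linear,
    )
    return x
```

The intent is that cross-chip coordination state lives in long-lived, host-managed resources rather than being implicitly created and torn down by every CCL launch, which is what avoids the per-op ethernet dispatch and implicit barriers described above.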

Perf Targets

Perf targets aren't currently available for all ops on an op-by-op basis. Some ops have exact numbers captured. For ops that don't, the general performance target is roughly 13-15 GB/s bidirectional BW within the operation, with op launch latency under 2-3 us. In some cases this will be at parity with the legacy ops, and in others it will be substantially better.
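
As a rough check against these numbers, the sketch below estimates the wall-clock cost of a single CCL op from its payload size, the 13-15 GB/s bidirectional bandwidth target, and a 2-3 us launch latency. The 4 MB payload is a made-up example, not measured model data.

```python
def estimate_ccl_op_time_us(payload_bytes: int,
                            bw_gb_per_s: float = 13.0,
                            launch_latency_us: float = 3.0) -> float:
    """Rough per-op time: launch latency plus transfer time at the target BW.

    bw_gb_per_s reflects the 13-15 GB/s bidirectional target and
    launch_latency_us the 2-3 us launch-latency target from this issue.
    """
    transfer_us = payload_bytes / (bw_gb_per_s * 1e9) * 1e6
    return launch_latency_us + transfer_us


if __name__ == "__main__":
    payload = 4 * 1024 * 1024  # hypothetical 4 MB all-gather payload
    for bw in (13.0, 15.0):
        print(f"{payload} bytes at {bw} GB/s: ~{estimate_ccl_op_time_us(payload, bw):.1f} us")
```

At these targets a 4 MB payload works out to roughly 280-325 us of transfer time; for much smaller payloads the 2-3 us launch latency becomes the dominant term.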

Op Work Items (Updated)

After evaluating existing support, the following items are needed:

All-Gather

All-Gather-Matmul

Reduce-Scatter

Single-tile-high cases are expected to be implemented by @sjameelTT. Syncing on Tue, Mar about the applicability of the approach to a few multi-tile-high cases.

Known Models with Tensor Parallelism and CCLs

Most of these should end up working by updating TT-Transformers, but that work hasn't been scoped out yet.

  • Mixtral, T3K
    • Ported
  • Llama 8B, N300
    • Ported
  • Falcon 40B, T3K
    • Ported
  • Llama 70B, T3K (including similar models)
    • Ported
    • Qwen 2.5 72B, T3K
      • Ported
    • DeepSeek R1 Distill, T3K
      • Ported
    • Llama 3.3 70B, T3K
      • Ported

Outdated

Op Work Items (Outdated -- see the updated section above)

The following general categories of CCL ops have been identified as needing porting to the new async CCL approach. All are tile tensors:

| Dim | Op | Memory Layout | Shape | ShardSpec [optional] | Supported/Issue |
| --- | --- | --- | --- | --- | --- |
| width | all-gather | width-sharded | single-tile high | Same shard grid | Yes |
| width | all-gather | width-sharded | single-tile high | Different input/output shard grid | Yes |
| width | all-gather | interleaved | multi-tile high | N/A | #18730 |
| height | all-gather | interleaved | multi-tile high | N/A | #18738 |
| width | reduce-scatter | width-sharded | single-tile high | Width sharded | No, @sjameelTT working on a variant of this |
| width | reduce-scatter | interleaved | multi-tile high | N/A | #18739 |
| width | all-gather-matmul | width-sharded | single-tile high | Width sharded | #18740 |
| width | all-gather-matmul | interleaved | multi-tile high | N/A | #18740 |

Padded tiles???

  • None observed in Mixtral
    • At most would require changes to the host program