Problem Description
All existing T3K model tests that exploit tensor parallelism use legacy CCL ops. These legacy CCL ops use the Erisc Data Mover (EDM), which is launched once per op invocation. This leads to several issues, some of which need resolving now:
Slower dispatch time, because dispatch goes out to Ethernet
Implicit cross-chip barriers on every CCL op launch and teardown
The TT-Distributed branch merge requires ops to stop launching to Ethernet
Ethernet cores are not targetable by TT-Distributed mesh programs
TT-Distributed does not support subdevices on Ethernet cores
In particular, TT-Distributed has a very large change ready to merge that is blocked by the above.
This work involves porting the various CCL op configurations to fabric, with performance at parity or better than before. Additionally, the models themselves must be updated to work with fabric and the async CCLs (which involves some work to set up global semaphores, persistent buffers, etc.).
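For concreteness, the model-side setup generally amounts to allocating a global semaphore and a persistent output buffer once, then reusing them on every invocation. The sketch below is illustrative only: it assumes a ttnn API shaped like the experimental async all-gather (`ttnn.create_global_semaphore`, `ttnn.experimental.all_gather_async`); exact names, signatures, and defaults may differ from what lands, and `persistent_output_tensor` in particular is an assumed parameter name.

```python
import torch
import ttnn

# Illustrative sketch only -- API names/signatures are assumptions, not the
# final interface. Shown for a T3K-style 2x4 mesh.
mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(2, 4))

# One-time setup: a global semaphore the async CCL workers synchronize on,
# allocated over the cores the CCL runs on and reused across invocations.
ccl_cores = ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(3, 3))})
ccl_semaphore = ttnn.create_global_semaphore(mesh_device, ccl_cores, 0)

# One-time setup: a persistent output buffer, so every invocation writes into
# the same device allocation instead of allocating/freeing per call.
# (Mesh sharding/replication arguments omitted for brevity.)
persistent_output = ttnn.from_torch(
    torch.zeros(1, 1, 32, 8192, dtype=torch.bfloat16),
    layout=ttnn.TILE_LAYOUT,
    device=mesh_device,
    memory_config=ttnn.DRAM_MEMORY_CONFIG,
)

# Steady state: each invocation reuses the semaphore and persistent buffer.
def model_all_gather(input_tensor):
    return ttnn.experimental.all_gather_async(
        input_tensor,
        dim=3,
        multi_device_global_semaphore=ccl_semaphore,
        persistent_output_tensor=persistent_output,  # assumed parameter name
        num_links=1,
        topology=ttnn.Topology.Ring,
    )
```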
Perf Targets
Perf targets aren't currently available for all ops on an op-by-op basis. Some ops have exact numbers captured. For ops that don't, the general target is roughly 13-15 GB/s of bidirectional bandwidth through the operation, with op launch latency under 2-3 us. In some cases this will be at parity with the legacy ops; in others it will be substantially better.
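As a rough sanity check on those numbers (illustrative arithmetic, not measured data): at the 13 GB/s low end, transfer time dominates launch latency for all but the smallest shards, which is why the launch-latency target matters most for single-tile-sized inputs.

```python
# Back-of-envelope estimate (illustrative only, no measured data): time for a
# ring all-gather where each chip moves (ring_size - 1) shards over its links.
def all_gather_transfer_us(shard_bytes: int, ring_size: int, bw_gb_s: float) -> float:
    moved_bytes = (ring_size - 1) * shard_bytes
    return moved_bytes / (bw_gb_s * 1e9) * 1e6

# 8-chip ring at the 13 GB/s low end of the target:
print(all_gather_transfer_us(1 << 20, 8, 13.0))  # ~565 us for a 1 MiB shard
print(all_gather_transfer_us(2048, 8, 13.0))     # ~1.1 us for one 32x32 bf16 tile
# For single-tile shards the transfer time is comparable to the 2-3 us
# launch-latency target, so launch overhead dominates small-input perf.
```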
Op Work Items (Updated)
After evaluating existing support, the following items are needed:
All-Gather
Enable width-sharded hand-crafted all-gather more generally (at minimum for all required test cases)
Note: this variant should work for any all-gather that is either (see the shape-qualification sketch after this list):
single-tile-high, or
single-tile (or single-element, for non-height dims) in every dim of the input tensor outer to the concat dim. The concat dim itself can be multi-tile (this is a generalization of the previous point).
Enable interleaved hand-crafted all-gather more generally
Enable a functional version of the "handcrafted" interleaved all-gather for variants not supported by the existing one
Enable all-gather optimization for interleaved tensors
Enable ring mode (optimization) for all-gather
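To make the note above concrete, here is one reading of the shape constraint as a checkable predicate. This is my interpretation of the note, not code from the port: 32x32 tiles and 4D [N, C, H, W] tensors are assumed, and the helper name is made up.

```python
# Illustrative check of the stated constraint (one reading of it; 32x32 tiles
# and 4D [N, C, H, W] tensors assumed).
TILE_H = 32

def handcrafted_all_gather_supported(shape: list[int], concat_dim: int) -> bool:
    height_dim = len(shape) - 2
    # Case 1: single-tile-high inputs always qualify.
    if shape[height_dim] <= TILE_H:
        return True
    # Case 2: every dim outer to the concat dim is a single tile (for the
    # height dim) or a single element (for the others); the concat dim
    # itself may span multiple tiles.
    for d in range(concat_dim):
        limit = TILE_H if d == height_dim else 1
        if shape[d] > limit:
            return False
    return True

# e.g. [1, 1, 32, 8192] gathered on dim 3 qualifies (single-tile-high);
# [1, 1, 128, 8192] on dim 3 does not (multi-tile height dim outer to concat).
```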
All-Gather-Matmul
Reduce-Scatter
Single-tile-high cases are expected to be implemented by @sjameelTT. Syncing on Tue, Mar about the applicability of the approach to a few multi-tile-high cases.
Known Models with Tensor Parallelism and CCLs
Most of these should end up working by updating TT-Transformers, but that work hasn't been scoped out yet.
Outdated
Op Work Items (Outdated -- see the updated section above)
The following general categories of CCL ops have been identified as in need of porting to the new async CCL approach:
All are tile tensors
Padded tiles???