
Add a DeviceBoundThreadPool class conforming to the ThreadPool interface #18751

Open - wants to merge 1 commit into base: main

Conversation

tt-asaigal (Contributor)

Ticket

Link to Github Issue

Problem description

  • The existing Boost-backed thread pool implementation does not guarantee an optimal distribution of work across devices
  • Performance analysis also shows that a custom implementation with threads pinned to physical devices incurs lower host overhead

What's changed

  • Add a DeviceBoundThreadPool class conforming to the ThreadPool interface
  • This class is built on top of the NumaAwareExecutor class
  • Can be created using the create_device_bound_thread_pool API (see the usage sketch below)
  • Allows submitting tasks to specific threads (each thread is tied to a physical device, and can thus process tasks only for that device)
  • Is NUMA-aware: threads spawned for a physical device are pinned to the NUMA nodes "closest" to that device
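A rough usage sketch of the API described above. The create_device_bound_thread_pool and enqueue signatures are taken from this PR's diff; the header path, the wait() call, and the write_for_device helper are assumptions for illustration only.

#include <cstdint>
#include <memory>

#include "tt_metal/common/thread_pool.hpp"  // assumed header, based on the files linked in this PR

// Hypothetical per-device task, used only for illustration.
void write_for_device(uint32_t device_idx);

void example() {
    constexpr uint32_t kNumDevices = 8;
    // One worker thread per physical device, pinned near that device's NUMA node.
    auto pool = tt::tt_metal::create_device_bound_thread_pool(kNumDevices);

    for (uint32_t device_idx = 0; device_idx < kNumDevices; ++device_idx) {
        // The second argument routes the task to the thread tied to this device.
        pool->enqueue([device_idx]() { write_for_device(device_idx); }, device_idx);
    }
    pool->wait();  // assumed synchronization API, discussed later in this thread
}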

@dmakoviichuk-tt (Contributor)

There are a lot of good thread pool implementations all over the internet. It doesn't make sense to reinvent the wheel; you will have to maintain it.
Also, your performance comparison showed a difference of a few nanoseconds. Your previous implementation had a couple of issues and wasn't 100% correct for multithreading, so the comparison isn't really meaningful.
Please try a baseline solution first of all. It doesn't make sense to optimize away a few nanoseconds when we are losing microseconds in a lot of other places.
It would also save you a lot of time. I don't want to investigate random crashes that happen once a day, which I am already doing. Why not use an existing solution and concentrate on the real problem we are trying to solve: bringing up fabric?
When everything is ready, doesn't crash, and is fast enough, I am ready to think about future improvements.

Btw, for the Boost thread pool, here is an easy way to set thread affinities (maybe there is a better way, I don't know).

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <pthread.h>   // pthread_setaffinity_np (requires _GNU_SOURCE on glibc)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <future>
#include <iostream>
#include <utility>
#include <vector>

boost::asio::thread_pool pool(4);
std::vector<std::future<int>> futures;

for (int i = 0; i < 4; ++i) {
    // Wrap the task in a packaged_task to get a future;
    // boost::asio::post itself does not return one.
    std::packaged_task<int()> task([i]() {
        // Get the native handle of the worker thread running this task
        pthread_t thread = pthread_self();

        // Set the affinity mask to bind this worker to core i
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(i, &cpuset);

        int result = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
        if (result != 0) {
            std::cerr << "Error setting thread affinity for thread " << i << std::endl;
        } else {
            std::cout << "Thread " << i << " bound to core " << i << std::endl;
        }

        return i * i;
    });

    futures.push_back(task.get_future());
    boost::asio::post(pool, std::move(task));
}

@Denys88 commented Mar 6, 2025

"Existing Boost backed thread pool implementation does not guarantee an optimal distribution of work across devices
"
Please provide example when and how it happens.

    }

private:
    std::vector<std::unique_ptr<NumaAwareExecutor>> workers_;
Contributor

Can't we have a vector of Boost thread pools (backed by 1 thread each), and then follow Denys' suggestion on pinning the threads to CPU cores correspondingly?
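For reference, a minimal sketch of that alternative, assuming a hypothetical PinnedPool wrapper (one single-threaded boost::asio::thread_pool per device, with its worker pinned via the same pthread affinity call shown above); this is an illustration, not code from this PR.

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <pthread.h>  // Linux-specific; requires _GNU_SOURCE for the affinity APIs
#include <sched.h>
#include <memory>
#include <vector>

// Hypothetical helper: pin the calling thread to a single core.
static void pin_current_thread_to_core(int core) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

// Hypothetical wrapper: a single-threaded pool whose worker is pinned to `core`.
struct PinnedPool {
    explicit PinnedPool(int core) : pool(1) {
        // The first task posted runs on the pool's only thread and pins it.
        boost::asio::post(pool, [core]() { pin_current_thread_to_core(core); });
    }
    boost::asio::thread_pool pool;
};

int main() {
    std::vector<std::unique_ptr<PinnedPool>> pools;
    for (int core = 0; core < 8; ++core) {
        pools.push_back(std::make_unique<PinnedPool>(core));
    }
    // Work for device i is posted to pools[i], so it always runs on that pinned thread.
    boost::asio::post(pools[0]->pool, []() { /* write to device 0 */ });
    for (auto& p : pools) {
        p->pool.join();
    }
}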

@@ -12,10 +12,11 @@ namespace tt::tt_metal {
 class ThreadPool {
 public:
     virtual ~ThreadPool() = default;
-    virtual void enqueue(std::function<void()>&& f) = 0;
+    virtual void enqueue(std::function<void()>&& f, uint32_t thread_idx = 0) = 0;
Contributor

Let's change thread_idx to device_idx, and also make it optional. I think it makes sense to support a general case, where we do some work that is not affiliated with a device?

Contributor

Yeah, let's see if we really need thread_idx on the interface. Seems like we just need a uniform distribution of work, which can be done underneath the API by better assignment of work to threads.

Contributor

Oh, I thought the point of this PR is to make sure dispatches related to particular devices go to particular threads? My comment here was to say I treat it as a "hint" not as a requirement.

Contributor

Let's see what the perf results show, but I agree with your comment.

Contributor Author

Thanks for the idea, Oleg. I think we can:

  1. Use the hint provided by the caller, if given.
  2. Maintain internal state and use it to pick a thread when no hint is provided.

This way we can ensure an even distribution of work. I'll update the arg to be device_idx (a rough sketch of this policy follows below).
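As a reference, a minimal sketch of that policy under a few assumptions: HintedPool and Worker are illustrative stand-ins (not the classes in this PR), and the std::optional device hint is only one possible way to make the argument optional.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Worker is a placeholder for the per-device executor; it just runs tasks inline here.
struct Worker {
    void push(std::function<void()>&& f) { f(); }
};

class HintedPool {
public:
    explicit HintedPool(uint32_t num_threads) : workers_(num_threads) {}

    void enqueue(std::function<void()>&& f, std::optional<uint32_t> device_idx = std::nullopt) {
        // Use the caller's hint when given; otherwise round-robin for an even distribution.
        const std::size_t idx = device_idx
                                    ? (*device_idx % workers_.size())
                                    : (next_.fetch_add(1, std::memory_order_relaxed) % workers_.size());
        workers_[idx].push(std::move(f));
    }

private:
    std::vector<Worker> workers_;
    std::atomic<uint32_t> next_{0};
};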

Comment on lines 19 to +20
     std::shared_ptr<ThreadPool> create_boost_thread_pool(int num_threads);
+    std::shared_ptr<ThreadPool> create_device_bound_thread_pool(int num_threads, uint32_t logical_cpu_offset = 0);
Contributor

Adding some tests would be great - we can just have a simple stress test, but if you compile it and run with --enable-tsan it might show bugs.
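For example, a minimal stress test along these lines (GoogleTest; the header path and the wait() call are assumptions, and the counter-based check is only illustrative):

#include <gtest/gtest.h>

#include <atomic>
#include <cstdint>

#include "tt_metal/common/thread_pool.hpp"  // assumed header, based on the files linked in this PR

TEST(ThreadPoolStressTest, AllTasksRunExactlyOnce) {
    constexpr uint32_t kThreads = 8;
    constexpr uint32_t kTasks = 100000;
    auto pool = tt::tt_metal::create_device_bound_thread_pool(kThreads);

    std::atomic<uint32_t> counter{0};
    for (uint32_t i = 0; i < kTasks; ++i) {
        // Spread tasks across all threads; running under TSan should flag data races.
        pool->enqueue([&counter]() { counter.fetch_add(1, std::memory_order_relaxed); }, i % kThreads);
    }
    pool->wait();  // assumed synchronization API on the ThreadPool interface

    EXPECT_EQ(counter.load(), kTasks);
}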

@Denys88 commented Mar 7, 2025

As I understand it, you need to have a few different pools. One per NUMA node?

@tt-asaigal (Contributor Author) commented Mar 7, 2025

Thanks for the feedback and ideas, everyone; I appreciate you going through the code.
I wanted to share some motivation behind why this work was done. Attached are 3 images. Each image contains performance profiles for a multi-device write, across different thread pool configurations. This is for Falcon 7B Prefill on T3000. The writes correspond to sending Attention Masks and Input ID tensors to each device, respectively.

A small model was used for illustration purposes, but the same pattern applies across the board.

Boost Thread-Pool
[performance profile image]
@dmakoviichuk-tt: We spawn a thread-pool with 8 threads. Only 5 get used for writing the Attention Masks. 4 get used for writing Input IDs. The rest sleep.
Average Write Time (Attention Masks): 162us
Average Write Time (Input IDs): 97us

Custom Thread-Pool with a Vector of Boost Thread Pools (1 Thread Each)
[performance profile image]
We can now evenly distribute work across all threads - no starvation.
Average Write Time (Attention Masks): 149us
Average Write Time (Input IDs): 58us

Thread-Pool from this PR
[performance profile image]
Average Write Time (Attention Masks): 81us
Average Write Time (Input IDs): 48us

Does this matter for End to End Performance?
It is reasonable to ask whether this affects end to end model performance - it's only a few microseconds, after all.
Listed below are the end to end tokens/second we get from each implementation (collected across 5 runs).

Boost Thread-Pool

inference throughput prefill | seq_len=128 : 11222.27964 tok/s
inference throughput prefill | seq_len=128 : 9363.72879 tok/s
inference throughput prefill | seq_len=128 : 11269.54059 tok/s
inference throughput prefill | seq_len=128 : 9640.71464 tok/s
inference throughput prefill | seq_len=128 : 9462.07485 tok/s

Variation: 17%, Min: 9364 tok/s, Max: 11269 tok/s

Custom Thread-Pool with a Vector of Boost Thread Pools (1 Thread Each)

inference throughput prefill | seq_len=128 : 9527.34323 tok/s
inference throughput prefill | seq_len=128 : 9498.29659 tok/s
inference throughput prefill | seq_len=128 : 10881.21348 tok/s
inference throughput prefill | seq_len=128 : 11137.66412 tok/s
inference throughput prefill | seq_len=128 : 9739.78652 tok/s

Variation: 15%, Min: 9498 tok/s, Max: 11137 tok/s

Thread-Pool from this PR

inference throughput prefill | seq_len=128 : 11168.28509 tok/s
inference throughput prefill | seq_len=128 : 11297.39925 tok/s
inference throughput prefill | seq_len=128 : 11288.76431 tok/s
inference throughput prefill | seq_len=128 : 11352.62253 tok/s
inference throughput prefill | seq_len=128 : 11063.24177 tok/s

Variation: 3%, Min: 11063 tok/s, Max: 11353 tok/s

Why does this matter?
The numbers above are from the Falcon 7B demo we have on main. A 15-17% variation in end to end performance is unacceptable.
Additionally, the numbers above imply that if we were to mainline TT-Mesh backed models using the suggested multi-threading approaches, performance tests would fail non-deterministically. This is a blocker for our project.
Further, the overall performance of the first two implementations is subpar compared to main. The final implementation is at parity. This is another blocker for using the suggestions provided.

How can we ensure that a custom implementation is not buggy?
This implementation does not need to be mainlined today. Even if it is, it will not be used by any workload except our directed tests.
We propose merging this implementation into our TT-Mesh integration branch. This branch runs every multi-chip test through the new backend, thus stress testing the thread-pool. We automatically run tests every time we push.

The implementation can then be mainlined along with the rest of our work. We will run all stress tests and model tests before merging to main.

In the meantime, I can suggest 2 things:

  1. Finding a third-party implementation that fits our use case perfectly and gives us the same performance as the custom thread-pool.
  2. Reviewing this implementation and having an open discussion about it. I would really appreciate this, since it will make our code better and help find any latent bugs.

I hope this answers any outstanding questions. Thanks again everyone :)

@Denys88 commented Mar 7, 2025

@tt-asaigal could you show your code? How do you run tasks?
Do you push all tasks from the same thread?

If you claim that the Boost thread pool doesn't use all threads, you can create a simple standalone example showing this issue.
If you cannot reproduce it, there might be a bug in the code.
So you see a real e2e model difference from a few microseconds?

@@ -478,7 +479,9 @@ void MeshCommandQueue::enqueue_write_shards(
     });

     for (std::size_t shard_idx = 0; shard_idx < shard_data_transfers.size(); shard_idx++) {
-        dispatch_thread_pool_->enqueue([&dispatch_lambda, shard_idx]() { dispatch_lambda(shard_idx); });
+        dispatch_thread_pool_->enqueue(
+            [&dispatch_lambda, shard_idx]() { dispatch_lambda(shard_idx); },
@Denys88 (Mar 7, 2025)

As I understand it: you create a dispatch lambda which is a std::function, then pass it by reference as a capture parameter to another lambda, which becomes another std::function inside of enqueue. Making dispatch_lambda just a lambda might save you some time.
@cfjchu @omilyutin-tt ptal.
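As a toy illustration of the double type-erasure being described (PoolLike is a stand-in, not the PR's ThreadPool, and the dispatch bodies are elided):

#include <cstddef>
#include <cstdint>
#include <functional>

// Stand-in for the dispatch thread pool: it just runs the handed-in std::function inline.
struct PoolLike {
    void enqueue(std::function<void()>&& f, uint32_t /*thread_idx*/) { f(); }
};

void illustrate(PoolLike& pool, std::size_t shard_count) {
    // Current code: the dispatch logic is stored in a std::function, so each call goes
    // through one layer of type erasure, and the wrapper handed to enqueue is erased again.
    std::function<void(std::size_t)> dispatch_fn = [](std::size_t) { /* dispatch one shard */ };
    (void)dispatch_fn;  // kept only for contrast with the plain-lambda version below

    // Suggestion: keep it a plain lambda; only the wrapper passed to enqueue is type-erased.
    auto dispatch_lambda = [](std::size_t) { /* dispatch one shard */ };

    for (std::size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
        pool.enqueue([&dispatch_lambda, shard_idx]() { dispatch_lambda(shard_idx); }, /*thread_idx=*/0);
    }
}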

Contributor Author

Removing the std::function and making it a raw lambda is a good optimization, though it doesn't affect the performance numbers I posted above. The function gets allocated once and reused across enqueue calls.

@@ -23,7 +187,7 @@ class BoostThreadPool : public ThreadPool {

     ~BoostThreadPool() noexcept override = default;

-    void enqueue(std::function<void()>&& f) override {
+    void enqueue(std::function<void()>&& f, uint32_t thread_idx) override {
         std::packaged_task<void()> task(std::move(f));

We can optimize this code and compare it vs your solution.
Let's forget about managing futures and just call pool.join() in the wait().
For an experiment, you can just remove all the future-related things, for example.

Contributor Author

pool.join() kills the thread pool. It will be unusable once you call join. Identical semantics to thread.join().


then use pool.wait()

Contributor Author

When looking at the docs, I learnt that both APIs behave the same way: https://live.boost.org/doc/libs/master/doc/html/boost_asio/reference/thread_pool/wait.html.
I have also tested this myself. Here is the branch, with your suggestion: asaigal/thread_pool_experiments

The following test will fail:
./build_Release/test/tt_metal/distributed/distributed_unit_tests_wormhole_b0 --gtest_filter="*ThreadPoolTest*"

If you remove the first wait, it will pass.

Boost does not provide a light-weight API to synchronize threads. We need this for our implementation, which is why we brought up our own class.
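For context, a minimal sketch of the kind of lightweight synchronization being referred to (an in-flight task counter plus a condition variable, so the pool stays reusable after wait()); this illustrates the general technique only and is not the code in this PR.

#include <condition_variable>
#include <cstdint>
#include <mutex>

class TaskTracker {
public:
    // Call before pushing a task to a worker.
    void on_submit() {
        std::lock_guard<std::mutex> lk(m_);
        ++in_flight_;
    }
    // Call from the worker after a task finishes.
    void on_complete() {
        std::lock_guard<std::mutex> lk(m_);
        if (--in_flight_ == 0) cv_.notify_all();
    }
    // Blocks until all submitted tasks have completed; the pool remains usable afterwards.
    void wait() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this]() { return in_flight_ == 0; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    uint64_t in_flight_ = 0;
};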

@tt-asaigal (Contributor Author) commented Mar 7, 2025

Hey @dmakoviichuk-tt, the implementations for each thread-pool are pushed to the following branches:

  1. Pure Boost Thread Pool is on main.
    Please see: https://github.com/tenstorrent/tt-metal/blob/main/tt_metal/common/thread_pool.cpp and https://github.com/tenstorrent/tt-metal/blob/main/tt_metal/common/thread_pool.hpp

  2. Custom Thread Pool using a vector of Boost threads: https://github.com/tenstorrent/tt-metal/tree/asaigal/thread_pool_experiments_2.
    Please see: https://github.com/tenstorrent/tt-metal/blob/asaigal/thread_pool_experiments_2/tt_metal/common/thread_pool.cpp and https://github.com/tenstorrent/tt-metal/blob/asaigal/thread_pool_experiments_2/tt_metal/common/thread_pool.hpp

  3. Entirely custom implementation: asaigal/device_bound_thread_pool

All tasks are pushed by the main thread.

A standalone example using Boost has been pushed to https://github.com/tenstorrent/tt-metal/tree/asaigal/thread_pool_experiments. You can try this out and verify the behaviour on your end as well.
We used 8 threads for our experiments with end to end performance. The standalone example can be modified to do this as well. Please run:
./build_Release/test/tt_metal/distributed/distributed_unit_tests_wormhole_b0 --gtest_filter="*ThreadPoolTest*" and remove the first call to wait.

Here is the Falcon 7B test:
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest --disable-warnings -q -s --input-method=json --input-path='models/demos/t3000/falcon7b/input_data_t3000.json' models/demos/t3000/falcon7b/demo_t3000.py -k "perf_mode_128_stochastic_verify and -8-"

Our integration branch for testing model performance is: jchu/ttnn-integration-with-mesh
