
Fix std::future status query #596

Open · wants to merge 10 commits into branch-25.04 from improve-future-query
Conversation

@kingcrimsontianyu (Contributor) commented Jan 26, 2025

Pending performance regression check!!

This PR fixes the std::future status query according to the discussion in #593.

  • The future objects of parallel I/O operations are no longer created by std::async with the deferred launch policy, which caused the future status to be falsely reported as done. Instead, as suggested, they are created by the underlying thread pool (more specifically, by its promise objects). This enables non-blocking future status queries.
  • Trivial I/O cases that do not involve task splitting are now performed synchronously (blocking), as suggested, instead of being evaluated lazily via std::async with the deferred launch policy.

@kingcrimsontianyu kingcrimsontianyu added bug Something isn't working non-breaking Introduces a non-breaking change c++ Affects the C++ API of KvikIO labels Jan 26, 2025
@kingcrimsontianyu kingcrimsontianyu self-assigned this Jan 26, 2025

copy-pr-bot bot commented Jan 26, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@kingcrimsontianyu (Contributor Author):

/ok to test

@kingcrimsontianyu (Contributor Author):

/ok to test

@kingcrimsontianyu (Contributor Author):

/ok to test

@vyasr (Contributor) commented Jan 27, 2025

Discussing further in #593 rather than here.

@kingcrimsontianyu kingcrimsontianyu changed the title Take deferred std::future into account for status query [Experimental] Take deferred std::future into account for status query Jan 27, 2025
@kingcrimsontianyu kingcrimsontianyu changed the base branch from branch-25.02 to branch-25.04 January 29, 2025 05:42
@kingcrimsontianyu (Contributor Author):

/ok to test

@kingcrimsontianyu kingcrimsontianyu changed the title [Experimental] Take deferred std::future into account for status query [Experimental] Fix std::future status query Jan 29, 2025
@kingcrimsontianyu kingcrimsontianyu added breaking Introduces a breaking change and removed non-breaking Introduces a non-breaking change labels Jan 29, 2025
@kingcrimsontianyu (Contributor Author):

/ok to test

@kingcrimsontianyu (Contributor Author):

/ok to test

@kingcrimsontianyu kingcrimsontianyu added non-breaking Introduces a non-breaking change and removed breaking Introduces a breaking change labels Jan 30, 2025
@madsbk (Member) left a comment:

Looks good @kingcrimsontianyu. The only question is if the overhead of creating the extra gather task is an issue. I suspect not, but it would be good to get it confirmed.

cc. @GregoryKimball, @vuule.

```cpp
/**
 * @brief Submit the move-only task callable to the underlying thread pool.
 *
 * @tparam F Callable type. F shall be move-only and have no argument.
```
@kingcrimsontianyu (Contributor Author) commented Jan 30, 2025

It is tempting to add the "can't copy" part to the code to improve type safety:

```cpp
template <typename F, std::enable_if_t<!std::is_copy_constructible_v<F>, bool> = true>
std::future<std::size_t> submit_move_only_task(F op_move_only);
```

But this does not work. For the non-copyable gather_tasks below, attempting to copy-construct produces a compile error as expected, yet the type trait gives an unexpected result:

```cpp
auto gather_tasks_copy = gather_tasks;  // Compile error, as expected
static_assert(std::is_copy_constructible_v<decltype(gather_tasks)>);  // No compile error. Surprise.
```

A simpler example:

```cpp
std::vector<std::future<int>> move_only;
static_assert(std::is_copy_constructible_v<decltype(move_only)>);  // No compile error. Surprise.
```

@kingcrimsontianyu kingcrimsontianyu force-pushed the improve-future-query branch 2 times, most recently from 41afba6 to 94c266e Compare February 1, 2025 03:27
@vyasr (Contributor) commented Feb 4, 2025

I think this looks good overall to me now. Is it ready for a thorough review? I'm not sure if it is intentionally still in draft or not.

@kingcrimsontianyu kingcrimsontianyu changed the title [Experimental] Fix std::future status query Fix std::future status query Feb 4, 2025
@kingcrimsontianyu (Contributor Author):

Thanks for the reminder. Other than pending performance assessment, this PR should be ready for review.

@kingcrimsontianyu kingcrimsontianyu marked this pull request as ready for review February 4, 2025 15:56
@kingcrimsontianyu kingcrimsontianyu requested a review from a team as a code owner February 4, 2025 15:56
@kingcrimsontianyu (Contributor Author):

I'll use cuDF bench for performance assessment, and post the results here when ready.

@vyasr (Contributor) left a comment:

Thanks! I'll leave it to you to verify performance before merging.

```diff
@@ -152,6 +153,7 @@ std::tuple<void*, std::size_t, std::size_t> get_alloc_info(void const* devPtr,
 template <typename T>
 bool is_future_done(T const& future)
 {
+  assert(future.valid());
```
@vyasr (Contributor):

Since this is a public API, do we want to throw if this condition is violated or are we happy with a debug assertion?

@kingcrimsontianyu (Contributor Author):

Thanks! I think we do need to check the pre-condition to avoid UB. Done.

@kingcrimsontianyu (Contributor Author) commented Feb 12, 2025

Performance comparison using libcudf's benchmark.

Setup

parquet_read_io_compression

| Configuration | CPU (s), 2 thr | Noise | Time incr. | CPU (s), 4 thr | Noise | Time incr. | CPU (s), 8 thr | Noise | Time incr. | CPU (s), 16 thr | Noise | Time incr. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cold cache, this PR | 0.102 | 0.029 | 0.51% | 0.102 | 0.029 | 0.51% | 0.097 | 0.010 | -0.69% | 0.099 | 0.015 | -0.45% |
| cold cache, branch-25.04 | 0.102 | 0.028 | — | 0.102 | 0.028 | — | 0.097 | 0.022 | — | 0.099 | 0.021 | — |
| hot cache, this PR | 0.054 | 0.043 | 0.22% | 0.046 | 0.036 | -1.49% | 0.044 | 0.030 | -1.99% | 0.045 | 0.026 | -0.53% |
| hot cache, branch-25.04 | 0.054 | 0.010 | — | 0.047 | 0.041 | — | 0.045 | 0.047 | — | 0.045 | 0.035 | — |

orc_read_io_compression

| Configuration | CPU (s), 2 thr | Noise | Time incr. | CPU (s), 4 thr | Noise | Time incr. | CPU (s), 8 thr | Noise | Time incr. | CPU (s), 16 thr | Noise | Time incr. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cold cache, this PR | 0.153 | 0.041 | 10.71% | 0.128 | 0.014 | 16.61% | 0.141 | 0.068 | 22.95% | 0.146 | 0.084 | 17.35% |
| cold cache, branch-25.04 | 0.138 | 0.013 | — | 0.110 | 0.023 | — | 0.115 | 0.023 | — | 0.124 | 0.037 | — |
| hot cache, this PR | 0.068 | 0.038 | 6.38% | 0.061 | 0.052 | 0.29% | 0.062 | 0.025 | 1.00% | 0.064 | 0.044 | 2.02% |
| hot cache, branch-25.04 | 0.063 | 0.010 | — | 0.061 | 0.051 | — | 0.062 | 0.032 | — | 0.063 | 0.034 | — |

json_read_io

| Configuration | CPU (s), 2 thr | Noise | Time incr. | CPU (s), 4 thr | Noise | Time incr. | CPU (s), 8 thr | Noise | Time incr. | CPU (s), 16 thr | Noise | Time incr. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cold cache, this PR | 0.556 | 0.031 | 3.20% | 0.451 | 0.005 | -3.20% | 0.460 | 0.055 | 2.41% | 0.459 | 0.045 | -6.99% |
| cold cache, branch-25.04 | 0.539 | 0.008 | — | 0.466 | 0.037 | — | 0.449 | 0.034 | — | 0.494 | 0.013 | — |
| hot cache, this PR | 0.350 | 0.047 | 3.12% | 0.327 | 0.046 | -0.29% | 0.314 | 0.020 | 0.00% | 0.323 | 0.052 | 0.12% |
| hot cache, branch-25.04 | 0.339 | 0.006 | — | 0.328 | 0.054 | — | 0.314 | 0.024 | — | 0.322 | 0.041 | — |

csv_read_io

| Configuration | CPU (s), 2 thr | Noise | Time incr. | CPU (s), 4 thr | Noise | Time incr. | CPU (s), 8 thr | Noise | Time incr. | CPU (s), 16 thr | Noise | Time incr. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cold cache, this PR | 0.347 | 0.032 | 0.72% | 0.297 | 0.008 | -2.46% | 0.302 | 0.050 | 2.08% | 0.308 | 0.006 | -2.73% |
| cold cache, branch-25.04 | 0.345 | 0.016 | — | 0.305 | 0.043 | — | 0.296 | 0.011 | — | 0.317 | 0.039 | — |
| hot cache, this PR | 0.233 | 0.067 | 4.23% | 0.212 | 0.068 | -0.35% | 0.215 | 0.029 | 0.70% | 0.225 | 0.049 | 1.99% |
| hot cache, branch-25.04 | 0.224 | 0.017 | — | 0.212 | 0.028 | — | 0.213 | 0.045 | — | 0.221 | 0.051 | — |

@kingcrimsontianyu (Contributor Author):

Reran the orc_read_io_compression cold cache case with --min-samples 80 --timeout 60 to reduce the standard deviation:

| orc_read_io_compression, cold cache | CPU (s), 2 thr | Noise | Time incr. | CPU (s), 4 thr | Noise | Time incr. | CPU (s), 8 thr | Noise | Time incr. | CPU (s), 16 thr | Noise | Time incr. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| This PR | 0.152 | 0.027 | 11.84% | 0.128 | 0.008 | 16.05% | 0.137 | 0.049 | 18.82% | 0.142 | 0.054 | 14.97% |
| branch-25.04 | 0.136 | 0.012 | — | 0.110 | 0.026 | — | 0.115 | 0.024 | — | 0.124 | 0.023 | — |

@kingcrimsontianyu (Contributor Author):

Increasing the thread count N by 1, and comparing this PR at (N+1) threads vs branch-25.04 at (N) threads:

| orc_read_io_compression, cold cache, this PR | CPU Time (sec) | Noise | Time increase vs branch-25.04 |
|---|---|---|---|
| 5 threads (vs 4) | 0.131 | 0.034 | 19.43% |
| 9 threads (vs 8) | 0.140 | 0.052 | 21.32% |
| 17 threads (vs 16) | 0.138 | 0.017 | 11.96% |

@kingcrimsontianyu (Contributor Author) commented Feb 12, 2025

Terminology:

  • Small I/O: when the I/O size is less than the GDS threshold, the I/O is:
    • done lazily via std::async with the deferred launch policy on branch-25.04
    • done synchronously via make_ready_future in this PR
  • Parallel I/O: I/O performed using the thread pool. The query of the multiple futures is:
    • done lazily via std::async with the deferred launch policy on branch-25.04
    • submitted to the thread pool in this PR
| orc_read_io_compression, cold cache | CPU (s), 2 thr | Noise | Time incr. | CPU (s), 4 thr | Noise | Time incr. | CPU (s), 8 thr | Noise | Time incr. | CPU (s), 16 thr | Noise | Time incr. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Keep changes to parallel I/O only | 0.152 | 0.016 | 11.71% | 0.120 | 0.020 | 10.90% | 0.109 | 0.010 | -4.60% | 0.119 | 0.029 | -2.70% |
| Keep changes to small I/O only | 0.140 | 0.021 | 2.70% | 0.130 | 0.019 | 19.86% | 0.140 | 0.056 | 21.96% | 0.142 | 0.030 | 15.95% |
| branch-25.04 | 0.136 | 0.016 | — | 0.108 | 0.006 | — | 0.115 | 0.027 | — | 0.123 | 0.033 | — |

@madsbk (Member) commented Feb 17, 2025

@kingcrimsontianyu, so the benchmarks are saying that for large I/O the new PR is good, and for small I/O the new PR is bad?

Could this be an ordering issue? Consider a mix of small and large I/Os in an arbitrary order. In branch-25.04, all of the large I/Os are submitted to the thread pool before the small I/Os are executed, whereas in this PR the small I/Os are executed immediately, delaying the submission of the large I/Os that follow?


```cpp
// 1) Submit `task_size` sized tasks
```
Contributor:

How about (possibly simpler):

```cpp
std::vector<std::future<std::size_t>> tasks;
tasks.reserve(size / task_size);

while (size > task_size) {
  tasks.push_back(detail::submit_task(op, buf, task_size, file_offset, devPtr_offset));
  file_offset += task_size;
  devPtr_offset += task_size;
  size -= task_size;
}

auto last_task = [=, tasks = std::move(tasks)]() mutable -> std::size_t {
  auto ret = op(buf, size, file_offset, devPtr_offset);
  for (auto& task : tasks) {
    ret += task.get();
  }
  return ret;
};
return detail::submit_move_only_task(std::move(last_task));
}
```

But I have a question: why not also submit the final read task and just gather all the results in last_task?

@kingcrimsontianyu (Contributor Author):

This PR initially used the implementation you described: we submit the sequence of N tasks, and instead of using std::async(std::launch::deferred) to launch the gather task as the dev branch does, we submit the gather task to the thread pool's task queue. However, a performance regression was observed, somehow only on the ORC benchmark in libcudf. We noticed that if we let the last task perform the gather operation (for the preceding N-1 tasks), the regression disappears.

@kingcrimsontianyu (Contributor Author):

PS: The suggested code is much more concise. Replaced as suggested. Thanks!

@kingcrimsontianyu (Contributor Author):

@madsbk We had a sync last week, and yes, this is what we suspected was happening in the ORC benchmark. I'll do some profiling this week to hopefully better understand its I/O pattern.

@kingcrimsontianyu kingcrimsontianyu force-pushed the improve-future-query branch 3 times, most recently from e8455b0 to 359c19a Compare February 21, 2025 14:32