Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel profiler based NoC event tracing #16541

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

bgrady-tt
Copy link
Contributor

Problem Description

This PR adds support for recording detailed traces of all NoC activity initiated by Tensix worker cores. This data is then written to a JSON format file. These JSON traces can be either analyzed directly (as a sort of log file) or consumed by either flows (e.g. used to verify software noc performance estimator).

What's Changed

NoC tracing builds on top of the timestamped data packets (PacketType::TS_DATA) feature in the kernel profiler.

NoC tracing can be enabled on top of normal kernel profiling by setting the ENV variable TT_METAL_DEVICE_PROFILER_NOC_EVENTS=1.

A timestamped packet is recorded on each worker core for each call to dataflow_api.h:noc_async_* functions. All arguments to the dataflow_api.h function are bit-packed into the TS_DATA payload. This is done by adding new stripped-by-default macros to each noc_async* call of interest in dataflow_api.h.

  // example of RECORD macro within dataflow_api.h
  template <bool enable_noc_tracing = true>
  inline void noc_async_read(...) {
      // if TT_METAL_DEVICE_PROFILER_NOC_EVENTS != 1, the following macro is stripped out
      RECORD_NOC_EVENT_WITH_ADDR(...);
      // noc_async_read continues unmodified here ... 
  }

For situations where it is impossible/undesirable for noc_async_* calls to be profiled, template arguments have been added to all effected sites so that noc tracing can be disabled at the call site. This is done within the kernel profiler itself to avoid a circular dependency.

  noc_async_write<false>(...); // noc tracing is stripped out at compile time at this call site

Concerns

  • kernel_profiler::quick_push() and supporting code has been modified to compile for BRISC and NCRISC, even if dispatch is false.
  • kernel_profiler::timestampedData() has been modified to check if bufferHasRoom; if there is no room, quick_push() is automatically called to flush the buffer to make space.
  • Considerable overlap with existing features like Watcher Waypoints and NoC address sanitizer.

As far as I can tell (informal testing), regular kernel profiling still works as expected (with or without noc event profiling being enabled).

Despite the overlap with existing profiler-based features, there is meaningful differentiation here. Capturing all of the details of each NoC transaction (packet size, virtual channel, timestamp) is useful (probably to many different teams) and currently not accomplished by other methods.

Checklist

  • Post commit CI passes
    • Post commit CI passes before rebase to latest origin/main

@bgrady-tt bgrady-tt self-assigned this Jan 8, 2025
@bgrady-tt bgrady-tt marked this pull request as ready for review January 8, 2025 22:43
@bgrady-tt bgrady-tt force-pushed the bgrady/squashed-noc-trace branch from ecd2252 to bd1dd4d Compare March 4, 2025 15:01
bgrady-tt added a commit that referenced this pull request Mar 4, 2025
@bgrady-tt bgrady-tt force-pushed the bgrady/squashed-noc-trace branch from bd1dd4d to cc919af Compare March 4, 2025 15:05
bgrady-tt added a commit that referenced this pull request Mar 6, 2025
@bgrady-tt bgrady-tt force-pushed the bgrady/squashed-noc-trace branch from cc919af to 5d3f02b Compare March 6, 2025 15:27
bgrady-tt added a commit that referenced this pull request Mar 6, 2025
@bgrady-tt bgrady-tt force-pushed the bgrady/squashed-noc-trace branch from 5d3f02b to 0b788f8 Compare March 6, 2025 15:33
@bgrady-tt bgrady-tt force-pushed the bgrady/squashed-noc-trace branch from 0b788f8 to 3d569a8 Compare March 6, 2025 16:19
Copy link
Contributor

@mo-tenstorrent mo-tenstorrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -69,7 +69,12 @@ constexpr static std::uint32_t PROFILER_L1_BUFFER_SIZE = PROFILER_L1_VECTOR_SIZE

} // namespace kernel_profiler

#if defined(TRACY_ENABLE) && defined(DEVICE_PROFILER_OP_SUPPORT_COUNT_OVERRIDE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice !! Thanks.

if constexpr (requires { device_operation_t::get_type_name(operation_attributes); }) {
opName = device_operation_t::get_type_name(operation_attributes);
}
runtime_id_to_opname.emplace({device_id, program.get_runtime_id()}, opName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how get_type_name is implemented for the op, we might acquire considerable perf hit on cached runs.

Can we push the emplace to right before the return and lookup opName from cached_ops. It will be filled by the end of this function. We do need a local return variable this way.

}

void DeviceProfiler::logPacketDataToCSV(
const IDevice* device,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if getPhysicalAddressFromVirtual takes device_id instead of IDevice*. We don't need to pass IDevice down from readRiscProfilerResults

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants