
DRAM: more cold functions #9850

Open · wants to merge 4 commits into base: main
Conversation

@lyakh (Collaborator) commented Feb 21, 2025

Move all of IPC and some initialisation code to DRAM.

Move several initialisation functions to run from DRAM directly.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
@marcinszkudlinski (Contributor) commented:
I understand why init functions should go to DRAM, but why IPC?

@lyakh (Collaborator, author) commented Feb 24, 2025

> I understand why init functions should go to DRAM, but why IPC?

@marcinszkudlinski the idea is that only audio processing is "hot" - only schedulers and audio processing threads. Everything else can be "cold", and IPC processing is one such large code area. But if you have concerns that this can break something, let's discuss; maybe we're overlooking some use-cases?

@marcinszkudlinski (Contributor) commented:
@lyakh not really.
We're already facing performance problems: when starting multiple sophisticated pipelines, some LL cycles get lost because of long operations like "prepare" for each component.
We need to be careful about what goes to DRAM: it is slower, and worse, the access time is not guaranteed, as the physical memory is shared with Linux/Windows/Chrome and our requests go last.

I think: as long as we have enough HPSRAM, use it.

@abonislawski (Member) commented:
The IPC part looks really suspicious. Do you have any data on the gain and the performance drop? Especially when the main CPU is under high load, we will lag more with DRAM access.

@lgirdwood (Member) commented:
HPSRAM is precious; agreed, we need to be really careful about what we put in DRAM. It should only be the parts of IPC that are not time critical, i.e. trigger is time critical, but load module is not. We need to find this balance. Linux only really cares about the prepare()/trigger() driver ops and any associated IPCs. Don't know about Windows?
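To make the trade-off concrete, here is a minimal sketch of the mechanism under discussion: a `__cold` attribute steers slow-path functions into a DRAM-backed linker section. The macro spelling, section name, and the `module_load()` example are illustrative assumptions, not the exact SOF definitions.

```c
#include <stdint.h>

/* Illustrative sketch, assuming the linker script maps the
 * ".cold.text" output section into DRAM while the default text
 * section stays in HPSRAM. Not the exact SOF macros. */
#ifdef CONFIG_COLD_STORE_EXECUTE_DRAM
#define __cold __attribute__((__section__(".cold.text")))
#else
#define __cold
#endif

/* Loading a module is not on the trigger path, so slower, less
 * predictable DRAM instruction fetches are acceptable here. */
__cold static int module_load(uint32_t module_id)
{
	(void)module_id;
	/* ...slow-path initialisation work... */
	return 0;
}
```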

@lgirdwood (Member) left a review:

Some functions are really obvious pipeline construction/free APIs, but some utility APIs could be used in the stream triggering flow. Best to check.

@lyakh (Collaborator, author) commented Feb 25, 2025

@lgirdwood @marcinszkudlinski @abonislawski as far as I understand, the worst case would be when we're running close to 100% of performance capacity and at that moment the user issues some IPCs, maybe to start an additional light stream. In principle we still have a couple of free DSP cycles to run an additional stream, but while preparing it, IPC processing adds significant DSP load. So if we process IPCs in DRAM, that processing becomes slower. As long as we don't disable interrupts during IPC processing for too long, we still shouldn't disturb the higher priority audio processing running in parallel, but the IPC response time will become longer. Is that what we're worried about? Is that important? Replying to @marcinszkudlinski: do we really lose LL cycles because of IPC processing? That shouldn't happen AFAICS. If we have code locking interrupts, we have to identify and improve it...

@lgirdwood (Member) commented:
> Replying to @marcinszkudlinski: do we really lose LL cycles because of IPC processing? That shouldn't happen AFAICS. If we have code locking interrupts, we have to identify and improve it...

We don't lose LL cycles, since LL preempts low priority workloads/threads (even if the workload's TEXT is in DRAM, its stack/heap will be in SRAM). @jsarha can you share some data soon? Thanks.
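To illustrate the preemption argument: in a preemptive kernel the LL scheduler runs at a higher priority than the IPC task, so a slow, DRAM-resident IPC handler is simply preempted instead of stealing LL cycles. A hedged sketch using the Zephyr thread API; the priority value, stack size, and function names are assumptions, not SOF's actual configuration.

```c
#include <zephyr/kernel.h>

/* Assumed priority: the IPC task is preemptible and sits below the
 * (cooperative) LL scheduler, so LL work always wins the CPU. */
#define IPC_PRIO	K_PRIO_PREEMPT(4)

K_THREAD_STACK_DEFINE(ipc_stack, 4096);
static struct k_thread ipc_thread;

static void ipc_task(void *p1, void *p2, void *p3)
{
	for (;;) {
		/* Process IPC messages here. This code may execute from
		 * DRAM, but the LL thread preempts it at any time, and
		 * the task's stack and heap still live in SRAM. */
	}
}

void ipc_task_start(void)
{
	k_thread_create(&ipc_thread, ipc_stack,
			K_THREAD_STACK_SIZEOF(ipc_stack),
			ipc_task, NULL, NULL, NULL,
			IPC_PRIO, 0, K_NO_WAIT);
}
```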

@kv2019i (Collaborator) left a review:

Hmm, @lgirdwood you mention in the comments that "trigger is time critical, but load module is not time critical". The current PR doesn't seem to make any provision to keep trigger-related code in hot memory. Not sure how to review this; is this intentional or not?

```diff
 {
 	struct ipc4_message_request in;

 	in.primary.dat = msg_data.msg_in.pri;
 	ipc_compound_msg_done(in.primary.r.type, reply->error);
 }

-void ipc_cmd(struct ipc_cmd_hdr *_hdr)
+__cold void ipc_cmd(struct ipc_cmd_hdr *_hdr)
```
@kv2019i (Collaborator) commented on the diff:

If we want to separate trigger/start from less timing critical IPCs, then we need to keep this top-level ipc_cmd as warm.
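One way to realise that split, sketched below: the top-level dispatcher stays warm and only branches into `__cold` code for messages outside the trigger path. Only `ipc_cmd()` and `SOF_IPC4_GLB_SET_PIPELINE_STATE` come from the code under review and the IPC4 headers; `ipc4_msg_type()`, `ipc_cmd_trigger()`, and `ipc_cmd_slow()` are hypothetical names, and `__cold` is the macro sketched earlier.

```c
#include <stdint.h>

struct ipc_cmd_hdr;	/* opaque header type, as in the PR diff */

/* Hypothetical helpers for illustration only. */
uint32_t ipc4_msg_type(struct ipc_cmd_hdr *hdr);
void ipc_cmd_trigger(struct ipc_cmd_hdr *hdr);		/* warm */
__cold void ipc_cmd_slow(struct ipc_cmd_hdr *hdr);	/* DRAM */

/* Warm top-level dispatcher: the latency-sensitive pipeline state
 * change (trigger) never touches cold code. */
void ipc_cmd(struct ipc_cmd_hdr *_hdr)
{
	if (ipc4_msg_type(_hdr) == SOF_IPC4_GLB_SET_PIPELINE_STATE) {
		ipc_cmd_trigger(_hdr);
		return;
	}

	ipc_cmd_slow(_hdr);	/* module load, config, etc. */
}
```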

A Member replied:

Yes, I think we should keep most important ipc funcs in HPSRAM, especially the whole trigger path

@lgirdwood (Member) replied:

The IPC path runs outside of the LL context and in most cases is not time sensitive, with the exception of the trigger path. @lyakh can you update to remove the trigger path from __cold and add the new debug API to the non-trigger calls?

@jsarha (Contributor) commented Feb 27, 2025

> Replying to @marcinszkudlinski: do we really lose LL cycles because of IPC processing? That shouldn't happen AFAICS.
>
> We don't lose LL cycles, since LL preempts low priority workloads/threads (even if the workload's TEXT is in DRAM, its stack/heap will be in SRAM). @jsarha can you share some data soon? Thanks.

[Screenshot at 2025-02-27 16-22-11: MCPS measurement plot]

There is indeed some impact on MCPS, at least in 44.1 kHz playback through SRC. SRC playback was chosen because it's readily available in the nocodec topology and SRC has a lot of __cold tagged functions in its configuration code. In addition to this PR I also merged #9844 on top of it. The test is a 5 min 44.1 kHz playback using the branch built with xcc, with both CONFIG_COLD_STORE_EXECUTE_DRAM=n and =y. It was run on an LNL RVP using the nocodec topology. The original mtrace files are here:
testb-dram-y-hw02-300s-mtrace.log
testb-dram-n-hw02-300s-mtrace.log

@lgirdwood (Member) commented:
> There is indeed some impact on MCPS, at least in 44.1 kHz playback through SRC. [...]

Thanks @jsarha - there is a 20 kcps delta with DRAM=y and this PR on LNL. I think the peaks are related to L1 exit work, and I think the 20 kcps is due to the relocatable code used for llext. @lyakh do you concur?
@jsarha btw, can you upstream the script that scrapes the logs and produces the plots? :)

@abonislawski (Member) commented:
> I think the peaks are related to L1 exit work

In general, peaks are a bigger problem than the average for DRAM, due to access-time instability (compared to "flat" HPSRAM).

@jsarha could you please repeat your test with added main CPU (and memory controller) load? (Try a large fft and the smallest fft.)
Currently it shows the best-case scenario with very low CPU/memory load; with the fft test it will show the worst-case scenario.
It is very important to test both for DRAM because our latencies may vary significantly. (A minimal host-side load generator is sketched below.)
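A hypothetical host-side stand-in for the fft load mentioned above, not an established SOF test tool: sweeping a buffer much larger than the last-level cache keeps the host memory controller busy while the DSP test runs.

```c
/* memstress.c - hypothetical host-side memory-controller stressor.
 * Assumes a GCC/Clang host toolchain.
 * Build: cc -O2 memstress.c -o memstress */
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (512u * 1024u * 1024u)	/* 512 MiB, far above LLC size */

int main(void)
{
	unsigned char *buf = malloc(BUF_SIZE);

	if (!buf)
		return 1;

	for (;;) {
		memset(buf, 0xa5, BUF_SIZE);		/* write sweep */
		for (size_t i = 0; i < BUF_SIZE; i += 64)
			/* force the load, defeat dead-code elimination */
			__asm__ volatile("" : : "r"(buf[i]));
	}
}
```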

lyakh added 3 commits March 4, 2025 12:08

- Mark most IPC functions as "cold" to run them directly from DRAM. Explicitly avoid marking exported functions "cold", as well as those that are known to be, or can potentially be, called from hot code paths.

  Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>

- ipc4_pipeline_complete() returns POSIX error codes, not IPC4 ones.

  Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>

- struct module_interface::trigger() is called from both hot and cold contexts, therefore it usually shouldn't be marked as __cold.

  Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
@lyakh (Collaborator, author) commented Mar 4, 2025

> Thanks @jsarha - there is a 20 kcps delta with DRAM=y and this PR on LNL. I think the peaks are related to L1 exit work, and I think the 20 kcps is due to the relocatable code used for llext. @lyakh do you concur?

@lgirdwood I think we need some more research there

@lyakh (Collaborator, author) commented Mar 6, 2025

@lgirdwood @jsarha lightly checked on MTL with nocodec on Port2 playback (core 2; includes SRC, volume, and mixin, all of which have __cold code, and some also have __cold_rodata data) with no additional load: the difference remains within around 15 kcps.
UPDATE: as a counter-test I moved mixin_process() and mixout_process() to DRAM and ran the same test. This time processing jumped by 1550 kcps, more than 80%.
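For reference, __cold_rodata does for read-only data what __cold does for code. A sketch with an illustrative section name, not the exact SOF definition:

```c
#include <stdint.h>

/* Illustrative, not the exact SOF macro: read-only tables that are
 * consulted only on the configuration path can live in DRAM too. */
#define __cold_rodata __attribute__((__section__(".cold.rodata")))

/* SRC filter coefficients are read while configuring the component,
 * not on the per-sample processing path (values elided). */
__cold_rodata static const int32_t src_coefficients[] = { 0 /* ... */ };
```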

@lgirdwood (Member) commented:
> [...] UPDATE: as a counter-test I moved mixin_process() and mixout_process() to DRAM and ran the same test. This time processing jumped by 1550 kcps, more than 80%.

Ok, thanks @jsarha - that makes sense: mixin/mixout are not yet __cold enabled, so doing process() in DRAM will have that level of impact.

@lyakh (Collaborator, author) commented Mar 6, 2025

CI:

@lgirdwood (Member) left a review:

Let's remove the trigger path and add the new debug API.

