From 543b18dfaa3d60110147682bcb7f9107600a14d0 Mon Sep 17 00:00:00 2001
From: Florian Lehner
Date: Thu, 12 Dec 2024 09:18:50 +0100
Subject: [PATCH 1/4] specs - profiling integration: Make host.id in registration message optional

`host.id` is not a well-defined or uniquely defined attribute, see
https://github.com/open-telemetry/semantic-conventions/issues/581 for an example.
In particular, in containerized environments profiling agents see a different
`host.id` than APM-agents, which makes it harder to correlate information.

To be able to correlate profiling and APM information, `container.id` was
identified as the best fit for the use case, because profiling agents as well as
APM-agents already collect and send out `container.id` with their respective data.

For non-containerized environments `host.id` can still be used, and in such
use cases profiling agents and APM-agents should have the same understanding
of `host.id`.

For backwards compatibility reasons, just make the `host-id` argument in the
registration message optional.
---
 specs/agents/universal-profiling-integration.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/specs/agents/universal-profiling-integration.md b/specs/agents/universal-profiling-integration.md
index 4165084d..ec22a87e 100644
--- a/specs/agents/universal-profiling-integration.md
+++ b/specs/agents/universal-profiling-integration.md
@@ -179,10 +179,10 @@ All messages have the following layout:
 
 ## Profiler Registration Message
 
-Whenever the profiling host agent starts communicating for the first time with a process running an APM Agent, it MUST send this message.
-This message is used to let the APM-agent know that a profiler is actually active on the current host. Note that an APM-agent may receive this message zero, one or several times: this may happen if no host agent is active, if one is active or if a host agent is restarted during the lifetime of the APM-agent respectively.
+Whenever the profiling agent starts communicating for the first time with a process running an APM Agent, it MUST send this message.
+This message is used to let the APM-agent know that a profiler is actually active on the current host. Note that an APM-agent may receive this message zero, one or several times: this may happen if no profiling agent is active, if one is active or if a profiling agent is restarted during the lifetime of the APM-agent respectively.
 
-The *message-type* is `2` and the current *minor-version* is `1`.
+The *message-type* is `2` and the current *minor-version* is `2`.
 The payload layout is as follows:
 
 Name | Data type
@@ -190,8 +190,8 @@ Name | Data type
 samples-delay-ms | uint32
 host-id | utf8-str
 
-* *samples-delay-ms*: A sane upper bound of the usual time taken in milliseconds by the profiling host agent between the collection of a stacktrace and it being written to the apm-agent via the [messaging socket](#cpu-profiler-trace-correlation-message). The APM-agent will assume that all profiling data related to a span has been written to the socket if a span ended at least the provided duration ago. Note that this value doesn't need to be a hard a guarantee, but it should be the 99% case so that profiling data isn't distorted in the expected case.
-* *host-id*: The [`host.id` resource attribute](https://opentelemetry.io/docs/specs/semconv/attributes-registry/host/) used for the profiling data by this profiling host agent. If an APM-agent is already sending a `host.id` it MUST print a warning if the `host.id` is different and otherwise ignore the value received by the host agent. A mismatch will lead to certain correlation features (e.g. cost and CO2 consumption) not working. If an agent does not collect the `host.id` by itself, it MUST start sending the `host.id` after receiving it from the profiler host agent to ensure aforementioned correlation features work correctly.
+* *samples-delay-ms*: A sane upper bound of the usual time taken in milliseconds by the profiling agent between the collection of a stacktrace and it being written to the apm-agent via the [messaging socket](#cpu-profiler-trace-correlation-message). The APM-agent will assume that all profiling data related to a span has been written to the socket if a span ended at least the provided duration ago. Note that this value doesn't need to be a hard a guarantee, but it should be the 99% case so that profiling data isn't distorted in the expected case.
+* *host-id*: The [`host.id` resource attribute](https://opentelemetry.io/docs/specs/semconv/attributes-registry/host/) is an optional argument used to correlate profiling data by the profiling agent. If an APM-agent is already sending a `host.id` it MUST print a warning if the `host.id` is different and otherwise ignore the value received by the profiling agent. A mismatch will lead to certain correlation features (e.g. cost and CO2 consumption) not working. If an APM-agent does not collect the `host.id` by itself, it MUST start sending the `host.id` after receiving it from the profiling agent to ensure aforementioned correlation features work correctly.
 
 ## CPU Profiler Trace Correlation Message
 
@@ -236,7 +236,7 @@ For example, if for a single transaction the following correlation messages are
 the resulting transaction MUST have the OpenTelemetry attribute `elastic.profiler_stack_trace_ids` with a value of (elements in any order) `[YLQguzhR2dR6y5M9vnA5mw, YLQguzhR2dR6y5M9vnA5mw, TJMmu5gF-o-FiCwS6uckzg, YLQguzhR2dR6y5M9vnA5mw]`.
 
-Note that the [correlation messages](#cpu-profiler-trace-correlation-message) will arrive delayed relative to when they were sampled due to the processing delay of the profiling host agent and the transfer over the domain socket. APM agents therefore MUST defer sending ended transactions until they are relatively confident that all correlation messages for the transaction have arrived.
+Note that the [correlation messages](#cpu-profiler-trace-correlation-message) will arrive delayed relative to when they were sampled due to the processing delay of the profiling agent and the transfer over the domain socket. APM agents therefore MUST defer sending ended transactions until they are relatively confident that all correlation messages for the transaction have arrived.
 * When a [profiler registration message](#profiler-registration-message) has been received, APM agents SHOULD use the duration from that message as delay for transactions
 * If no [profiler registration message](#profiler-registration-message) has been received yet, APM agents SHOULD use a default of one second as reasonable default delay.
@@ -262,4 +262,4 @@ OpenTelemetry based agents SHOULD use the following configuration options:
 * `ELASTIC_OTEL_UNIVERSAL_PROFILING_INTEGRATION_BUFFER_SIZE`
- The size of the FIFO queue [used to buffer transactions](#correlation-attribute) until all correlation data has arrived. Should have a reasonable default to sustain typical transaction per second rates while not occupying too much memory in edge cases (e.g. 8096).
\ No newline at end of file
+ The size of the FIFO queue [used to buffer transactions](#correlation-attribute) until all correlation data has arrived. Should have a reasonable default to sustain typical transaction per second rates while not occupying too much memory in edge cases (e.g. 8096).
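
Review note on the delay rule in the hunk above: a minimal sketch of how an APM agent might pick the deferral delay, assuming the agent stores the last received `samples-delay-ms` as an optional integer. The function names, the `DEFAULT_DELAY_MS` constant and the millisecond timestamps are hypothetical and not part of the spec.

```python
from typing import Optional

# The spec's fallback when no registration message has been received yet:
# "a default of one second".
DEFAULT_DELAY_MS = 1000


def transaction_delay_ms(registered_samples_delay_ms: Optional[int]) -> int:
    """Return how long an ended transaction should be held back.

    registered_samples_delay_ms is the samples-delay-ms value from the most
    recent profiler registration message, or None if none has arrived.
    """
    if registered_samples_delay_ms is None:
        return DEFAULT_DELAY_MS
    return registered_samples_delay_ms


def ready_to_send(ended_at_ms: int, now_ms: int,
                  registered_samples_delay_ms: Optional[int]) -> bool:
    """True once the transaction ended at least the chosen delay ago."""
    delay = transaction_delay_ms(registered_samples_delay_ms)
    return now_ms - ended_at_ms >= delay
```

For example, `ready_to_send(0, 999, None)` stays false until the one-second default has elapsed, while a registered delay of 250 ms releases the transaction earlier.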
From db9108d468b6de0c5bf925fbc7e8d7a1be70afa1 Mon Sep 17 00:00:00 2001
From: Florian Lehner
Date: Thu, 12 Dec 2024 11:26:25 +0100
Subject: [PATCH 2/4] Update specs/agents/universal-profiling-integration.md

Co-authored-by: Jonas Kunz
---
 specs/agents/universal-profiling-integration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/specs/agents/universal-profiling-integration.md b/specs/agents/universal-profiling-integration.md
index ec22a87e..9653b00d 100644
--- a/specs/agents/universal-profiling-integration.md
+++ b/specs/agents/universal-profiling-integration.md
@@ -182,7 +182,7 @@ All messages have the following layout:
 Whenever the profiling agent starts communicating for the first time with a process running an APM Agent, it MUST send this message.
 This message is used to let the APM-agent know that a profiler is actually active on the current host. Note that an APM-agent may receive this message zero, one or several times: this may happen if no profiling agent is active, if one is active or if a profiling agent is restarted during the lifetime of the APM-agent respectively.
 
-The *message-type* is `2` and the current *minor-version* is `2`.
+The *message-type* is `2` and the current *minor-version* is `1`.
 The payload layout is as follows:
 
 Name | Data type

From 64941db727c0a953efdd679b48384970cee28deb Mon Sep 17 00:00:00 2001
From: Florian Lehner
Date: Thu, 12 Dec 2024 11:26:41 +0100
Subject: [PATCH 3/4] Update specs/agents/universal-profiling-integration.md

Co-authored-by: Jonas Kunz
---
 specs/agents/universal-profiling-integration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/specs/agents/universal-profiling-integration.md b/specs/agents/universal-profiling-integration.md
index 9653b00d..1fb193da 100644
--- a/specs/agents/universal-profiling-integration.md
+++ b/specs/agents/universal-profiling-integration.md
@@ -191,7 +191,7 @@ samples-delay-ms | uint32
 host-id | utf8-str
 
 * *samples-delay-ms*: A sane upper bound of the usual time taken in milliseconds by the profiling agent between the collection of a stacktrace and it being written to the apm-agent via the [messaging socket](#cpu-profiler-trace-correlation-message). The APM-agent will assume that all profiling data related to a span has been written to the socket if a span ended at least the provided duration ago. Note that this value doesn't need to be a hard a guarantee, but it should be the 99% case so that profiling data isn't distorted in the expected case.
-* *host-id*: The [`host.id` resource attribute](https://opentelemetry.io/docs/specs/semconv/attributes-registry/host/) is an optional argument used to correlate profiling data by the profiling agent. If an APM-agent is already sending a `host.id` it MUST print a warning if the `host.id` is different and otherwise ignore the value received by the profiling agent. A mismatch will lead to certain correlation features (e.g. cost and CO2 consumption) not working. If an APM-agent does not collect the `host.id` by itself, it MUST start sending the `host.id` after receiving it from the profiling agent to ensure aforementioned correlation features work correctly.
+* *host-id*: The [`host.id` resource attribute](https://opentelemetry.io/docs/specs/semconv/attributes-registry/host/) is an optional argument (the string may have a length of zero) used to correlate profiling data by the profiling agent. If an APM-agent is already sending a `host.id` it MUST print a warning if the `host.id` is different and otherwise ignore the value received by the profiling agent. A mismatch will lead to certain correlation features (e.g. cost and CO2 consumption) not working. If an APM-agent does not collect the `host.id` by itself, it MUST start sending the `host.id` after receiving a non-empty `host.id` from the profiling agent to ensure aforementioned correlation features work correctly.
 
 ## CPU Profiler Trace Correlation Message

From 02432a65e4fde3fad5529ce725bcec5dbebaeb9b Mon Sep 17 00:00:00 2001
From: Florian Lehner
Date: Fri, 3 Jan 2025 10:03:28 +0100
Subject: [PATCH 4/4] Update specs/agents/universal-profiling-integration.md

Co-authored-by: Christos Kalkanis
---
 specs/agents/universal-profiling-integration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/specs/agents/universal-profiling-integration.md b/specs/agents/universal-profiling-integration.md
index 1fb193da..519dc09b 100644
--- a/specs/agents/universal-profiling-integration.md
+++ b/specs/agents/universal-profiling-integration.md
@@ -190,7 +190,7 @@ Name | Data type
 samples-delay-ms | uint32
 host-id | utf8-str
 
-* *samples-delay-ms*: A sane upper bound of the usual time taken in milliseconds by the profiling agent between the collection of a stacktrace and it being written to the apm-agent via the [messaging socket](#cpu-profiler-trace-correlation-message). The APM-agent will assume that all profiling data related to a span has been written to the socket if a span ended at least the provided duration ago. Note that this value doesn't need to be a hard a guarantee, but it should be the 99% case so that profiling data isn't distorted in the expected case.
+* *samples-delay-ms*: A sane upper bound for the time taken in milliseconds by the profiling agent between the collection of a stacktrace and it being written to the APM agent via the [messaging socket](#cpu-profiler-trace-correlation-message). The APM agent will assume that all profiling data related to a span has been written to the socket if a span ended at least the provided duration ago. Note that this value doesn't need to be a hard guarantee, but it should be the 99% case so that profiling data isn't distorted in the expected case.
 * *host-id*: The [`host.id` resource attribute](https://opentelemetry.io/docs/specs/semconv/attributes-registry/host/) is an optional argument (the string may have a length of zero) used to correlate profiling data by the profiling agent. If an APM-agent is already sending a `host.id` it MUST print a warning if the `host.id` is different and otherwise ignore the value received by the profiling agent. A mismatch will lead to certain correlation features (e.g. cost and CO2 consumption) not working. If an APM-agent does not collect the `host.id` by itself, it MUST start sending the `host.id` after receiving a non-empty `host.id` from the profiling agent to ensure aforementioned correlation features work correctly.
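
Review note on the *host-id* rules in the bullet above: a minimal sketch of the decision an APM agent might make when a registration message arrives. The function name, logger name and return convention are hypothetical. It also assumes that an empty `host-id` means "not provided" and therefore does not count as a mismatch; the spec text does not state this explicitly.

```python
import logging
from typing import Optional

logger = logging.getLogger("apm.profiling")  # hypothetical logger name


def resolve_host_id(own_host_id: Optional[str],
                    received_host_id: str) -> Optional[str]:
    """Return the host.id the APM agent should send from now on.

    own_host_id: the value the agent collects itself, or None if it
        does not collect host.id.
    received_host_id: the host-id field of the registration message;
        may be the empty string because the argument is optional.
    """
    if own_host_id is not None:
        # The agent already sends host.id: keep it and ignore the received
        # value, warning only on a real mismatch.
        # Assumption: an empty received value is not treated as a mismatch.
        if received_host_id and received_host_id != own_host_id:
            logger.warning(
                "host.id mismatch: agent=%r profiler=%r",
                own_host_id, received_host_id,
            )
        return own_host_id
    # The agent collects no host.id: adopt the profiler's value only when
    # it is non-empty, otherwise keep sending none.
    return received_host_id or None
```

Keeping the agent's own value on a mismatch follows the "otherwise ignore the value received" wording, so the warning is the only visible effect in that case.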