Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(profiling): avoid stack size limits on musl for stack v2 threads #12174

Open
wants to merge 14 commits into
base: 3.x-staging
Choose a base branch
from

Conversation

sanchda
Copy link
Contributor

@sanchda sanchda commented Jan 30, 2025

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

Yun-Kim and others added 9 commits January 27, 2025 21:23
## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: Nicole Cybul <nicole.cybul@datadoghq.com>
#12075)

We added locking to the memory profiler to address crashes. These locks
are mostly "try" locks, meaning we bail out if we can't acquire them
right away. This was done defensively to mitigate the possibility of
deadlock until we fully understood why the locks are needed and could
guarantee their correctness. But as a result of using try locks, the
`iter_events` function in particular can fail if the memory profiler lock
is contended when it tries to collect profiling events. The function
then returns NULL, leading to SystemError exceptions because we don't
set an error.

Even if we set an error, returning NULL isn't the right thing to do.
It'll basically mean we wait until the next profile iteration, still
accumulating events in the same buffer, and try again to upload the
events. So we're going to get multiple iteration's worth of events. The
right thing to do is take the lock unconditionally in `iter_events`. We
can allocate the new tracker outside the memory allocation profiler lock
so that we don't need to worry about reentrancy/deadlock issues if
we start profiling that allocation. Then, the only thing we do under the
lock is swap out the global tracker, so it's safe to take the lock
unconditionally.

Fixes #11831

TODO - regression test?
…le (#12035)

## Motivation

Refactors all web server integrations still using `tracer.trace` to
instead use `core.context_with_data`. This is in preparation for
supporting AWS API Gateway to ensure all web servers share the same code
path.

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
Depending of the timing, libddwaf loading process could create triggers
that would create loops in our instrumentation.
From what I investigated:
- if loaded too early, it could have bad interactions with gevent.
- if loaded too late, it could be self instrumented by the tracer,
creating a loop, as ctypes is using Popen and subprocess.

while keeping the late loading introduced by 2 previous PRs
- #11987
- #12013
this PR introduced a mechanism to bypass tracer instrumentation during
ctypes loading, to avoid a possible loop that would prevent the WAF to
be loaded.


## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
…t config (#12089)

This PR makes a change to our shared distributed tracing header
injection method to dispatch signals/events instead of relying on the
global config settings, which is only modifiable via env vars. This
fixes distributed tracing for users that might rely solely on the
`LLMObs.enable()` setup config.

Programmatic `LLMObs.enable()/disable()` calls do not set the global
`config._llmobs_enabled` boolean setting, which is only controlled by
the `DD_LLMOBS_ENABLED` env var. This was problematic for users that
relied on manual `LLMObs.enable()` setup (i.e. no env vars) because our
distributed tracing injection code only checks the global config to
inject llmobs parent IDs into request headers. If users manually enabled
LLMObs without any env vars, then this would not be reflected in the
global config value and thus LLMObs parent IDs would never be injected
into the request headers.

We can't check directly if LLMObs is enabled in the http injection
module because:
1. This would require us to import significant product-specific
LLMObs-code into the shared http injector helper module which would
impact non-LLMObs users' app performance
2. Circular imports in LLMObs which imports http injector logic to use
in its own helpers

Instead of doing our check based on the global `config._llmobs_enabled`
setting, we now send a tracing event to our shared product listeners,
and register a corresponding `LLMObs._inject_llmobs_context()` hook to
be called for all inject() calls if LLMObs is enabled (we check the
LLMObs instance, not the global config setting value).

~One risk and why I don't like changing global config settings is
because this then implies that it is no longer global or tied to an env
var (I want to push for env var configuration where possible over manual
overriding/enabling). If a global enabled config can be toggled
indiscriminately then this could open a can of worms for
enabling/disabling logic in our LLMObs service, which isn't really
designed to be toggled on/off multiple times in the app's lifespan.
However if some users cannot rely on env vars, then I don't see any
other solution that does not couple tracer internal code with LLMObs
code which is a no-option.~ (UPDATE: we avoided this issue by using
signal dispatching)

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
ddtrace v3.0 is set to remove `ddtrace.providers` module from the public
API. However we should still allow users to use their own
ContextProviders. This PR ensures the BaseContextProvider remains in the
public API and can be used in `tracer.configure(...)`.

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
@sanchda sanchda requested a review from a team as a code owner January 30, 2025 16:56
@datadog-dd-trace-py-rkomorn
Copy link

datadog-dd-trace-py-rkomorn bot commented Jan 30, 2025

Datadog Report

Branch report: sanchda/stackv2_thread_limits
Commit report: bd4d0be
Test service: dd-trace-py

✅ 0 Failed, 130 Passed, 1468 Skipped, 4m 28.14s Total duration (35m 42.66s time saved)

@taegyunkim taegyunkim force-pushed the sanchda/stackv2_thread_limits branch from 7b3f089 to c6c4f15 Compare January 30, 2025 17:01
@taegyunkim taegyunkim requested review from a team as code owners January 30, 2025 17:01
@taegyunkim taegyunkim changed the base branch from main to 3.x-staging January 30, 2025 17:01
@taegyunkim taegyunkim removed request for a team January 30, 2025 17:02
@taegyunkim taegyunkim removed request for a team January 30, 2025 17:02
Copy link
Contributor

github-actions bot commented Jan 30, 2025

CODEOWNERS have been resolved as:

ddtrace/_trace/trace_handlers.py                                        @DataDog/apm-sdk-api-python
ddtrace/contrib/internal/aiohttp/middlewares.py                         @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/contrib/internal/bottle/trace.py                                @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/contrib/internal/cherrypy/patch.py                              @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/contrib/internal/falcon/middleware.py                           @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/contrib/internal/molten/patch.py                                @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/contrib/internal/pyramid/trace.py                               @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/contrib/internal/sanic/patch.py                                 @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/internal/datadog/profiling/stack_v2/include/sampler.hpp         @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack_v2/src/sampler.cpp             @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack_v2/src/stack_v2.cpp            @DataDog/profiling-python
src/native/Cargo.lock                                                   @DataDog/apm-core-python
tests/smoke_test.py                                                     @DataDog/apm-core-python
tests/snapshots/tests.contrib.cherrypy.test_middleware.test_child.json  @DataDog/apm-python
tests/snapshots/tests.contrib.cherrypy.test_middleware.test_error.json  @DataDog/apm-python
tests/snapshots/tests.contrib.cherrypy.test_middleware.test_fatal.json  @DataDog/apm-python
tests/snapshots/tests.contrib.cherrypy.test_middleware.test_success.json  @DataDog/apm-python
tests/snapshots/tests.contrib.sanic.test_sanic_server.test_multiple_requests_sanic_http.json  @DataDog/apm-python
tests/snapshots/tests.contrib.sanic.test_sanic_server.test_multiple_requests_sanic_http_pre_21.9.json  @DataDog/apm-python
tests/snapshots/tests.contrib.sanic.test_sanic_server.test_sanic_errors.json  @DataDog/apm-python
tests/snapshots/tests.contrib.sanic.test_sanic_server.test_sanic_errors_pre_21.9.json  @DataDog/apm-python

@taegyunkim taegyunkim force-pushed the sanchda/stackv2_thread_limits branch from c6c4f15 to c18b1f6 Compare January 30, 2025 17:03
@taegyunkim taegyunkim added the Profiling Continous Profling label Jan 30, 2025
@pr-commenter
Copy link

pr-commenter bot commented Jan 30, 2025

Benchmarks

Benchmark execution time: 2025-01-30 22:30:56

Comparing candidate commit bd4d0be in PR branch sanchda/stackv2_thread_limits with baseline commit 0ed084a in branch 3.x-staging.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Profiling Continous Profling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants