Switch from Progrock to OpenTelemetry (dagger#6835)
* progrock -> otel
* All Progrock plumbing is now gone, though we may want to bring it back for compatibility. Removing it was a useful exercise to find the many places where we're relying on it.
* TUI now supports -v, -vv, -vvv (configurable verbosity). Global flags like --debug, -v, --silent, etc. are processed anywhere in the command string and respected.
* CLI forwards engine traces and logs to configured exporters, no need to configure engine-side (we already need this flow for the TUI)
* "Live" spans are emitted to TUI and cloud, filtered out before sending to a traditional (non-Live) otel stack
* Engine supports pub/sub for traces and logs, can be exposed in the future as a GraphQL subscription
* Refactor context.Background usage to context.WithoutCancel. We usually don't want a total reset, since that drops the span context and any other telemetry-related things (loggers etc). Go 1.21 added context.WithoutCancel which is more precise.
* engine: don't include source in slogs. Added this prospectively and it doesn't seem worth the noise.
* idtui: DB can record multiple traces, polish
* multi traces is mostly for dagviz, so i can run it with a single DB
* add 'passthrough' UI flag which tells the UI to ignore a span and descend into its children
* add 'ignore' UI flag, to be used sparingly for things whose signal:noise ratio is irredeemably low (e.g. 'id' calls)
* make loadFooFromID calls passthrough
* make Buildkit gRPC calls passthrough
* Global Progrock logs are theoretically replaced with tracing.GlobalLogger, but it has yet to be integrated into anything.
* Module functions are pure after all. They're already cached per-session, so this makes DagQL reflect that, avoiding duplicate Buildkit work that would be deduped at the Buildkit layer. Cleans up the telemetry since previously you'd see duplicate queries.
* TODO: ensure draining is airtight
* TODO: global logging to TUI
* TODO: batch forwarded engine spans instead of emitting them "live"
* TODO: fix dagger terminal
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix log draining, again, ish
  previously we would cancel all subscribers for a trace whenever a client/server went away. but for modules/nesting this meant the inner call would cancel the whole trace early.
* TODO: looks like services still don't drain completely?
Signed-off-by: Alex Suraci <alex@dagger.io>
* don't set up logs if not configured
Signed-off-by: Alex Suraci <alex@dagger.io>
* respect configured level
Signed-off-by: Alex Suraci <alex@dagger.io>
* clean up shim early tracing remnants
Signed-off-by: Alex Suraci <alex@dagger.io>
* synchronously detach services on main client exit
  previously service spans would be left incomplete on exit. now we'll detach from them on shutdown, which will only stop the service if we're the last depender on it. end result _should_ be that services are always completed through telemetry, but I've seen maybe 2 in 50 runs still leave it running. still troubleshooting, but without this change there is no hope at all.
  fixes dagger#6493
Signed-off-by: Alex Suraci <alex@dagger.io>
* flush telemetry before closing server clients
  Honestly not 100% confirmed, but seems right. I think the final solution might be to get traces/logs out without going through a session in the first place.
Signed-off-by: Alex Suraci <alex@dagger.io>
* switch from errgroup to conc for panic handling
  seeing a panic in ExportSpans/UploadTraces, this should help avoid bringing whole server down - I think - or at least give us hope.
Signed-off-by: Alex Suraci <alex@dagger.io>
* nest 'starting session' beneath 'connect'
Signed-off-by: Alex Suraci <alex@dagger.io>
* send logs out from engine to log exporter too
Signed-off-by: Alex Suraci <alex@dagger.io>
* bump midterm
Signed-off-by: Alex Suraci <alex@dagger.io>
* switch to server-side telemetry pub/sub
  fetching the logs/traces over a session is really annoying with draining because the session itself gets closed before things can be fully flushed.
Signed-off-by: Alex Suraci <alex@dagger.io>
* show newer traces first
Signed-off-by: Alex Suraci <alex@dagger.io>
* cleanup
Signed-off-by: Alex Suraci <alex@dagger.io>
* send individual Calls over telemetry instead of IDs
  More than a 10x efficiency increase. Frontend still super easy to implement.
  Test:
    # in ~/src/bass
    $ with-dev dagger call -m ./ --src https://github.com/vito/bass unit --packages ./pkg/cli stdout --debug &> out
    $ rg measuring out | cut -d= -f2 | xargs | tr ' ' '+' | sed -e 's/0m//g' -e 's/[^0-9\+]//g' | cat -v | bc
  Before: 8524838 (~8.1 MiB)
  After: 727039 (~0.7 MiB)
Signed-off-by: Alex Suraci <alex@dagger.io>
* idtui Base was correct in returning bool
Signed-off-by: Alex Suraci <alex@dagger.io>
* handle case where calls haven't been seen yet
  kinda hacky, but it makes sense that we need to handle this, cause loadFooFromID or generally anything can take an ID that's never been seen by the server before, and the loadFooFromID span will come first.
Signed-off-by: Alex Suraci <alex@dagger.io>
* idtui: add space between progress and primary output
Signed-off-by: Alex Suraci <alex@dagger.io>
* swap -vvv and -vv, -vv now breaks encapsulation
Signed-off-by: Alex Suraci <alex@dagger.io>
* cleanups
Signed-off-by: Alex Suraci <alex@dagger.io>
* tidy mage
Signed-off-by: Alex Suraci <alex@dagger.io>
* tidy
Signed-off-by: Alex Suraci <alex@dagger.io>
* loosen go.mod constraints
Signed-off-by: Alex Suraci <alex@dagger.io>
* revive labels tests
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix cachemap tests
Signed-off-by: Alex Suraci <alex@dagger.io>
* nuclear option: wait for all spans to complete
  Rather than closing the telemetry connection and hoping the timing works out, we keep track of which traces have active spans and wait for that count to reach 0. A bit more complicated but not seeing a simpler solution really. Without this we can't ensure that the client sees the very outermost spans complete.
Signed-off-by: Alex Suraci <alex@dagger.io>
* pass-through all gRPC stuff
  hasn't really been useful, it's available in the full trace for devs, or we can add a verbosity level.
Signed-off-by: Alex Suraci <alex@dagger.io>
* dagviz: tweaks to support visualizing a live trace
Signed-off-by: Alex Suraci <alex@dagger.io>
* better 'docker tag' parsing
Signed-off-by: Alex Suraci <alex@dagger.io>
* fixup docker tag check
Signed-off-by: Alex Suraci <alex@dagger.io>
* pass auth headers to OTLP logs too
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix stdio not making it out of gateway containers
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix terminal support
Signed-off-by: Alex Suraci <alex@dagger.io>
* drain immediately when interrupted
  otherwise we can get stuck waiting for child spans of a nested process that got kill -9'd. not perfect but better than hanging on Ctrl+C which is already an emergent situation where you're not likely that interested in any remaining data if you already had a reason to interrupt.
  in Cloud we'll clean up any orphaned spans based on keepalives anyway.
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix unintentionally HTTP-ifying gRPC otlp endpoint
Signed-off-by: Alex Suraci <alex@dagger.io>
* give up retrying connection if outer ctx canceled
Signed-off-by: Alex Suraci <alex@dagger.io>
* initiate draining only when main client goes away
Signed-off-by: Alex Suraci <alex@dagger.io>
* appease linter
Signed-off-by: Alex Suraci <alex@dagger.io>
* remove unnecessary wait
  we don't need to try synchronizing here now that we just generically wait for all spans to complete
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix panic if no telemetry
Signed-off-by: Alex Suraci <alex@dagger.io>
* remove debug log
Signed-off-by: Alex Suraci <alex@dagger.io>
* print final progress tree in plain mode
  no substitute for live console streaming, but easier to implement for now, and probably easier to read in CI. probably needs more work, but might get some tests passing.
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix Windows build
Signed-off-by: Alex Suraci <alex@dagger.io>
* propagate spans through dagger-in-dagger
Signed-off-by: Alex Suraci <alex@dagger.io>
* retry connecting to telemetry
Signed-off-by: Alex Suraci <alex@dagger.io>
* propagate span context through dagger run
Signed-off-by: Alex Suraci <alex@dagger.io>
* install default labels as otel resource attrs
Signed-off-by: Alex Suraci <alex@dagger.io>
* tidy
Signed-off-by: Alex Suraci <alex@dagger.io>
* remove pipeline tests
  these are expected to fail now
Signed-off-by: Alex Suraci <alex@dagger.io>
* fail root span when command fails
Signed-off-by: Alex Suraci <alex@dagger.io>
* Container.import: add span for streaming image
Signed-off-by: Alex Suraci <alex@dagger.io>
* idtui: break encapsulation in case of errors
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix schema-level logging not exporting
  caught by TestDaggerUp/random
Signed-off-by: Alex Suraci <alex@dagger.io>
* update TestDaggerRun assertion
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix test not syncing on progress completion
Signed-off-by: Alex Suraci <alex@dagger.io>
* add verbose debug log
Signed-off-by: Alex Suraci <alex@dagger.io>
* respect $DAGGER_CLOUD_URL and $DAGGER_CLOUD_TOKEN
  promoting these from _EXPERIMENTAL along the way, which has already been done for _TOKEN, don't really see a strong reason to keep the _EXPERIMENTAL prefix, but low conviction
Signed-off-by: Alex Suraci <alex@dagger.io>
* port 'processor: support span keepalive'
  originally aluzzardi/otel-in-flight@2fc011f
Signed-off-by: Alex Suraci <alex@dagger.io>
* add 'watch' command
  really helps with troubleshooting hanging tests!
Signed-off-by: Alex Suraci <alex@dagger.io>
* set a reasonable window size in plain mode
  otherwise the terminals resize a ton of times when a long string is printed, absolutely tanking performance. would be nice if that were fast, but no time for that now.
Signed-off-by: Alex Suraci <alex@dagger.io>
* manually revert container.import change
  i thought this wouldn't break it, but ... ?
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix race
Signed-off-by: Alex Suraci <alex@dagger.io>
* mark watch command experimental
Signed-off-by: Alex Suraci <alex@dagger.io>
* fixup lock, more logging
Signed-off-by: Alex Suraci <alex@dagger.io>
* tidy
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix data race in tests
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix java SDK hang once again really not sure what's writing to stderr even with --silent but this is just too brittle. redirect stderr to /dev/null instead.
Signed-off-by: Alex Suraci <alex@dagger.io>
* retire dagger.io/ui.primary, use root span instead
  fixes Views test; frontend must have been getting confused because there were multiple "primary" spans
Signed-off-by: Alex Suraci <alex@dagger.io>
* take 2: just manually mark the 'primary' span
Signed-off-by: Alex Suraci <alex@dagger.io>
* merge tracing and telemetry packages
Signed-off-by: Alex Suraci <alex@dagger.io>
* cleanups
Signed-off-by: Alex Suraci <alex@dagger.io>
* roll back sync detach change
  this was no longer needed with the change to wait for spans to finish, not worth the review-time distraction
Signed-off-by: Alex Suraci <alex@dagger.io>
* cleanups
Signed-off-by: Alex Suraci <alex@dagger.io>
* update comment
Signed-off-by: Alex Suraci <alex@dagger.io>
* remove dead code
Signed-off-by: Alex Suraci <alex@dagger.io>
* default primary span to root span
Signed-off-by: Alex Suraci <alex@dagger.io>
* remove unused module arg
Signed-off-by: Alex Suraci <alex@dagger.io>
* send engine traces/logs to cloud
Signed-off-by: Alex Suraci <alex@dagger.io>
* implement stub metrics pub/sub
  Some clients presume this service is supported by the OTLP endpoint. So we can just have a stub implementation for now.
Signed-off-by: Alex Suraci <alex@dagger.io>
* sdk/go runtime: implement otel propagation
  TODO: set up otel for you
Signed-off-by: Alex Suraci <alex@dagger.io>
* tidy
Signed-off-by: Alex Suraci <alex@dagger.io>
* add scary comment
Signed-off-by: Alex Suraci <alex@dagger.io>
* batch events that are sent from the engine
  Previously we were just sending each individual update to the configured exporters, which was very expensive and would even slow down the TUI. When I originally tried to send it to span processors, nothing would be sent out; turns out that was because the transform.Spans call we were using didn't set the `Sampled` trace flag. Now we forward engine traces and logs to all configured processors, so their individual batching settings should be respected.
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix spans being deduped within single batch
* fix detection for in flight spans; we need to check EndTime < StartTime since sometimes we end up with a 1754 timestamp
* when a span is already present in a batch, update it in-place rather than dropping it on the floor
Signed-off-by: Alex Suraci <alex@dagger.io>
* Add Python support
Signed-off-by: Helder Correia <174525+helderco@users.noreply.github.com>
* shim: proxy otel to 127.0.0.1:0
  more universally compatible than unix://
Signed-off-by: Alex Suraci <alex@dagger.io>
* remove unnecessary fn
Signed-off-by: Alex Suraci <alex@dagger.io>
* attributes: add passthrough, bikeshed + document
  also start cleaning up "tasks" cruft nonsense, these can just be plain old attributes on a single span i think
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix janky flag parsing
  parse global flags in two passes, ensuring the same flags are installed in both cases, and capturing the values before installing them into the real flag set, since that clobbers the values
Signed-off-by: Alex Suraci <alex@dagger.io>
* discard Buildkit progress
  ...just in case it gets buffered in memory forever otherwise
Signed-off-by: Alex Suraci <alex@dagger.io>
* sdk/go: somewhat gross support for opentelemetry
  had to copy-paste a lot of the telemetry code into sdk/go/. would love to just move everything there so it can be shared between the shim, the Go runtime, and the engine, however it is currently a huge PITA to share code between all three, because of the way codegen works. saving that for another day. maybe tomorrow.
Signed-off-by: Alex Suraci <alex@dagger.io>
* send logs to function call span, not exec /runtime
Signed-off-by: Alex Suraci <alex@dagger.io>
* tui: respect dagger.io/ui.mask
  no more exec /runtime!
Signed-off-by: Alex Suraci <alex@dagger.io>
* silence linter
  worth refactoring, but not now™
Signed-off-by: Alex Suraci <alex@dagger.io>
* ignore --help when parsing global flags
Signed-off-by: Alex Suraci <alex@dagger.io>
* Pin python requirements
Signed-off-by: Helder Correia <174525+helderco@users.noreply.github.com>
* revert Python SDK changes for now
  looks like there's more to figure out with module dependencies? either way, don't want this to block the current PR, they can be re-introduced in another PR like the other SDKs
  Revert "Pin python requirements"
  This reverts commit b40c411.
  Revert "Add Python support"
  This reverts commit 08aa92c.
Signed-off-by: Alex Suraci <alex@dagger.io>
* fix race conditions in python SDK runtime
Signed-off-by: Alex Suraci <alex@dagger.io>
---------
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Helder Correia <174525+helderco@users.noreply.github.com>
Co-authored-by: Helder Correia <174525+helderco@users.noreply.github.com>
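
Illustration for the context.Background -> context.WithoutCancel refactor mentioned above. This is a minimal sketch, not code from this change: traceKey and the "trace-123" value are hypothetical stand-ins for the span context and loggers carried in a real context. It only shows the stdlib behavior the message relies on: the detached context keeps the parent's values but is not canceled with it, whereas a fresh context.Background() has no values at all.

```go
package main

import (
	"context"
	"fmt"
)

// traceKey is a hypothetical context key standing in for span context,
// loggers, and other telemetry state stored on a context.
type traceKey struct{}

// detach returns a context that keeps the parent's values but ignores
// the parent's cancellation (context.WithoutCancel, Go 1.21+).
func detach(parent context.Context) context.Context {
	return context.WithoutCancel(parent)
}

// demo cancels a parent context and reports what survives on the
// detached context, and what a fresh context.Background() would carry.
func demo() (traceID string, detachedCanceled bool, backgroundHasTrace bool) {
	base := context.WithValue(context.Background(), traceKey{}, "trace-123")
	parent, cancel := context.WithCancel(base)
	detached := detach(parent)

	cancel() // the parent is canceled from here on

	traceID, _ = detached.Value(traceKey{}).(string)
	detachedCanceled = detached.Err() != nil
	_, backgroundHasTrace = context.Background().Value(traceKey{}).(string)
	return
}

func main() {
	id, canceled, bg := demo()
	fmt.Printf("detached trace=%q canceled=%v; background has trace=%v\n", id, canceled, bg)
}
```

Cleanup work that must outlive a canceled request can use the detached context and still report into the same trace.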
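
Illustration for the "fix spans being deduped within single batch" items above. This is a sketch under stated assumptions: Span and Batch are hypothetical, simplified types, not the engine's or the OpenTelemetry SDK's. It shows the two ideas from the message: treat a span as in flight when its end time is earlier than its start time (which also covers an unset or bogus end timestamp), and when a span ID is already in the batch, overwrite that entry in place instead of dropping the newer snapshot.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a hypothetical, simplified span snapshot.
type Span struct {
	ID    string
	Name  string
	Start time.Time
	End   time.Time
}

// InFlight reports whether the span is still running. An end time that
// precedes the start time (unset or bogus) is treated as "not finished".
func (s Span) InFlight() bool {
	return s.End.Before(s.Start)
}

// Batch keeps the latest snapshot per span ID in first-seen order.
type Batch struct {
	spans []Span
	index map[string]int // span ID -> position in spans
}

func NewBatch() *Batch {
	return &Batch{index: map[string]int{}}
}

// Add appends a new span, or updates the existing entry in place when
// the span ID is already present in the batch.
func (b *Batch) Add(s Span) {
	if i, ok := b.index[s.ID]; ok {
		b.spans[i] = s
		return
	}
	b.index[s.ID] = len(b.spans)
	b.spans = append(b.spans, s)
}

func (b *Batch) Spans() []Span { return b.spans }

// demoBatch records two live spans, then a completed update for the first.
func demoBatch() []Span {
	start := time.Date(2024, 3, 1, 12, 0, 0, 0, time.UTC)
	b := NewBatch()
	b.Add(Span{ID: "a", Name: "build", Start: start}) // zero End: in flight
	b.Add(Span{ID: "b", Name: "test", Start: start})  // zero End: in flight
	b.Add(Span{ID: "a", Name: "build", Start: start, End: start.Add(2 * time.Second)})
	return b.Spans()
}

func main() {
	for _, s := range demoBatch() {
		fmt.Printf("%s in-flight=%v\n", s.Name, s.InFlight())
	}
}
```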
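
Illustration for "nuclear option: wait for all spans to complete" and "drain immediately when interrupted" above. This is a rough sketch, assuming a single counter rather than per-trace bookkeeping; spanTracker and its methods are hypothetical names, not the engine's API. Wait blocks until the active span count reaches 0, and gives up early if the context is canceled, which is how an interrupt could skip the wait.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// spanTracker is a hypothetical helper that counts active spans.
type spanTracker struct {
	mu     sync.Mutex
	active int
	idle   chan struct{} // closed whenever active == 0
}

func newSpanTracker() *spanTracker {
	idle := make(chan struct{})
	close(idle) // nothing is active yet
	return &spanTracker{idle: idle}
}

// Start records a newly started span.
func (t *spanTracker) Start() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.active == 0 {
		t.idle = make(chan struct{}) // leaving the idle state
	}
	t.active++
}

// End records a completed span; the caller must have called Start first.
func (t *spanTracker) End() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.active--
	if t.active == 0 {
		close(t.idle)
	}
}

// Wait blocks until no spans are active, or returns ctx.Err() if the
// context is canceled first (for example on Ctrl+C).
func (t *spanTracker) Wait(ctx context.Context) error {
	t.mu.Lock()
	idle := t.idle
	t.mu.Unlock()
	select {
	case <-idle:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demoDrain waits for a span that completes shortly afterwards.
func demoDrain() error {
	t := newSpanTracker()
	t.Start()
	go func() {
		time.Sleep(10 * time.Millisecond)
		t.End()
	}()
	return t.Wait(context.Background())
}

// demoInterrupt waits on a span that never ends, with a canceled context.
func demoInterrupt() error {
	t := newSpanTracker()
	t.Start() // never ended, e.g. the child process was killed
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the interrupt
	return t.Wait(ctx)
}

func main() {
	fmt.Println("drained:", demoDrain())
	fmt.Println("interrupted:", demoInterrupt())
}
```

Swapping a fresh channel in on each 0 -> 1 transition lets Wait use a plain select alongside ctx.Done(), which a sync.WaitGroup does not offer directly.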