
Begin tracing #844

Closed

wants to merge 2 commits

@MatthewKhouzam

Introduce a tracer to Open-VSX.

So far, this tracer instruments the Query, QueryV2 and Search endpoints. It is used to see what kind of load is being placed on Open-VSX.

The output is in the Trace Event format (https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview) and can be viewed using Chromium (https://www.chromium.org/developers/how-tos/trace-event-profiling-tool/) or Eclipse Trace Compass (https://www.youtube.com/watch?v=YCdzmcpOrK4).

This patch aims to improve the observability of Open-VSX.

To instrument code, wrap the section to be traced in a try-with-resources block.
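The PR does not show the full ScopeLog class at this point in the thread, but the pattern it describes can be sketched as follows. The class body, message format, and helper names here are illustrative assumptions, not the PR's actual code:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ScopeLogDemo {
    static final Logger LOGGER = Logger.getLogger(ScopeLogDemo.class.getName());

    // Minimal sketch of the ScopeLog idea: log a "begin" event on
    // construction and an "end" event (with elapsed time) on close(),
    // so a try-with-resources block brackets the traced scope.
    static class ScopeLog implements AutoCloseable {
        private final Logger logger;
        private final Level level;
        private final String label;
        private final long startNanos = System.nanoTime();

        ScopeLog(Logger logger, Level level, String label) {
            this.logger = logger;
            this.level = level;
            this.label = label;
            // Supplier keeps message construction lazy when the level is off.
            logger.log(level, () -> "B|" + label);
        }

        @Override
        public void close() {
            long durNs = System.nanoTime() - startNanos;
            logger.log(level, () -> "E|" + label + "|" + durNs);
        }
    }

    static long tracedCall() {
        // Wrap the code to be traced in a try-with-resources block:
        try (ScopeLog sl = new ScopeLog(LOGGER, Level.FINER, "/api/-/search")) {
            long sum = 0;
            for (int i = 0; i < 1000; i++) sum += i;
            return sum; // close() fires here, emitting the "end" event
        }
    }

    public static void main(String[] args) {
        System.out.println(tracedCall()); // prints 499500
    }
}
```

Because ScopeLog implements AutoCloseable, the "end" event is emitted on every exit path, including exceptions.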

To run it, use the `logging.properties` file. By default the output is written to an `/opt/` folder; however, this can be changed.

Once the intermediate trace is written, calling `jsonify.py output-trace.json cleaned-trace.json` will generate a trace file that can be read in Chromium's chrome://tracing as well as in Eclipse Trace Compass with the Trace Event plugin installed from the incubator.

The logger was taken from Eclipse Trace Compass and is EPLv2-licensed. It was originally written by Geneviève Bastien and me.

Signed-off-by: Matthew Khouzam <matthew.khouzam@ericsson.com>
@vince-fugnitto (Member)

fyi: @planger

return new ResponseEntity<>(json, HttpStatus.BAD_REQUEST);
}
) {
try (ScopeLog sl = new ScopeLog(LOGGER, Level.FINER, "/api/-/search")) {

Using a special class for logging might not be a good idea. For OpenVSX to survive, it will be necessary to introduce some Application Performance Monitoring in the future, with its own monitoring. Instead of adjusting the code for new logging sinks, I would propose instrumenting the code with a standard API such as OpenTelemetry. Take a look at their server example. The code looks quite similar, but it is a standard API that can be consumed by many log/trace implementations.

@MatthewKhouzam (Author) commented Jan 8, 2024

This is a fair assessment. I see monitoring and postmortem tracing as two separate issues. What I would like in the future is to add OTel to the ScopeLogs so that the loggers can feed monitoring.

I was tempted to use OTel; my concerns were that

1. the community would not accept such a big change, and
2. since the spans are not AutoCloseable, you need to remember to call span.end() on them.

If these concessions are acceptable, we can do it.

@@ -0,0 +1,65 @@
###############################################################################

It was a big surprise for me that the application doesn't have any logging configured yet. I think adding some logging is definitely a good idea, but I am not sure about having the SnapshotHandler always on.

@MatthewKhouzam (Author)

The snapshot handler is only on if we use this logging.properties.

There are techniques to programmatically activate/deactivate it, though. I did not add one, such as an SSH command. The catch is that when you're observing a slowdown, monitoring already needs to be on.

Also, I should update the copyright header. (Happy 2024!)

&& (record.getLevel().intValue() <= Level.FINE.intValue())
&& (pattern.matcher(record.getMessage()).matches()));
if (logIt) { // add feature switch here
addToSnapshot(record.getMessage());

I would say it is not a good idea at all to call any "command" methods in such a "query"-by-design method as isLoggable(). By all the specifications, the method should be read-only.

@MatthewKhouzam (Author)

Found the solution!

public boolean isLoggable(LogRecord record) {
return
(fIsEnabled
&& super.isLoggable(record)
&& (record.getLevel().intValue() <= Level.FINE.intValue())
&& (pattern.matcher(record.getMessage()).matches()));
}

Then override publish() to call addToSnapshot() there.
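As a hedged illustration of that query/command split (the field and method names below are modeled on the snippet in this thread but are assumptions, not the PR's final code), isLoggable() stays a pure query and the side effect moves into publish():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;

public class SnapshotHandlerSketch extends Handler {
    private final List<String> snapshot = new ArrayList<>();
    private volatile boolean fIsEnabled = true;

    @Override
    public boolean isLoggable(LogRecord record) {
        // Pure query: no mutation happens here.
        return fIsEnabled
                && super.isLoggable(record)
                && record.getLevel().intValue() <= Level.FINE.intValue();
    }

    @Override
    public void publish(LogRecord record) {
        if (isLoggable(record)) {
            addToSnapshot(record.getMessage()); // the command lives here instead
        }
    }

    private void addToSnapshot(String message) {
        synchronized (snapshot) {
            snapshot.add(message);
        }
    }

    public int size() {
        synchronized (snapshot) {
            return snapshot.size();
        }
    }

    @Override public void flush() { }
    @Override public void close() { }
}
```

With this shape, any caller (including the framework) can probe isLoggable() repeatedly without corrupting the snapshot.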

@MatthewKhouzam (Author)

Fixed?

}

private boolean addToSnapshot(String message) {
InnerEvent event = InnerEvent.create(message);

I am not sure that serializing parameters into a message and then deserializing them is a good idea, from both a performance and a design point of view. What do you think about LogRecord.parameters? Will it be able to do the job?

@MatthewKhouzam (Author)

Yes, I will update and check. The caveat: there is no "supplier" method with parameters, which means the string is always built, AFAIK. That is why I made the string with the supplier. In other applications it made more sense; in VSX, not so much.

@MatthewKhouzam (Author)

I looked at both parameters and suppliers. For today, suppliers are quite a bit simpler. But I am not against doing parameters if you hold a strong opinion on it, nor am I against doing parameters later.

@MatthewKhouzam (Author)

I fixed it. Now the parameters are lazily serialized on getMessage(), in the drain thread. I think this is as good as it gets. ;)
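As a small illustration of the supplier-versus-eager trade-off discussed in this thread (all names below are invented for the demo), java.util.logging's Supplier overload only builds the message when the record is actually publishable:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyLogDemo {
    static final Logger LOGGER = Logger.getLogger(LazyLogDemo.class.getName());
    static final AtomicInteger RENDER_COUNT = new AtomicInteger();

    static String expensiveRender() {
        RENDER_COUNT.incrementAndGet(); // counts how often we really serialize
        return "payload";
    }

    public static void main(String[] args) {
        LOGGER.setLevel(Level.INFO); // FINER is disabled

        // Eager: expensiveRender() runs even though the record is dropped.
        LOGGER.log(Level.FINER, "event: " + expensiveRender());

        // Lazy: the Supplier is never invoked because FINER is off.
        Supplier<String> lazy = () -> "event: " + expensiveRender();
        LOGGER.log(Level.FINER, lazy);

        System.out.println(RENDER_COUNT.get()); // prints 1: only the eager call ran
    }
}
```

The parameterized `log(Level, String, Object[])` overload defers formatting to the Formatter (i.e. to whichever thread drains the records), but the parameter array itself is still allocated at the call site.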

*/
public static class ScopeLog implements AutoCloseable {

private final long fTime;

It looks like the f<name> pattern is not used anywhere else in the code, and even the PR itself contains another style for naming class members. I would propose choosing a single style and sticking to it.

*
* @author Geneviève Bastien
*/
public static class FlowScopeLogBuilder {

I can't find any usages of the class in the other parts of the code. Is it really necessary there?

@MatthewKhouzam (Author)

It is for future use; we use it in Trace Compass to have spans with links.

server/jsonify.py (resolved)
@MatthewKhouzam (Author)

Minor update.

This is a way to take a snapshot of any request that takes longer than a given time n.

The SnapshotHandler will consume every event and keep it in memory (up to maxEvents). It will then dump the full trace to disk when the specified threshold is met.
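A minimal sketch of that snapshot mechanism, assuming a simple bounded buffer with a latency threshold (class and method names are illustrative, not the PR's actual SnapshotHandler):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Keep the most recent events in memory (bounded by maxEvents) and dump
// the whole buffer only when a request exceeds the configured threshold.
public class SnapshotBuffer {
    private final Deque<String> events = new ArrayDeque<>();
    private final int maxEvents;
    private final long thresholdNanos;

    public SnapshotBuffer(int maxEvents, long thresholdNanos) {
        this.maxEvents = maxEvents;
        this.thresholdNanos = thresholdNanos;
    }

    public synchronized void add(String event) {
        if (events.size() == maxEvents) {
            events.removeFirst(); // evict the oldest to bound memory use
        }
        events.addLast(event);
    }

    // Called when a request finishes; returns the events to dump to disk,
    // or an empty list if the request was fast enough to ignore.
    public synchronized List<String> onRequestDone(long durationNanos) {
        if (durationNanos < thresholdNanos) {
            return List.of();
        }
        List<String> snapshot = new ArrayList<>(events);
        events.clear();
        return snapshot;
    }
}
```

The appeal of this design is that the instrumentation cost is paid only in memory for fast requests; disk I/O happens solely when a slow request trips the threshold.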

To use the snapshotter, pass the following argument to OpenVSX when starting it up: `-Djava.util.logging.config.file=%gitroot%/logging.properties`, where %gitroot% is the directory the repository is checked out to. The logging.properties should be updated to match the current configuration of the Open-VSX server.
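For illustration only, a logging.properties along these lines could route records to the snapshot handler; the handler's fully qualified class name below is a hypothetical placeholder, not taken from the PR:

```properties
# Route all records through the snapshot handler (hypothetical FQCN):
handlers = org.eclipse.openvsx.util.SnapshotHandler

# Capture down to FINER so the ScopeLog begin/end events are recorded:
.level = FINER
org.eclipse.openvsx.level = FINER
```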

The performance impact with ONE user on ONE computer was imperceptible; however, this needs to be verified prior to deployment in production.

Traces will be placed in the folder specified by `logging.properties`. The traces follow the Trace Event standard and can be opened with Chromium or Eclipse Trace Compass.

The logger is based on the Trace Compass logger, which is also EPLv2.

Signed-off-by: Matthew Khouzam <matthew.khouzam@ericsson.com>
@MatthewKhouzam (Author)

This is a decent-sized update; it fixes most things except for the code style, which I would leave up to the actual maintainers of the code.

I see that this patch may help make data accessible in a GDPR-friendly, anonymized way. Right now we only record API calls, not who is calling or what is returned. This is anonymous and can be "public".

@amvanbaren (Contributor)

Thanks for the contribution @MatthewKhouzam. We use Spring Boot Actuator to manage and monitor the server. The data can be collected by Prometheus and displayed in Grafana, as outlined in this proof of concept: #667

What data are you looking to collect?

@MatthewKhouzam (Author)

MatthewKhouzam commented Jan 25, 2024

The data would be internal. At the moment, you can see that the API is logged, but also some inner for loops. They would be correlated so you can see which query is slow and, more importantly, why it is slow.

To be clear: Spring Boot Actuator is good at monitoring; this is for investigating.

@amvanbaren (Contributor)

There already is a LongRunningRequestFilter for investigating.

@MatthewKhouzam (Author)

Either way, I would update this patch to use https://github.com/eclipse-tracecompass/trace-event-logger. Regarding LongRunningRequestFilter: as far as I can tell, it logs the request, not what happens inside the request. This patch would correlate a request to a flame chart and flame graph.

https://www.youtube.com/watch?v=ZlAZy7GjXRE&pp=ygUWb2JzZXJ2YWJpbGl0eSB0aGVpYWNvbg%3D%3D

Here is an example of what we can show with this patch

(video: flame-charts.mp4)

@MatthewKhouzam (Author)

The reasoning of this patch is not to show when the code is slow, which is handled well with Spring Boot Actuator, but rather what is making it slow. For that, you need deeper instrumentation. This is why chrome://tracing exists alongside OTel; both tools have their purpose. This patch allows Chrome tracing, Perfetto, or Trace Compass to analyze the issue.

@amvanbaren (Contributor)

Here is an example of what we can show with this patch

Interesting, is this from your own or Ericsson's Open VSX instance?

For that you either need deeper instrumentation.

Right, like https://www.baeldung.com/spring-boot-opentelemetry-setup or https://grafana.com/blog/2022/05/04/how-to-capture-spring-boot-metrics-with-the-opentelemetry-java-instrumentation-agent/.
I don't think handcoding the observability instrumentation is the way to go.

@MatthewKhouzam (Author)

Here is an example of what we can show with this patch

Interesting, this from your own or Ericsson's Open VSX instance?

It is from an instance of Trace Compass being traced, as well as an example from an OS-level trace. VSX was not shown, as the data is not very interesting with the Gatling traces; it was just a blob. :(

For that you either need deeper instrumentation.

Right, like https://www.baeldung.com/spring-boot-opentelemetry-setup

This is the problem:

(screenshot: optel)

or https://grafana.com/blog/2022/05/04/how-to-capture-spring-boot-metrics-with-the-opentelemetry-java-instrumentation-agent/ .

This is even less interesting, as we are dealing with KPIs, which are inferred and aggregated; a lot of information is lost.

I don't think handcoding the observability instrumentation is the way to go.

That is a fair point; however, the tracer will be abstracted out soon. I have a project at Eclipse for that. :) (Please thank the IP team for their FAST response.)

For better observability within VSX, you need to add spans using OpenTelemetry, or something similar from this library.

Here is an example of what the end goal would look like:

(image)

All the manual work is there to add precision beyond what an auto-instrumenting program can supply. Think of it like a JVMTI profiler: if it instruments everything, your program will collapse under the weight of the probes. OTel just does the endpoints. We are going deeper, but you need to define how deep.

Here are some references that can also help; they are a bit deep, though:
https://www.dorsal.polymtl.ca/files/may2022/mekhlasi_may2022_performance_issues_detection.pdf
https://amdls.dorsal.polymtl.ca/files/2019-05-06%20Progress%20Report%20Meeting.pdf
https://www.dorsal.polymtl.ca/files/december2022/asangam_december2022_tracing_tools_low_latency.pdf

@amvanbaren (Contributor)

For better observability within VSX, you need to add spans using open telemetry

Spring supports that: "We want to have some detailed observation of the MyUserService#userName method. Thanks to having added AOP support, we can use the @Observed annotation. To do so, we can register a ObservedAspect bean."
source: https://spring.io/blog/2022/10/12/observability-with-spring-boot-3#webmvc-server-code

@MatthewKhouzam (Author)

Yes, on a per-function level, not on a per-scope level.

A way to sidestep that would be to keep all functions simple, though. :)

@kkistm

kkistm commented Feb 2, 2024

@MatthewKhouzam, I still think that for an online service the tracing should be online as well, delivering up-to-date information at any moment in time. Trace Compass is an amazing tool, but I would prefer something OpenTelemetry-based, such as Jaeger.
Trace Compass could then be used by developers to find hotspots while working on the code locally. But even then, I would prefer a profiler-style tool that doesn't need changes in the code but uses some IDE integration instead.
What do you think about it?

@MatthewKhouzam (Author)

MatthewKhouzam commented Feb 2, 2024

I am OK with anything, as long as there's something. I was proposing the Chromium/Trace Compass solution since it is something we used internally, and there are many features not yet available in Jaeger (flame graphs, lookup stats, kernel correlation), but really... we just need something. :)

Regarding tracing delivering info at any moment: that's more monitoring, which is super important too. Post mortems and deep dives are where this patch was focusing.

@amvanbaren (Contributor)

I'm in the process of setting up monitoring with Grafana Cloud. Once monitoring works I want to add tracing/observability using OpenTelemetry. Grafana offers flame graphs, but I don't know if it "just works" or needs extra configuration.

@amvanbaren (Contributor)

Yes, on a per function level, not on a per-scope level.

You can also do it manually:

@Service
public class MyObservableService {

  private final ObservationRegistry observationRegistry;

  public MyObservableService(ObservationRegistry observationRegistry) {
    this.observationRegistry = observationRegistry;
  }

  public void tracedMethod() {
    Observation.createNotStarted("tracedMethod", this.observationRegistry)
      .lowCardinalityKeyValue("locale", "en-US")
      .highCardinalityKeyValue("userId", "77")
      .observe(() -> {
        // Write business logic here, all this code will now be traced/observed by this newly created "Observation" 
        // using the observationRegistry created earlier.
      });
  }
}

source: https://www.unlogged.io/post/enhanced-observability-with-java-21-and-spring-3-2
This is a good overview post: https://www.baeldung.com/spring-boot-3-observability

@MatthewKhouzam (Author)

I think #857 will do part of what this patch does.

If you want any feature of https://github.com/eclipse-tracecompass/trace-event-logger, you know where to find it.

@amvanbaren closed this Mar 18, 2024
Labels: none yet
Projects: Status: Done
4 participants