
Skip validation checkers when there are no config changes #8206

Merged
merged 14 commits into kiali:master from kiali8007-hash on Mar 12, 2025

Conversation

jshaughn
Collaborator

@jshaughn jshaughn commented Mar 4, 2025

Part-of #8007

The idea here is to generate a hash for config used in validations, using config ResourceVersion where applicable, and otherwise hashing relevant checker information. For example, we can't use Workload ResourceVersion because it changes when status is updated (and so is sensitive to pods coming and going). If the hash does not change at validation Reconcile time, then skip running the checkers.

This solution is something of a trade-off. It avoids the complexity needed for callback-based watchers on config, or other very granular approaches. But each reconcile still needs to gather the related config and generate the hash; the savings come from being able to avoid running the checkers. In the data presented in #8007 the checkers accounted for a lot of the validation time, so the hope is that this will be helpful and push off any need to do something more complicated.

The algorithm works like this:

For base config (used for cross-cluster validation)
  generate a hash for the config (basically workload config)

For each cluster:
  generate a hash for the config defined in that cluster (services, all of the istio config)
  if hash is different or there is a base config change
    run checkers for the cluster
    replace previous validations
  else 
    keep previous validations
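Below is a minimal Go sketch of this compare-and-skip flow. All names here (hashParts, reconcileValidations, the "_base" key) are hypothetical illustrations rather than the PR's actual identifiers; it only shows the control flow of hashing per-cluster config and skipping the checkers when nothing changed.

package main

import (
	"fmt"
	"hash/fnv"
)

// hashParts folds identifying strings (resourceVersions, label hashes, ...)
// into a single digest that can be compared on the next reconcile.
func hashParts(parts []string) string {
	h := fnv.New64a()
	for _, p := range parts {
		h.Write([]byte(p))
		h.Write([]byte{0}) // separator so "a","bc" differs from "ab","c"
	}
	return fmt.Sprintf("%x", h.Sum64())
}

// reconcileValidations runs the checkers only for clusters whose config hash
// changed, or for every cluster when the shared base config changed.
func reconcileValidations(changeMap map[string]string, baseParts []string, clusterParts map[string][]string) {
	baseHash := hashParts(baseParts)
	baseChanged := changeMap["_base"] != baseHash
	changeMap["_base"] = baseHash

	for cluster, parts := range clusterParts {
		hash := hashParts(parts)
		if baseChanged || changeMap[cluster] != hash {
			changeMap[cluster] = hash
			fmt.Printf("%s: config changed, run checkers and replace validations\n", cluster)
		} else {
			fmt.Printf("%s: no change, keep previous validations\n", cluster)
		}
	}
}

func main() {
	changeMap := map[string]string{}
	base := []string{"sa-hash-v1"}
	clusters := map[string][]string{"east": {"rv-100", "rv-101"}, "west": {"rv-200"}}
	reconcileValidations(changeMap, base, clusters) // first pass: everything runs
	reconcileValidations(changeMap, base, clusters) // second pass: everything skipped
}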

This also introduces a hidden config option to enable or disable this logic (default is enabled). The option is mainly a fail-safe, in case the logic turns out to have a flaw that prevents validation re-generation, or somehow ends up adding to resource consumption.

Operator PR: kiali/kiali-operator#885

@jshaughn jshaughn self-assigned this Mar 4, 2025
@jshaughn
Collaborator Author

jshaughn commented Mar 4, 2025

@hhovsepy , @nrfox , I created a POC here to see if this approach seemed feasible and worthwhile. Before moving forward I wanted to get your feedback on approach, or other general comments. We could have a short meeting, or you could drop comments here...

@jshaughn jshaughn requested review from hhovsepy and nrfox March 4, 2025 15:56
@nrfox
Contributor

nrfox commented Mar 4, 2025

Where this is most needed for performance gains (large clusters with lots of config/workloads/services), I don't think this approach will provide enough of a perf gain in practice, because workloads in particular will be constantly changing, so the odds of the validations being skipped will be low. The resource version changes even when the status updates on an object, so pods coming and going would invalidate the hash.

This would be more work to set up initially, but it would be much more efficient if we could validate a single object when it changes, or when any of its related objects change, rather than looking at every object and processing all validations whenever a single object changes. There are also changes to objects that we don't care about, and it would be good to filter those out too. For validations we probably only care if the spec changes on the various objects, not the status.
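(For illustration only: if the controller ever watched objects directly instead of reconciling on a timer, controller-runtime's GenerationChangedPredicate is one existing way to ignore status-only updates, since metadata.generation is bumped only on spec changes. A hedged sketch, with a hypothetical setupWatch and Deployment chosen arbitrarily as the watched type:)

package controller

import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupWatch is a hypothetical example, not the validations controller's real
// setup (it currently reconciles on a timer). GenerationChangedPredicate drops
// update events where only status/metadata changed, because
// metadata.generation is bumped only when the spec changes.
func setupWatch(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}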

@nrfox
Contributor

nrfox commented Mar 4, 2025

The current approach, where the validations controller processes all validations on a timer, was the simplest way to get started, but it is by no means the most efficient way of doing it.

@hhovsepy
Contributor

hhovsepy commented Mar 4, 2025

This can be a very useful approach for reducing the CPU load from validation reconciliation: instead of configuring longer reconcile intervals, we can keep them shorter but not trigger the validation process when nothing has changed.

@jshaughn
Collaborator Author

jshaughn commented Mar 4, 2025

The resource version changes even when the status updates on an object so pods coming and going would invalidate the hash.

OK, that's super annoying, and I didn't expect that behavior. I thought ResourceVersion was independent of status. But I agree that pods coming and going would mostly remove the benefit in a large production env. The intent was to capture an actual change that could affect validation.

I agree that the spec is what we care about. I suppose we could hash Spec at workload fetch time. This would add some overhead when populating the cache, but it's feasible. In essence, we'd be creating our own "WorkloadSpecVersion" field. I think that would be the only way forward with this approach.

Single object validation granularity is probably feasible, but definitely tricky, given the references and the fact that a lot of information can be required to validate even one object. Like everything, this comes down to effort and payoff; it's not clear to me that the most performant solution is required, at least not yet. We could just stop here and push out the optimizations in the PR we just completed, or I could go forward with this and try hashing the workload spec. I'm not sure I want to bite off more without more user feedback.

why did k8s not make it easier to detect non-status workload changes!

@jshaughn
Collaborator Author

jshaughn commented Mar 4, 2025

So, here's an interesting thing. I started digging down into the checkers and what they do with workloads. From what I can tell, the only information used in the validation checkers, aside from immutable stuff like name, namespace, and GVK, is the labels. So actually, the only real change to a workload that we'd need to detect is a change to its labels.

@jshaughn
Collaborator Author

jshaughn commented Mar 4, 2025

I updated the POC to stop relying on the workload's ResourceVersion, which is unusable as described above. The new approach is to target only changes to mutable workload info that is actually used in the validation checkers, which turns out to be just the labels (at least at the moment).
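A minimal sketch of that idea, hashing only the labels under a stable "cluster:namespace:name" key; the helper names and the example key are hypothetical, not the PR's actual code.

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashLabels produces a deterministic digest of a label map by hashing the
// sorted key=value pairs, so map iteration order doesn't matter.
func hashLabels(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s;", k, labels[k])
	}
	return fmt.Sprintf("%x", h.Sum64())
}

func main() {
	changeMap := map[string]string{}
	key := "east:bookinfo:reviews-v1" // cluster:namespace:name
	newHash := hashLabels(map[string]string{"app": "reviews", "version": "v1"})
	if changeMap[key] != newHash {
		changeMap[key] = newHash
		fmt.Println("workload labels changed: validation checkers need to run")
	}
}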

@@ -96,6 +96,8 @@ type ValidationsReconciler struct {
validationsService *business.IstioValidationsService
}

var changeMap = map[string]string{}
Contributor

This should not be a global var. See comments above about making the changemap internal to ValidationInfo but either way we'll need to save some state about the current hash. This should go into the Kiali cache. Not the whole validationInfo object but at least the hash, I guess per cluster. cache.Validations() could return an object that has the hash(es) in there along with the validations, for example.
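As a hedged sketch of the shape being suggested here (type, field, and method names are hypothetical, not Kiali's actual cache API):

package cache

import "sync"

// ValidationsSnapshot pairs stored validations with the per-cluster config
// hashes that produced them, so the next reconcile can compare hashes before
// deciding whether to re-run the checkers.
type ValidationsSnapshot struct {
	mu            sync.RWMutex
	validations   map[string]interface{} // cluster -> validations (placeholder type)
	clusterHashes map[string]string      // cluster -> config hash
}

func (s *ValidationsSnapshot) Hash(cluster string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.clusterHashes[cluster]
}

func (s *ValidationsSnapshot) SetHash(cluster, hash string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.clusterHashes == nil {
		s.clusterHashes = map[string]string{}
	}
	s.clusterHashes[cluster] = hash
}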

Collaborator Author

+1, I knew it had to change, I just wasn't sure what plumbing to do, and so left it easy for the POC. Before more work I'd like @hhovsepy to run another one of his perf comparisons, to see what's what...

Contributor

Can we run the perf comparison on PRs?

Contributor

Rerunning the perf test on this PR.

Collaborator Author

@nrfox I've moved this into the cache, let me know if it looks OK to you.

@jshaughn
Collaborator Author

jshaughn commented Mar 5, 2025

@nrfox @hhovsepy OK, given our discussion and some initial perf delta results, I'm going to flesh out the POC today, it seems worthwhile...

@hhovsepy
Contributor

hhovsepy commented Mar 5, 2025

From the Istio config list perspective, the results are more or less the same, which is expected considering that the optimization is in the backend process.
The rest of the results can be attributed to env instability issues.
(attached chart: Figure_1)

@nrfox
Contributor

nrfox commented Mar 5, 2025

From the Istio config list perspective, the results are more or less the same, which is expected considering that the optimization is in the backend process. The rest of the results can be attributed to env instability issues. (attached chart: Figure_1)

@jshaughn based on this ^^ it doesn't seem like the optimization will have much impact. Although like @hhovsepy mentioned there is a lot of variability in the results due to the environment and what we are testing.

@hhovsepy
Contributor

hhovsepy commented Mar 5, 2025

I think our cypress perf tests are not the right ones for checking this PR. It requires a pprof comparison in a long-running perf environment.

@jshaughn
Collaborator Author

jshaughn commented Mar 5, 2025

I think our cypress perf tests are not the right ones for checking this PR. It requires a pprof comparison in a long-running perf environment.

I agree, I'm not sure this is best measured by these tests. Remember that this does not affect individual object validations, only the full validation reconciles. So the impact may be harder to measure; it's mostly about overall server resource usage.

The impact is only to avoid running checkers when config doesn't change (at the expense of tracking the config changes). I'm going to make this configurable (undocumented option for now), and maybe we can do that sort of longer-running resource usage comparison.

@hhovsepy
Contributor

hhovsepy commented Mar 5, 2025

The measurement can be done with this scenario:
Start pprof monitoring.
Start the perf tests.
Leave the env running after the perf tests finish.
Stop pprof monitoring.
Measure the results and compare before/after this PR.
For this I think we need this PR rebased on master to pick up the benchmark tests merged today.

@jshaughn jshaughn force-pushed the kiali8007-hash branch 2 times, most recently from a1d3594 to 726b120 on March 5, 2025 20:33
@jshaughn
Collaborator Author

jshaughn commented Mar 5, 2025

@hhovsepy , @nrfox OK, I'm going to promote this PR as I don't have any more pending work or ideas. We can run some more tests and decide whether it's worth adding.

@jshaughn jshaughn marked this pull request as ready for review March 5, 2025 20:41
@hhovsepy
Contributor

hhovsepy commented Mar 6, 2025

It seems the hash change detection is too sensitive.
I just started a kind cluster locally and added a log line for both the changed and nothing-changed cases.
The result is that changes keep being detected even though I did nothing directly on the Istio config side. There are always some generated objects which we normally ignore; maybe they are being considered?
(screenshots attached)

jshaughn added 6 commits March 6, 2025 13:46
- keep a hash of config resourceVersions to determine if there have been
  changes, and therefore whether we need to run validation checkers

kiali8007
- this is too heavy, it looks like we only need to hash workload labels...
  hash of labels for the workloads, because it's the only mutable info
  we use in the checkers.
@hhovsepy
Contributor

hhovsepy commented Mar 7, 2025

Tested locally; I tried all possible CRUDs for Istio configs, workloads, and services from the Kiali UI and from kubectl.
The changes are recognized correctly and validations are recalculated; otherwise, when no changes are made, the re-validation is skipped.
The cypress perf suite found neither regressions nor improvements, neither of which is expected on the UI loading side.
The code looks good within the scope of its idea; just one small change request to fix the ServiceAccounts for the failing integration test.
I did not run pprof yet, but it is clear that skipping the recalculation of validations will save CPU.
One question here: in comparison, what is the CPU and memory load of loading and hashing all resources vs. the pure validation calculation process itself? TODO: run pprof.

jshaughn and others added 2 commits March 7, 2025 08:55
Co-authored-by: Hayk Hovsepyan <haik.hovsepian@gmail.com>
@hhovsepy
Contributor

hhovsepy commented Mar 7, 2025

The local CPU pprof results for BenchmarkValidate:
On master, 2.40s:
(pprof screenshot attached)

On this PR, with no config changes forced, 1.50s:
(pprof screenshot attached)

- update the perf benchmark test to actually use a changeMap
@jshaughn
Collaborator Author

jshaughn commented Mar 7, 2025

On this PR, with no config changes forced, 1.50s:

This is good; it means the code optimization seems to be having a positive impact. The interesting thing is that I don't think the benchmark test was passing in a changeMap, so I don't think the change detection logic was actually in use :-)

I just pushed a few changes, mainly updating some of the comments, but I also added the change detection to the benchmark. If you could repeat the test, we can see whether the change detection helps or hurts in this scenario. It would definitely hurt for only one or two validation reconciles, but after that it would hopefully help. It's really a question of whether the map manipulation (and key construction) is more costly than running the checkers.

@hhovsepy
Contributor

Rerun on the latest changes:
(pprof screenshot attached)

@jshaughn
Collaborator Author

Rerun on the latest changes:

Excellent, consistent with the prior run. Thanks, @hhovsepy. I'm going to make the enable/disable config public with a small operator PR (default will be enabled). But I think this PR can move to final review. cc @nrfox.

@@ -12,7 +12,7 @@ import (
 type WorkloadGroupReferences struct {
 	WorkloadGroups        []*networking_v1.WorkloadGroup
 	WorkloadEntries       []*networking_v1.WorkloadEntry
-	WorkloadsPerNamespace map[string]models.WorkloadList
+	WorkloadsPerNamespace map[string]models.Workloads
Contributor

Looks like a lot of churn in this PR going from WorkloadList --> Workloads?

Collaborator Author

Yes, it generated massive churn but also a lot of savings. After looking closely I realized there really wasn't any need to be using WorkloadList, and we were doing a lot of work and object creation when generating them.

@@ -281,9 +282,10 @@ func (in *WorkloadService) GetWorkloadList(ctx context.Context, criteria Workloa
return *workloadList, err
}

config := config.Get()
Contributor

Instead of using config.Get(), use in.config. The config only needs to be created once in main.go, when we first parse it from the configmap, and it can be passed down from there. It does not need to be a global var. We're already passing it into the Workload service.
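A hedged sketch of that pattern (the constructor and getConfig method here are hypothetical; only config.Get() and the injected in.config field come from the discussion):

package business

import "github.com/kiali/kiali/config"

// WorkloadService receives the already-parsed config instead of reaching for
// the global config.Get() at call time.
type WorkloadService struct {
	config *config.Config
	// ...other dependencies
}

// NewWorkloadService is a hypothetical constructor showing the injection.
func NewWorkloadService(cfg *config.Config) *WorkloadService {
	return &WorkloadService{config: cfg}
}

func (in *WorkloadService) getConfig() *config.Config {
	return in.config // injected once at startup, no global lookup
}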

Collaborator Author

+1, I just realized while working on the "Get()" PR that the business services already offer up the config. I'll make the change...


 	// validation requires cross-cluster service account information.
-	vInfo, err := r.validationsService.NewValidationInfo(ctx, r.clusters)
+	vInfo, err := r.validationsService.NewValidationInfo(ctx, r.clusters, changeMap)
Contributor

The interface here is a little confusing to me. You need to pass a changeMap into NewValidationInfo, then you pass vInfo to Validate(..., vInfo), and then you check whether Validate(..., vInfo) returns nil to determine if validations were skipped or not?

We already have the previous changeMap and we can get the new changeMap from vInfo. Can we just compare the two change maps and only call Validate when they differ, rather than having to pass the changeMap into Validate and then check if the result is nil? Generally, when functions return nil even though err == nil, it results in nil pointer derefs down the road.

Some version of:

if oldChangeMap[cluster].hash != newChangeMap[cluster].hash {
	Validate(ctx, cluster, vInfo)
} else {
	// no changes; keep the previous validations
}

Collaborator Author

I agree, it's messy and not a good idea to have a nil return driving the logic here. But I don't really want the caller dealing with the changeMap in any way other than providing and storing it. So, how about we do sort of a "Go-ish" thing and return a bool indicating whether we actually performed validation, or skipped it because there were no changes?

I'll also improve the function docs a bit. Let me know what you think.
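A hedged sketch of those semantics; the stub types and the hasChanges helper below are stand-ins for the PR's real ValidationInfo/IstioValidationsService, and the actual signature may differ:

package validations

import "context"

// ValidationInfo and IstioValidationsService are stand-ins for the PR's real
// types; hasChanges is a hypothetical helper.
type ValidationInfo struct{ changedClusters map[string]bool }

func (v *ValidationInfo) hasChanges(cluster string) bool { return v.changedClusters[cluster] }

type IstioValidationsService struct{}

// Validate reports whether validation was actually performed (true) or
// skipped because no relevant config changed (false).
func (in *IstioValidationsService) Validate(ctx context.Context, cluster string, vInfo *ValidationInfo) (bool, error) {
	if !vInfo.hasChanges(cluster) {
		return false, nil // keep the previous validations
	}
	// ...run the checkers for the cluster and replace the previous validations...
	return true, nil
}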

Contributor

I still think the change detection should happen outside of Validate but I'm fine with keeping it as is.

Collaborator Author

OK, now I think I see more where you are coming from... I agree that it could be nice if the controller could decide whether to perform the validation, or not, depending on whether it optionally already performed change detection. The issue is that to perform the change detection you need to gather up all of the objects for the validation. Because that's built into the Validate logic, it would be a little more disruptive to pull that out. It's doable, maybe even cleaner, but I'd like to stop short of that, at least for now.

Comment on lines +176 to +177
// ValidationKey is a pre-calculated key string: "cluster:namespace:name"
ValidationKey string
Contributor

Can this be an object, similar to IstioValidationKey, to reduce the risk of getting the string formatting wrong every time? It looks like the comment is already out of date since the key contains Kind as well. I don't think any of the fields (cluster, namespace, name, kind) allow ':' in their values, but if they did that would also cause issues.

Collaborator Author

I'm not sure I want to make this an object; the idea is to pre-calculate an identifying string once, when the workload is pulled in. It will not be parsed; it's only used as the key in the changeMap to identify this workload in the future. I'm fairly sure there is only one workload with a given cluster:namespace:name combo, but if you think it needs to be more complex I guess we could add to it, or hash various info.

Kind is not currently used in this identifier; it's set as described in WorkloadsService.fetchWorkloadsFromCluster. Maybe you saw the keys being generated for the Istio config? Those are not currently pre-calculated, and they do have a hacky "Kind" that I made up.

Collaborator Author

@nrfox Maybe we can move this PR forward as-is, but thinking more, maybe it would make sense to pre-calculate an IstioValidationKey for all of the objects when they are pulled in, and then it would be available for use in the changeMap and also everywhere else it is currently [re-]constructed.

Contributor

Then having some kind of method to construct the key would be nice, something like validations.ValidationWorkloadKey(cluster, ...). Ideally workload parsing would know nothing about the internals of validations, since validations are independent of parsing a workload, but it's currently a mix. We can keep it as is for the time being.
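A minimal sketch of such a helper, using the key format discussed above; the function name follows the example given here and the package placement is hypothetical:

package validations

import "fmt"

// ValidationWorkloadKey builds the "cluster:namespace:name" change-map key in
// one place so callers never hand-format the string.
func ValidationWorkloadKey(cluster, namespace, name string) string {
	return fmt.Sprintf("%s:%s:%s", cluster, namespace, name)
}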

Collaborator Author

Not unlike my above comment, I like the cleanliness and separation of concerns, but to some degree that ship has sailed. We have a bunch of calculated information hanging off of the Workloads. I mean, WorkloadListItem is horrible; it's actually a server-side construct designed to support a UI list page. Still, in another PR maybe we can improve this validation key stuff and not keep piling on. Like you say, we could at least extract the key generation logic, if not the storage location.

Co-authored-by: Hayk Hovsepyan <haik.hovsepian@gmail.com>
@jshaughn
Collaborator Author

@hhovsepy @nrfox thanks for the feedback; I pulled in Hayk's suggestion, pushed some changes, and closed the operator PR...

@jshaughn jshaughn requested review from hhovsepy and nrfox March 11, 2025 01:23
- improve the Validate semantics
- avoid config.Get()
Contributor

@hhovsepy hhovsepy left a comment

LGTM, checked locally.

}

// loop through the config and gather up their resourceVersions
config := vInfo.clusterInfo.istioConfig
Contributor

FWIW if you want to work with a generic kube object you can use runtime.Object and convert from there with something like: https://pkg.go.dev/k8s.io/apimachinery@v0.32.2/pkg/api/meta#Accessor to access fields common to all kube objects. Not sure it would help here though because of the way istioConfig is structured where each type has its own field.
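A small sketch of that approach for reference (the helper name is hypothetical; meta.Accessor and runtime.Object are real apimachinery APIs):

package validations

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime"
)

// resourceVersions pulls the ResourceVersion from arbitrary kube objects via
// the generic metadata accessor, avoiding per-type field access.
func resourceVersions(objs []runtime.Object) ([]string, error) {
	versions := make([]string, 0, len(objs))
	for _, obj := range objs {
		m, err := meta.Accessor(obj)
		if err != nil {
			return nil, fmt.Errorf("object has no metadata: %w", err)
		}
		versions = append(versions, m.GetResourceVersion())
	}
	return versions, nil
}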

Collaborator Author

I think this is OK as-is.

@nrfox
Contributor

nrfox commented Mar 11, 2025

@jshaughn can we keep #8007 open? I don't think this PR should close that issue. Not all of the outstanding tasks have been addressed and it's yet to be seen if this approach will provide a significant enough perf gain to close out that issue.

@hhovsepy
Contributor

I see this error in the Kiali logs of the Jenkins Perf OCP cluster:
'{"level":"error","ts":"2025-03-11T16:17:30Z","msg":"Reconciler error","controller":"validations-controller","object":{"name":"queue","namespace":"queue"},"namespace":"queue","name":"queue","reconcileID":"0dba47cd-7abd-4957-ae91-a1c2fb827706","error":"validations have been updated since reconciling started. Requeuing validation","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/fedora/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/fedora/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/home/fedora/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:224"}'

It seems this happens when the reconciliation is starting while some Istio configs are being created or deleted. But I'm not sure; it can also happen randomly. There is no periodicity.

@nrfox
Contributor

nrfox commented Mar 11, 2025

'{"level":"error","ts":"2025-03-11T16:17:30Z","msg":"Reconciler error","controller":"validations-controller","object":{"name":"queue","namespace":"queue"},"namespace":"queue","name":"queue","reconcileID":"0dba47cd-7abd-4957-ae91-a1c2fb827706","error":"validations have been updated since reconciling started. Requeuing validation","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/fedora/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/fedora/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/home/fedora/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:224"}'

This is because of this:

if r.kialiCache.Validations().Version() != version {
return ctrl.Result{}, fmt.Errorf("validations have been updated since reconciling started. Requeuing validation")
}

If another thread performs validations and updates the validations cache, then the validations need to start over so as not to overwrite the new changes with old changes.

I don't think it's related to this PR.

@hhovsepy
Contributor

If another thread performs validations and updates the validations cache, then the validations need to start over so as not to overwrite the new changes with old changes.

I don't think it's related to this PR.

Yes, I agree; I just noticed it recently while testing this PR.

@jshaughn jshaughn merged commit 19c1902 into kiali:master Mar 12, 2025
10 checks passed
@jshaughn jshaughn deleted the kiali8007-hash branch March 12, 2025 13:43
Labels
requires operator PR (It requires update in operator code)
Projects
Status: Done
3 participants