Skip validation checkers when there are no config changes #8206
Conversation
Where this is most needed for performance gains (large clusters with lots of config/workloads/services), I don't think this approach will provide enough of a perf gain in practice, because workloads in particular will be constantly changing, so the odds of the validations being skipped will be low. The resource version changes even when the status updates on an object, so pods coming and going would invalidate the hash. This would be more work to set up initially, but if we can validate a single object when it changes, or when any of its related objects change, rather than looking at every object and processing all validations when a single object changes, that would be much more efficient. Some changes in objects we also don't care about and it would be good to filter those out too. We probably only care if the…
The current way the validations controller processes validations, where it processes all validations on a timer, was the simplest way to get started but is by no means the most efficient way of doing it.
It can be a very useful approach to reducing the CPU load of validation reconciliation: instead of configuring longer reconcile intervals, we can keep the interval short but not trigger the validation process when nothing has changed.
OK, that's super annoying, and I didn't expect that behavior. I thought ResourceVersion was independent of status. But I agree that pods coming and going would mostly remove the benefit in a large production env. The intent was to capture an actual change that could affect validation. I agree that single-object validation granularity is probably feasible, but definitely tricky, given the references and the fact that a lot of information can be required to validate even one object. Like everything, this comes down to effort and payoff; it's not clear to me that the most performant solution is required, at least not yet. We could just stop here and push out the optimizations in the PR we just completed, or I could go forward with this and try hashing workload info. Why did k8s not make it easier to detect non-status workload changes!
So, here's an interesting thing. I started digging down into the checkers and what they do with workloads. And from what I can tell, the only information that is used in validation checkers, aside from immutable stuff like name, namespace, gvk, are the labels. So actually, the only real change to a workload that we'd need to detect is a change to its labels.
I updated the POC to not worry about the workload's ResourceVersion, unusable as described above. The new approach is to just target changes to mutable workload info that is actually used in the validation checkers, which turns out to be just labels (at least at the moment).
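To make the label-hashing idea concrete, here is a minimal, self-contained sketch (the function name and use of sha256 are assumptions for illustration, not the PR's actual code):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// hashLabels builds a deterministic digest of a workload's labels by sorting the
// keys before hashing, so map iteration order cannot change the result. Only a
// real label change (the mutable info the checkers use) alters the hash.
func hashLabels(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s;", k, labels[k])
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	fmt.Println(hashLabels(map[string]string{"app": "reviews", "version": "v2"}))
}
```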
controller/validations.go
Outdated
@@ -96,6 +96,8 @@ type ValidationsReconciler struct {
	validationsService *business.IstioValidationsService
}

var changeMap = map[string]string{}
This should not be a global var. See comments above about making the changeMap internal to ValidationInfo, but either way we'll need to save some state about the current hash. This should go into the Kiali cache. Not the whole validationInfo object, but at least the hash, I guess per cluster. cache.Validations() could return an object that has the hash(es) in there along with the validations, for example.
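For illustration, the suggested cache shape could look roughly like this; the type and method names are stand-ins, not the actual Kiali cache API:

```go
package cache

import "sync"

// ValidationsWithHashes bundles the cached validations with the per-cluster
// hash from the last reconcile, so no global state is needed in the controller.
type ValidationsWithHashes struct {
	Validations map[string]any    // stand-in for the real validations type
	Hashes      map[string]string // cluster name -> last computed config hash
}

type validationsCache struct {
	lock    sync.RWMutex
	current ValidationsWithHashes
}

// Validations returns the validations along with the stored hashes, so the
// reconciler can compare them against freshly computed hashes.
func (c *validationsCache) Validations() ValidationsWithHashes {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.current
}

// SetValidations stores new validations and the hashes they were built from.
func (c *validationsCache) SetValidations(v ValidationsWithHashes) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.current = v
}
```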
+1, I knew it had to change, I just wasn't sure what plumbing to do, and so left it easy for the POC. Before more work I'd like @hhovsepy to run another one of his perf comparisons, to see what's what...
Can we run the perf comparison on PRs?
rerunning the perf test on this PR
@nrfox I've moved this into the cache, let me know if it looks OK to you.
@jshaughn based on this ^^ it doesn't seem like the optimization will have much impact. Although, like @hhovsepy mentioned, there is a lot of variability in the results due to the environment and what we are testing.
I think our cypress perf tests are not the right ones to check on this PR. It requires a pprof comparison of a long-running perf environment.
I agree, I'm not sure this is best measured by these tests. Remember that this does not affect individual object validations, only the full validation reconciles. So, the impact may be harder to measure; it's mostly about overall server resource usage. The impact is only to avoid running checkers when config doesn't change (at the expense of tracking the config changes). I'm going to make this configurable (undocumented option for now), and maybe we can do that sort of longer-running resource usage comparison.
The measurement can be in this scenario:
a1d3594 to 726b120
- this is too heavy, it looks like we only need to hash workload labels...
hash of labels for the workloads, because it's the only mutable info we use in the checkers.
…ecessary hash changes
Tested locally, tried all possible CRUDs for Istio Configs, Workloads and Services from Kiali UI and from kubectl.
Co-authored-by: Hayk Hovsepyan <haik.hovsepian@gmail.com>
- update the perf benchmark test to actually use a changeMap
This is good, it means the code optimization seems to be having a positive impact. The interesting thing is that I don't think the benchmark test was passing in a changeMap. I just pushed a few changes, mainly updating some of the comments, but I also added the change detection to the benchmark. If you could repeat the test again, we can see whether the change detection helps or hurts in this scenario. It would definitely hurt for only one or two validation reconciles, but after that it would hopefully help. It's really a question of whether the map manipulation (and key construction) is more costly than running the checkers.
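As a rough illustration of the trade-off being measured (key construction and map lookups vs. re-running the checkers), a standalone benchmark could look something like this; the helper and key format are hypothetical, not Kiali's actual benchmark code:

```go
package validations

import (
	"fmt"
	"testing"
)

// buildChangeKey is a hypothetical helper mirroring the "cluster:namespace:name" key idea.
func buildChangeKey(cluster, namespace, name string) string {
	return fmt.Sprintf("%s:%s:%s", cluster, namespace, name)
}

// BenchmarkChangeDetection simulates repeated reconciles over unchanged workloads:
// after the first pass, key construction and map lookups are the only cost.
func BenchmarkChangeDetection(b *testing.B) {
	changeMap := map[string]string{}
	skipped := 0
	for i := 0; i < b.N; i++ {
		changed := false
		for w := 0; w < 1000; w++ {
			key := buildChangeKey("east", "bookinfo", fmt.Sprintf("wl-%d", w))
			hash := "labels-hash" // stand-in for the per-workload label hash
			if changeMap[key] != hash {
				changeMap[key] = hash
				changed = true
			}
		}
		if !changed {
			skipped++ // the checkers would be skipped on every pass after the first
		}
	}
	_ = skipped
}
```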
@@ -12,7 +12,7 @@ import (
type WorkloadGroupReferences struct {
	WorkloadGroups []*networking_v1.WorkloadGroup
	WorkloadEntries []*networking_v1.WorkloadEntry
	WorkloadsPerNamespace map[string]models.WorkloadList
	WorkloadsPerNamespace map[string]models.Workloads
Looks like a lot of churn in this PR going from WorkloadList --> Workloads?
Yes, it generated massive churn but also a lot of savings. After looking closely I realized there really wasn't any need to be using WorkloadList, and we were doing a lot of work and object creation when generating them.
business/workloads.go
Outdated
@@ -281,9 +282,10 @@ func (in *WorkloadService) GetWorkloadList(ctx context.Context, criteria Workloa
	return *workloadList, err
}

config := config.Get()
Instead of using config.Get(), use in.config. The config only needs to be created once in main.go when we first parse it from the configmap. It can be passed down from there. It does not need to be a global var. We're already passing it into the Workload service.
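A small sketch of that pattern with simplified stand-in types: the config is parsed once, injected into the service, and methods read in.config rather than calling a package-level Get():

```go
package main

import "fmt"

// Config is a simplified stand-in for the Kiali configuration.
type Config struct {
	ClusterName string
}

// WorkloadService receives the config once at construction time.
type WorkloadService struct {
	config *Config
}

func NewWorkloadService(cfg *Config) *WorkloadService {
	return &WorkloadService{config: cfg}
}

func (in *WorkloadService) GetWorkloadList() {
	// use the injected config rather than a global lookup
	fmt.Println("building workload list for cluster:", in.config.ClusterName)
}

func main() {
	cfg := &Config{ClusterName: "east"} // in reality parsed once from the configmap in main.go
	svc := NewWorkloadService(cfg)
	svc.GetWorkloadList()
}
```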
+1, I just realized while working on the "Get()" PR that the business services already offer up the config. I'll make the change...
// validation requires cross-cluster service account information.
vInfo, err := r.validationsService.NewValidationInfo(ctx, r.clusters)
vInfo, err := r.validationsService.NewValidationInfo(ctx, r.clusters, changeMap)
The interface here is a little confusing to me. You need to pass in a changeMap to NewValidationInfo, then you pass vInfo to Validate(..., vInfo), and then you check if Validate(..., vInfo) returns nil to determine whether validations were skipped or not?
We have the previous changeMap already and we can get the new changeMap from vInfo. Can we just compare the two change maps and only call Validate when they have changed, rather than having to pass the changeMap into Validate and then check if the result is nil? Generally when functions return nil even when err == nil it results in nil pointer derefs down the road.
Some version of:
if oldChangeMap[cluster].hash != newChangeMap[cluster].hash {
	Validate(ctx, cluster, vInfo)
} else {
	// no changes
}
I agree, it's messy and not a good idea to have a nil return driving the logic here. But I don't really want the caller dealing with the changeMap in any way other than providing and storing it. So, how about if we do sort of a "Go-ish" thing and return a bool indicating whether or not we actually performed validation, or whether it was skipped due to no changes?
I'll also improve the function docs a bit. Let me know what you think.
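For illustration, the "return a bool" signature could look roughly like this; the ValidationInfo fields and the checker call are simplified placeholders, not the real ones:

```go
package validations

import "context"

// ValidationInfo carries the previously stored and freshly computed per-cluster hashes.
type ValidationInfo struct {
	oldHashes map[string]string // hashes from the previous reconcile
	newHashes map[string]string // hashes computed from the gathered config
}

// Validate reports whether the checkers actually ran, instead of signalling a
// skip with a nil result that callers might dereference.
func Validate(ctx context.Context, cluster string, vInfo *ValidationInfo) (bool, error) {
	if vInfo.oldHashes[cluster] == vInfo.newHashes[cluster] {
		return false, nil // nothing changed; checkers skipped
	}
	// ... run the checkers and update the cached validations/hashes here ...
	return true, nil
}
```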
I still think the change detection should happen outside of Validate, but I'm fine with keeping it as is.
OK, now I think I see more where you are coming from... I agree that it could be nice if the controller could decide whether to perform the validation, or not, depending on whether it optionally already performed change detection. The issue is that to perform the change detection you need to gather up all of the objects for the validation. Because that's built into the Validate logic, it would be a little more disruptive to pull that out. It's doable, maybe even cleaner, but I'd like to stop short of that, at least for now.
// ValidationKey is a pre-calculated key string: "cluster:namespace:name"
ValidationKey string
Can this be an object similar to IstioValidationKey to reduce the risk of getting the formatting wrong every time with the string? It looks like the comment is already out of date since the key contains Kind as well. I don't think any of the fields (cluster, namespace, name, kind) allow : in the values, but if they did that would also cause issues.
I'm not sure I want to make this an Object; the idea is to pre-calculate an identifying string one time when the workload is pulled in. It will not be parsed, it's only used as the key in the changeMap to identify this workload in the future. I'm fairly sure there is only one workload with the cluster:namespace:name combo, but if you think it needs to be more complex I guess we could add to it, or perform a hash of various info.
Kind is not currently used in this identifier; it's set as described in WorkloadsService.fetchWorkloadsFromCluster. Maybe you saw the keys being generated for the Istio config? Those are not currently pre-calculated, and they do have a hacky "Kind" that I made up.
@nrfox Maybe we can move this PR forward as-is, but thinking more, maybe it would make sense to pre-calculate an IstioValidationKey for all of the objects when they are pulled in, and then it would be available for use in the changeMap and also everywhere else it is currently [re-]constructed.
Then having some kind of method to construct the key would be nice, something like validations.ValidationWorkloadKey(cluster, ...). Ideally workload parsing would know nothing about the internals of validations, since validations are independent of parsing a workload, but it's currently a mix. We can keep it as is for the time being.
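A sketch of what such a helper could look like, mirroring the IstioValidationKey idea; these names are illustrative, not existing Kiali APIs:

```go
package validations

import "fmt"

// WorkloadKey identifies a workload for change tracking.
type WorkloadKey struct {
	Cluster   string
	Namespace string
	Name      string
}

// String renders the key in the one place that knows the format.
func (k WorkloadKey) String() string {
	return fmt.Sprintf("%s:%s:%s", k.Cluster, k.Namespace, k.Name)
}

// ValidationWorkloadKey is the helper shape suggested above, so callers never
// hand-build the "cluster:namespace:name" string themselves.
func ValidationWorkloadKey(cluster, namespace, name string) string {
	return WorkloadKey{Cluster: cluster, Namespace: namespace, Name: name}.String()
}
```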
Not unlike my above comment, I like the cleanliness and separation of concerns. But to some degree that ship has sailed. We have a bunch of calculated information hanging off of the Workloads. I mean, WorkloadListItem is horrible; it's actually a server-side construct designed to support a UI list page. Still, in another PR maybe we can improve this validation key stuff, and not keep piling on. Like you say, we could at least maybe extract the key generation logic, if not the storage location.
Co-authored-by: Hayk Hovsepyan <haik.hovsepian@gmail.com>
- improve the Validate semantics - avoid config.Get()
e453eb6 to 2571be7
LGTM, checked locally.
}

// loop through the config and gather up their resourceVersions
config := vInfo.clusterInfo.istioConfig
FWIW if you want to work with a generic kube object you can use runtime.Object and convert from there with something like https://pkg.go.dev/k8s.io/apimachinery@v0.32.2/pkg/api/meta#Accessor to access fields common to all kube objects. Not sure it would help here though, because of the way istioConfig is structured, where each type has its own field.
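As a self-contained example of the meta.Accessor approach, this reads the fields common to all kube objects (ResourceVersion and labels) from a core Service without knowing its concrete type:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// commonFields extracts ResourceVersion and labels from any kube object.
func commonFields(obj runtime.Object) (string, map[string]string, error) {
	accessor, err := meta.Accessor(obj)
	if err != nil {
		return "", nil, err
	}
	return accessor.GetResourceVersion(), accessor.GetLabels(), nil
}

func main() {
	svc := &corev1.Service{ObjectMeta: metav1.ObjectMeta{
		Name:            "reviews",
		ResourceVersion: "12345",
		Labels:          map[string]string{"app": "reviews"},
	}}
	rv, labels, err := commonFields(svc)
	fmt.Println(rv, labels, err)
}
```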
I think this is OK as-is.
I see this error in Kiali logs of the Jenkins Perf OCP cluster. Seems this happens when the reconciliation is starting while some Istio configs are being created or deleted. But not sure, this can also happen randomly; there is no periodicity.
This is because of this: kiali/controller/validations.go Lines 136 to 138 in 05fc8c3
If another thread performs validations and updates the validations cache, then the validations need to start over so as not to overwrite the new changes with old changes. I don't think it's related to this PR.
Yes, I agree, just noticed recently while testing this PR.
Part-of #8007
The idea here is to generate a hash for config used in validations, using config ResourceVersion where applicable, and otherwise hashing relevant checker information. For example, we can't use Workload ResourceVersion because it changes when status is updated (and so is sensitive to pods coming and going). If the hash does not change at validation Reconcile time, then skip running the checkers.
This solution is something of a trade-off. It avoids the complexity needed for callback-based watchers on config, or other very granular approaches. But each reconcile still needs to gather the related config, and generate the hash. The savings is in being able to avoid running the checkers. In the data presented in #8007 the checkers accounted for a lot of the validation time, so the hope is that this will be helpful, and push off any need to do something more complicated.
The algorithm works like this:
- At each validation reconcile, gather the config used in validations and build a per-cluster change map, using ResourceVersion where applicable and otherwise hashing the relevant checker information (for workloads, the labels).
- Compare the result against the change map stored from the previous reconcile.
- If nothing has changed, skip running the checkers; otherwise run them and store the new change map.
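As a condensed sketch of that decision (function names are hypothetical, not the PR's code), the reconcile-time check could look like this:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// clusterHash folds the per-object change entries (ResourceVersion or label hash)
// into one deterministic digest for the cluster.
func clusterHash(changeMap map[string]string) string {
	keys := make([]string, 0, len(changeMap))
	for k := range changeMap {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s;", k, changeMap[k])
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

// reconcile runs the checkers only when the cluster's hash differs from the cached one.
func reconcile(cluster string, changeMap, cachedHash map[string]string) bool {
	newHash := clusterHash(changeMap)
	if cachedHash[cluster] == newHash {
		return false // nothing changed since the last reconcile; skip the checkers
	}
	// ... run the validation checkers here ...
	cachedHash[cluster] = newHash
	return true
}

func main() {
	cached := map[string]string{}
	cm := map[string]string{"east:bookinfo:reviews": "rv-100"}
	fmt.Println(reconcile("east", cm, cached)) // true: first run, checkers run
	fmt.Println(reconcile("east", cm, cached)) // false: unchanged, checkers skipped
}
```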
This also introduces a hidden config option to enable or disable this logic (default is enabled). The option is mainly a fail-safe, just in case we find a flaw in the logic that prevents validation re-generation, or somehow ends up adding to resource consumption.
Operator PR: kiali/kiali-operator#885