
Performance problem: EKS control plane overwhelmed when applying large number of workloads with MeshServices enabled #12723

Closed
bartsmykla opened this issue Jan 31, 2025 · 2 comments

bartsmykla (Contributor) commented on Jan 31, 2025

Warning

This issue is a work in progress and describes the current state of knowledge about the problem.

Kuma Version

any

Describe the bug

Summary

When applying a large number of services using MeshServices in performance tests, the control plane becomes overwhelmed at around 600-700 services. This leads to excessive logging, issues with syncing deployments, and eventually, etcd failures that cause Kubernetes control plane components to restart.

Observed issues

  1. Frequent deployment sync errors in kube-controller-manager

    "Error syncing deployment" deployment="kuma-test/fake-service-001" err="Operation cannot be fulfilled on deployments.apps \"fake-service-001\": the object has been modified; please apply your changes to the latest version and try again"
    

    These errors appear repeatedly for many deployments.

  2. Endpoint slice errors before etcd becomes unstable

    "Error syncing endpoint slices for service, retrying" key="kuma-test/fake-service-017" err="EndpointSlice informer cache is out of date"
    

    These errors happen across multiple services, causing delays and resource contention.

  3. Complete failure of etcd connections

    • The system eventually cannot handle the load, leading to failures in etcd.
    • Kubernetes control plane components restart as a result.

To Reproduce

  1. Enable exclusive MeshServices for the mesh.
  2. Deploy a large number of services using fake-service (e.g., 1000 services); a rough sketch of such a setup follows this list.
  3. At around 600-700 services, control plane logs start filling up with errors.
  4. Kubernetes components (kube-controller-manager, endpointslice_controller) log multiple retries and failures.
  5. Eventually, connections to etcd fail completely, leading to control plane restarts.
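
For reference, here is a minimal client-go sketch of steps 1-2 (an illustration, not the actual perf-test harness). It assumes exclusive MeshServices are already enabled on the default Mesh (e.g. `meshServices.mode: Exclusive` on the Mesh resource in recent Kuma versions), that the kuma-test namespace exists with sidecar injection enabled, and that the fake-service image/tag below is an assumption:

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func ptr[T any](v T) *T { return &v }

func main() {
	// Assumes: the default Mesh has exclusive MeshServices enabled and the
	// kuma-test namespace is labeled kuma.io/sidecar-injection: enabled.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	const ns = "kuma-test"
	for i := 1; i <= 1000; i++ {
		name := fmt.Sprintf("fake-service-%03d", i)
		labels := map[string]string{"app": name}

		deploy := &appsv1.Deployment{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: appsv1.DeploymentSpec{
				Replicas: ptr(int32(2)), // 2 replicas per service, as in the test runs below
				Selector: &metav1.LabelSelector{MatchLabels: labels},
				Template: corev1.PodTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{Labels: labels},
					Spec: corev1.PodSpec{Containers: []corev1.Container{{
						Name:  "fake-service",
						Image: "nicholasjackson/fake-service:v0.25.2", // assumed image/tag
						Ports: []corev1.ContainerPort{{ContainerPort: 9090}},
					}}},
				},
			},
		}
		svc := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
			Spec: corev1.ServiceSpec{
				Selector: labels,
				Ports:    []corev1.ServicePort{{Port: 80, TargetPort: intstr.FromInt(9090)}},
			},
		}

		if _, err := cs.AppsV1().Deployments(ns).Create(context.TODO(), deploy, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
		if _, err := cs.CoreV1().Services(ns).Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```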

Expected behavior

No response

Additional context (optional)

logs-insights-results-perf-tests-debugging-250131.json

Issues started around 09:50 (four images attached).

@bartsmykla bartsmykla added area/kuma-cp area/performance kind/bug A bug triage/accepted The issue was reviewed and is complete enough to start working on it labels Jan 31, 2025
@bartsmykla bartsmykla added this to the 2.10.x milestone Jan 31, 2025
@bartsmykla bartsmykla self-assigned this Jan 31, 2025
bartsmykla (Contributor, Author) commented on Feb 14, 2025

Additional data (WIP):

The details below represent perf test runs for a given number of services (first number) * number of replicas per service (second number).

The pink areas on the graphs show when the `should distribute certs when mTLS is enabled` test ran, and the grey areas show the duration of the kuma-test namespace deletion.

  • 100 * 2: [graph]
  • 200 * 2: [graph]
  • 300 * 2: [graph]
  • 400 * 2: [graph]
  • 500 * 2: [graph]
  • 600 * 2: [graph]
  • 700 * 2: [graph]
  • 800 * 2: [graph]
  • 900 * 2: [graph]
  • 1000 * 2: [graph]
  • 1000 * 2 (run in the same cluster after the earlier 1000 * 2 test run): [graph]

bartsmykla (Contributor, Author) commented on Feb 21, 2025

Warning

This comment is a work in progress.

Summary of Investigation on Performance Issues

After switching our performance tests to use MeshServices, we observed consistent failures, primarily due to request timeouts in the AfterEach block when pushing metrics to Prometheus. An investigation was initiated based on the assumption that introducing MeshServices had significantly degraded performance.

Investigation Process & Findings

  1. Initial Assumption: Prometheus Overload

    • Suspected that Prometheus' node was overloaded due to test services running on the same node.
    • Isolated Prometheus to a dedicated node, but tests continued to fail.
  2. Control Plane Logs Indicated ETCD Issues

    • Initially interpreted logs as ETCD crashing, leading to full control plane restarts.
    • Assumed that our mutating/validating webhooks might be overloading ETCD.
    • This turned out to be incorrect; the actual issue was API server instances exceeding resource limits.
  3. API Server Scaling Issues

    • API server instances had a default limit of 1 CPU and 2GB RAM.
    • During tests, resource usage exceeded this limit, triggering EKS to scale the control plane.
    • API server scaling caused request timeouts when requests hit instances being terminated or were delayed due to overload.
  4. ETCD Calls Analysis

    • Event Emission: Initially suspected excessive event emissions related to MeshServices but confirmed events only occur on creation or update, so initial bursts are expected but not problematic.
    • Pod Calls vs. Dataplane Calls: Higher pod-related calls were observed; this was traced back to controllers watching for pod changes, especially during sidecar injection (see the sketch after this list).
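
To illustrate the pattern behind the pod-vs-dataplane observation, here is a minimal controller-runtime sketch (illustrative only; `PodToDataplaneReconciler` is a stand-in name, not Kuma's actual controller wiring). The point is that watching Pods turns every pod event during a rollout into a reconcile plus follow-up reads and a write of the derived resource:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PodToDataplaneReconciler is an illustrative stand-in for a controller that
// derives mesh resources from Pods. Every Pod event in the cluster enqueues a
// reconcile request.
type PodToDataplaneReconciler struct {
	client.Client
}

func (r *PodToDataplaneReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	pod := &corev1.Pod{}
	if err := r.Get(ctx, req.NamespacedName, pod); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Each reconcile typically needs more context (Services, EndpointSlices)
	// to build the derived resource, and ends with a write of that resource.
	// During a rollout of hundreds of Deployments, every newly injected Pod
	// goes through this path, which is where the extra pod-related load adds up.
	svcs := &corev1.ServiceList{}
	if err := r.List(ctx, svcs, client.InNamespace(pod.Namespace)); err != nil {
		return ctrl.Result{}, err
	}
	_ = svcs // ... derive and upsert the Dataplane-like resource here ...
	return ctrl.Result{}, nil
}

func (r *PodToDataplaneReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}). // every Pod create/update/delete enqueues work
		Complete(r)
}
```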

Observations & Open Issues

  • MeshServices Calls to ETCD Without Sidecar Injection

    • MeshServices controller still sends delete requests when labels change on namespaces, K8s Services, or EndpointSlices.
    • Needs further assessment for potential optimization.
  • Errors During Namespace Deletion

    • Observed recurring errors such as:
      ERROR discovery.k8s.pod-to-dataplane-converter could not get K8S Service for service tag {"error": "failed to get Service \"kuma-test/fake-service-799\": Service \"fake-service-799\" not found"}
      
      ERROR xds.sink failed to flush DataplaneInsight {"dataplaneid": {"Mesh":"default","Name":"fake-service-736-6b479fb987-zjsxr.kuma-test"}, "error": "failed to create k8s resource: dataplaneinsights.kuma.io \"fake-service-736-6b479fb987-zjsxr\" is forbidden: unable to create new content in namespace kuma-test because it is being terminated"}
      
    • Issue opened: Avoid unnecessary processing when namespace is terminating #12858 (a rough sketch of the idea follows this list)
      • The first error comes from the configmap controller and should likely be logged at debug level instead of as an error.
      • The second error requires further investigation.
  • High Resource Consumption by Webhook

    • mesh.defaulter webhook identified as the highest resource consumer.
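
Regarding the namespace-deletion errors and #12858, here is a rough sketch of the kind of guard that could avoid the second error (hedged; `namespaceTerminating` and `flushInsight` are hypothetical helpers, not Kuma code): check whether the target namespace is terminating before attempting the write, and skip it quietly in that case.

```go
package sink

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// namespaceTerminating reports whether the namespace is in the Terminating
// phase, in which case creating new objects in it is forbidden anyway.
func namespaceTerminating(ctx context.Context, c client.Client, name string) (bool, error) {
	ns := &corev1.Namespace{}
	if err := c.Get(ctx, types.NamespacedName{Name: name}, ns); err != nil {
		if apierrors.IsNotFound(err) {
			return true, nil // namespace is already gone; nothing to flush
		}
		return false, err
	}
	return ns.Status.Phase == corev1.NamespaceTerminating, nil
}

// flushInsight shows how a flush/convert path could use the guard: instead of
// surfacing a "namespace is being terminated" error, the write is skipped
// (or logged at debug level).
func flushInsight(ctx context.Context, c client.Client, namespace string, write func() error) error {
	terminating, err := namespaceTerminating(ctx, c, namespace)
	if err != nil {
		return err
	}
	if terminating {
		return nil // drop the insight quietly; the namespace is going away
	}
	return write()
}
```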

Bug Identified & Fix
