Hanging Pod sandboxes that time out during creation and fail to be cleaned up #430
Comments
Hey @marosset, a gentle ping in case there are any updates? |
Thanks for flagging this @ibabou, how frequently have you seen this timeout issue occurring? And does it affect one pod creation at a time, or multiple pod creations on the same node over a given period of time? |
Looking into this. The issue being described here seems similar to kubernetes/kubernetes#107561 |
@ibabou what CNI and version is being used? How many pods/containers does the node have? |
Thanks for looking into this! @fady-azmy-msft, the customers see the issue maybe once every 2-3 weeks, but we might also have missed occurrences in other clusters that went unnoticed. Regarding the occurrences, what I have noticed is that it affects one pod, which stays stuck as mentioned above for a long period; attempts to recreate the sandbox keep failing, although other pods seem to schedule fine or auto-recover with recreation. @kiashok Agreed, the behaviour matches what James describes in (2) here. I see the same log with the initial timeout due to the syscall and the later failures to delete the shim or recreate the sandbox. Tagging @jsturtevant here as well. About the number of pods, I don't have an exact number as this cluster/nodepool is no longer available, but I noticed that other pods were being created without getting into this stuck state. There was no CPU or memory starvation (not sure whether multiple pods being created simultaneously could be a trigger). The sdn CNI used is v0.2; it's built up to this commit. |
@ibabou we are continuing to take a look and will keep you posted on updates. Primarily seeing two main failures from the logs you shared with @MikeZappa87, and I suspect this could be some contention happening at scale, although I am not very sure of that just yet. The networking team is also taking a look. Do you have any repro steps that you could share? The key failure is: failed to destroy network for sandbox "52ad56234561203338af081d24262de1a9b432456c942569f80d2459c40b831f": plugin type="sdnbridge" name="l2bridge" failed (delete): netplugin failed. The accompanying plugin debug output shows the DEL command for that container ID (pod smb-server-monitoring-csi-cron-win-28261076-7jcsb in namespace samba, netns 7c806792-8457-4c13-a8c2-fc8a9c476d73) being processed against the l2bridge/sdnbridge configuration (subnet 172.30.2.0/24, gateway 172.30.2.2, nameserver 172.31.0.10, an OutBoundNAT exceptions policy, and SDNRoute policies for 172.31.0.0/16 and 172.23.160.187/32 with NeedEncap), and the captured log ends at hcn::HostComputeNamespace::RemoveNamespaceEndpoint id=2a352669-639b-4d85-9b4d-7b459a39b19e followed by hcn::HostComputeNamespace::ModifyNamespaceSettings id=7c806792-8457-4c13-a8c2-fc8a9c476d73. |
@ibabou just out of curiosity, is this issue being hit with an updated WS2019 version and containerd/1.7? From the previous GitHub issues I linked, it seems folks didn't run into the issues they saw once they were on containerd/1.7. |
@kiashok We don't have an easy repro unfortunately; it seems to happen on the customer's cluster from time to time. Is there any additional debug info we can ask them to enable/collect so we're ready in case another occurrence happens? With regards to the containerd version: no, all clusters are running on 1.6.X; we haven't moved to containerd 1.7 yet. Are there specific changes in containerd or hcsshim that could explain this not occurring on 1.7 compared to 1.6? |
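On the question of what to enable/collect ahead of time, here is a minimal sketch of node-side state worth snapshotting when a sandbox gets stuck, assuming containerd is registered as a Windows service (so it logs to the Application event log); the output paths are arbitrary examples rather than anything prescribed in this thread:

```powershell
# Illustrative collection sketch for the next occurrence (output paths are arbitrary examples).
New-Item -ItemType Directory -Force C:\k\debug | Out-Null

# Snapshot HNS state so the endpoint/namespace the CNI DEL is stuck on can be inspected later;
# Get-HnsNetwork / Get-HnsEndpoint are built into Windows Server 2019 and later.
Get-HnsNetwork  | ConvertTo-Json -Depth 10 | Out-File C:\k\debug\hns-networks.json
Get-HnsEndpoint | ConvertTo-Json -Depth 10 | Out-File C:\k\debug\hns-endpoints.json

# Recent containerd events around the timeout window (assumes containerd was registered
# as a Windows service and therefore logs to the Application event log).
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; ProviderName = 'containerd' } -MaxEvents 500 |
    Format-List TimeCreated, LevelDisplayName, Message | Out-File C:\k\debug\containerd-events.txt
```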
I've seen that density runs (highly parallel pod creation) improve with k8s 1.28 and containerd/1.7. I haven't dug into which set of commits might be helping with this. |
@ibabou by sdn CNI do you mean Azure CNI? Do you mean sdnbridge? What CNI is the cx using? |
@kiashok no, that's not an Azure CNI. This runs on GKE, and the sdnbridge CNI is used: https://github.com/microsoft/windows-container-networking. The CNI config settings are as follows: |
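The config file itself did not survive in this thread, but its shape can be read back from the parsed configuration printed in the CNI DEL log quoted earlier. The following is only an approximate reconstruction from that log (key casing follows the examples in the windows-container-networking repo, and the second SDNRoute policy carries the node's own /32, so it would differ per node), not the customer's actual file:

```json
{
  "cniVersion": "0.2.0",
  "name": "l2bridge",
  "type": "sdnbridge",
  "ipam": {
    "subnet": "172.30.2.0/24",
    "routes": [ { "GW": "172.30.2.2" } ]
  },
  "dns": {
    "Nameservers": [ "172.31.0.10" ],
    "Search": [ "cluster.local" ]
  },
  "AdditionalArgs": [
    {
      "Name": "EndpointPolicy",
      "Value": {
        "Type": "OutBoundNAT",
        "Settings": {
          "Exceptions": [
            "169.254.0.0/16", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
            "100.64.0.0/10", "192.0.0.0/24", "192.0.2.0/24", "192.88.99.0/24",
            "198.18.0.0/15", "198.51.100.0/24", "203.0.113.0/24", "240.0.0.0/4"
          ]
        }
      }
    },
    {
      "Name": "EndpointPolicy",
      "Value": { "Type": "SDNRoute", "Settings": { "DestinationPrefix": "172.31.0.0/16", "NeedEncap": true } }
    },
    {
      "Name": "EndpointPolicy",
      "Value": { "Type": "SDNRoute", "Settings": { "DestinationPrefix": "172.23.160.187/32", "NeedEncap": true } }
    }
  ]
}
```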
Hey @kiashok , any other updates on the further investigation? |
🔖 ADO 47828142 |
This issue has been open for 30 days with no updates. |
Apologies for the radio silence @ibabou, is this still an issue you're hitting? If so, could you share what K8s version the customer is on and whether they can reproduce the issue on a WS 2022 version? When we looked at this we suspected it might be related to HNS, which we have shipped a lot of fixes to in WS 2022. Also, as @kiashok mentioned in the thread, we have seen better pod density with 1.28 and containerd 1.7, and upgrading may reduce the overall frequency of this issue in case it's related to resource density. |
Hey @fady-azmy-msft, thanks for circling back on this! Yeah, the customer is still experiencing the issue, especially with highly packed density. We have tried containerd 1.6.24 but haven't seen improvements either; containerd 1.7 is still out of scope, but we're considering testing with it. I agree that HNS seems to be the suspect here. I'll follow up on Windows 2022 usage and ask our team to validate whether we see improvements compared to WS2019 LTSC. |
That sounds great @ibabou. I'll wait to hear from you on whether WS 2022 still triggers the issue. |
This issue has been open for 30 days with no updates. |
Hey @ibabou, I'm going to close the issue since it's been a while without a repro; however, once this occurs again, please reopen the issue and we'll take a look. Appreciate you highlighting this to us. |
Describe the bug
Pod sandbox creations go into a hanging state (UNKNOWN), where CreateComputeSystem times out during creation and cleanup of the already-disconnected shims continues to fail.
To Reproduce
Steps to reproduce the behavior:
It happens on nodes after they have been running for a while; it could be at a certain point of overload, but we haven't observed very high CPU/memory starvation when it occurs. It sometimes resolves on its own, and the sandboxes get cleaned up and replaced with successfully running ones.
Expected behavior
Timeouts are not expected, and cleanup should happen normally if a timeout does occur.
Configuration:
Additional context
We spotted a case and collected Container Platform ETW streams while it was happening; the logs have been shared with @MikeZappa87 (Michael Zappa).
Here are the observations from the captured instance, where it kept failing to create the sandbox:
ADD for this sandbox @ ~2023-09-25T18:27
The timeout (after 4 mins)
It errors as it fails to delete/clean up the shim
Cleanup error:
Subsequent calls to recreate it fail because the name is still reserved (it seems the failed sandbox has still not been removed)
Attempts to stop the existing sandbox fail as well; it seems it can't clean up the network
It stays in a loop, with StopSandbox and CreateSandbox calls both failing in a similar way, and the sandbox remains in the UNKNOWN state (a node-side inspection sketch follows this list)
Error with timeout (@ 18:31:00):
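To make the loop above concrete, here is a hedged sketch of how the stuck sandbox could be inspected and a manual cleanup attempted from the node, assuming crictl is installed and pointed at containerd's CRI endpoint; the stop/remove calls go through the same StopPodSandbox/RemovePodSandbox paths that are failing here, so they may fail in exactly the same way:

```powershell
# Hypothetical inspection/cleanup attempt for a sandbox stuck as described above
# (assumes crictl is installed and configured against containerd's CRI endpoint).
crictl pods --state notready             # stuck sandboxes show up as not ready

# Substitute the sandbox ID reported in the failure (the one from the log above is used here).
$podSandboxId = '52ad56234561203338af081d24262de1a9b432456c942569f80d2459c40b831f'
crictl inspectp $podSandboxId            # full status, including the network namespace the CNI DEL targets
crictl stopp   $podSandboxId             # retries the same stop path that is failing on network cleanup
crictl rmp     $podSandboxId             # removal frees the reserved sandbox name once the stop succeeds
```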