mcr.microsoft.com/windows/servercore:10.0.20348.2340 BSODs Windows 2022 10.0.20348.1726, mcr.microsoft.com/windows/servercore:10.0.20348.1970 does not #502
Comments
Just pointing out that this appears to be an issue when using older WS host and new WS image layers. I missed that the first time I read through it. |
In our case we are seeing BSOD with Windows Server 2022 Worker Nodes on build This implies this also affects hosts where WS build version is greater than image WS build version! To reproduce this, on a worker node that isn't running any other containers (this somehow makes BSOD more likely), create a pod with the following container spec:
containers:
- image: mcr.microsoft.com/powershell:lts-nanoserver-ltsc2022
  imagePullPolicy: IfNotPresent
  command: ["pwsh.exe"]
  args: ["-noprofile", "-noninteractive", "-executionpolicy", "bypass", "-command", "[System.Threading.Thread]::Sleep([System.Threading.Timeout]::Infinite)"]
  name: node-ds
  resources:
    limits:
      cpu: 1
      ephemeral-storage: 512M
      memory: 1G
    requests:
      cpu: 250m
      ephemeral-storage: 128M
      memory: 512M
  securityContext:
    runAsNonRoot: true
The worker node does not crash immediately but after a few days. |
@doctorpangloss and @avin3sh, I wonder if you could share the crash dump files with me. Thanks! |
Please advise how to send the crash dumps. I appreciate you investigating this issue. |
I have the following instructions for sharing a large file through Azure storage. Please let me know if this approach works for you: How to Use Azure Blob Storage (1).docx Thank you! |
Thanks for this. Memory dumps from this machine will contain sensitive information. I have an AWS presigned URL I can share; what's the best way to send it to you? |
sent |
Thank you for sharing the dump file. I’ll keep you posted. |
The bugcheck occurred due to the operation
17: kd> vertarget
17: kd> kc
Call Site
00 nt!KeBugCheckEx |
Do you mean testing this on a host with the latest CU, or is there a new nanoserver image that has been released that should address this? I see the latest |
Never mind, the June 19/20 image seems to address something else: https://support.microsoft.com/en-au/topic/june-20-2024-kb5041054-os-build-20348-2529-out-of-band-b746ffbd-934e-42ac-9c66-ed0636edf7f1 (unless it is related to the problem described here). I am still curious which particular version addresses the problem. My comment in #502 (comment) referred to a test done on a host with the latest CU and the latest image. |
@avin3sh, could you please provide the crash dump if it's available? I'm interested in determining if it's the same issue that @doctorpangloss encountered. |
I have updated my setup to use the latest If I don't see the nodes crashing then the issue would be with |
I can confirm that on these hosts:
containers based on |
We saw BSOD again today on a worker node where the host had I am going to email you the crash minidump file @Howard-Haiyang-Hao, on However, it looks like without knowing what combination of host build and image build triggers this problem, there is no easy way to mitigate it in a production scenario -- forcing the image of every running workload to be rebuilt with the latest is non-ideal. |
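One way to at least record which host and image builds are involved in each crash is to compare the build numbers directly. The following is a minimal Python sketch under stated assumptions: the helper names `parse_build` and `compare_builds` are hypothetical and not part of any Microsoft tool, and the classification does not predict whether a given pair will crash, since reports in this thread involve both older and newer hosts.

```python
from typing import Tuple


def parse_build(version: str) -> Tuple[int, ...]:
    """Parse a build string such as '10.0.20348.2340' into a tuple of ints.

    An image reference such as
    'mcr.microsoft.com/windows/servercore:10.0.20348.2340' is also accepted;
    only the tag after the last colon is used.
    """
    tag = version.rsplit(":", 1)[-1]
    parts = tag.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a four-part build number: {version!r}")
    return tuple(int(p) for p in parts)


def compare_builds(host: str, image: str) -> str:
    """Classify an image build relative to a host build (hypothetical labels)."""
    h, i = parse_build(host), parse_build(image)
    if h[:3] != i[:3]:
        return "different-release"  # e.g. 20348 vs. another major build
    if i[3] > h[3]:
        return "image-newer"
    if i[3] < h[3]:
        return "image-older"
    return "same"


# Host and image builds taken from the issue title.
print(compare_builds(
    "10.0.20348.1726",
    "mcr.microsoft.com/windows/servercore:10.0.20348.2340",
))
```

Logging this label next to each crash report could help narrow down which combinations are affected, but it is only a bookkeeping aid.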
I saw another crash recently; the call site info is exactly the same as in the crash dump file I shared earlier with you
What's frightening is that the |
@Howard-Haiyang-Hao did you get a chance to look further into this? Because of a Windows gMSA issue in #405 (comment) we have been forced to run container images built with March 2024 CU or prior. Without more details on the reasons behind these mysterious crashes, or the original issue getting resolved, we run the risk of unpredictable crashes in our production cluster. |
This issue has been open for 30 days with no updates. |
Any update here, @Howard-Haiyang-Hao? |
Describe the bug
The latest mcr.microsoft.com/windows/servercore:10.0.20348.2340 BSODs (crashes) a Windows 2022 10.0.20348.1726 host.
This bug is about the latest images running on a non-latest host.
If you tell me how to invoke kd in a way that downloads symbols I can show the whole memory dump.
To Reproduce
This deployment will cause a blue screen:
on a Windows 10.0.20348.1726 node.
On Windows 2022 10.0.20348.2227 + Docker, this does not reproduce.
Expected behavior
It shouldn't crash.
Configuration:
Additional context
I am using this version of Windows due to projectcalico/calico#8529 and cannot update until it is fixed.
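On the question above about invoking kd so that it downloads symbols: a common approach is to pass a symbol path that points at Microsoft's public symbol server with `-y` and the dump file with `-z`. The sketch below only builds that command line in Python. The helper name `build_kd_command`, the cache directory `C:\symbols`, and the dump path are assumptions for illustration; adjust them to the actual machine.

```python
from typing import List

# Microsoft's public symbol server.
MS_SYMBOL_SERVER = "https://msdl.microsoft.com/download/symbols"


def build_kd_command(dump_path: str, cache_dir: str = r"C:\symbols") -> List[str]:
    """Build a kd command line that loads a dump and fetches symbols on demand.

    cache_dir is an assumed local folder where downloaded symbols are cached.
    """
    symbol_path = f"srv*{cache_dir}*{MS_SYMBOL_SERVER}"
    return [
        "kd",
        "-y", symbol_path,                     # symbol path (server + local cache)
        "-z", dump_path,                       # crash dump file to open
        "-c", "vertarget; !analyze -v; kc; q", # run these commands, then quit
    ]


# Assumed dump location; the default kernel dump path is used as an example.
cmd = build_kd_command(r"C:\Windows\MEMORY.DMP")
print(" ".join(cmd))
# On a Windows machine with the Debugging Tools installed, the list could be
# passed to subprocess.run(cmd, check=True) to capture the analysis output.
```

Inside an interactive kd session, `.symfix` followed by `.reload` sets the same Microsoft symbol server path.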