mcr.microsoft.com/windows/servercore:10.0.20348.2340 BSODs Windows 2022 10.0.20348.1726, mcr.microsoft.com/windows/servercore:10.0.20348.1970 does not #502

Open
doctorpangloss opened this issue May 24, 2024 · 24 comments
Labels: bug (Something isn't working), Windows on Kubernetes (Windows Containers using Kubernetes)

@doctorpangloss

doctorpangloss commented May 24, 2024

Describe the bug
The latest mcr.microsoft.com/windows/servercore:10.0.20348.2340 BSODs (crashes) a Windows 2022 10.0.20348.1726 host.

This bug is about the latest images running on a non-latest host.

PS C:\Users\Administrator> Get-WinEvent -FilterHashtable @{LogName='System'; Id=1001; StartTime=[datetime]::Today} |
>>     ForEach-Object {
>>         [PSCustomObject]@{
>>             TimeCreated = $_.TimeCreated
>>             ProviderName = $_.ProviderName
>>             EventID = $_.Id
>>             DiagnosticID = $_.Properties[2].Value
>>             Message = ($_.Properties[0].Value -join " ")
>>         }
>>     } | Format-Table -AutoSize

TimeCreated           ProviderName                               EventID DiagnosticID                         Message
-----------           ------------                               ------- ------------                         -------
5/23/2024 10:00:38 PM Microsoft-Windows-WER-SystemErrorReporting    1001 fed6e402-0998-4987-a650-fda41d8ca074 0x0000000a (0x000000000000004c, 0x0000000000000002, 0x0000000000000001, 0xfffff8007f90b8ba)
5/23/2024 8:42:59 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 1122c493-02e4-4ad6-95a4-ceae345a1f2d 0x0000000a (0x0000029985891047, 0x0000000000000002, 0x0000000000000001, 0xfffff8063d30b8ba)
5/23/2024 8:26:54 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 9bdd89a7-57a3-435f-8854-c350170bcd70 0x0000003b (0x00000000c0000005, 0xfffff80134d0b8ba, 0xffffe60059073900, 0x0000000000000000)
5/23/2024 8:23:22 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 e5b56d98-e475-4976-8c5c-692334d986c1 0x0000000a (0x000000000000004c, 0x0000000000000002, 0x0000000000000001, 0xfffff8040cf0b8ba)
5/23/2024 8:04:51 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 bf75bc45-200e-407e-8611-02f55d16a4db 0x0000001e (0xffffffffc0000005, 0xfffff8041b828777, 0x0000000000000000, 0xffffffffffffffff)
5/23/2024 8:02:59 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 3aaf6864-d78e-4c3f-9ad6-001b6e2552c6 0x0000000a (0x000000000000004c, 0x0000000000000002, 0x0000000000000001, 0xfffff8064a50b8ba)
5/23/2024 7:54:08 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 f1dee8f5-52da-4899-a782-56e1aac43847 0x0000000a (0x00000000005fe047, 0x0000000000000002, 0x0000000000000001, 0xfffff80738f0b8ba)
5/23/2024 7:49:59 PM  Microsoft-Windows-WER-SystemErrorReporting    1001 c96ec495-4215-4def-903b-48262b4ef468 0x0000003b (0x00000000c0000005, 0xfffff8052830b8ba, 0xffffd380c81f3900, 0x0000000000000000)
Diagnostic ID                        Explanation
-------------                        -----------
fed6e402-0998-4987-a650-fda41d8ca074 0x0000000a: IRQL_NOT_LESS_OR_EQUAL (Memory access violation)
1122c493-02e4-4ad6-95a4-ceae345a1f2d 0x0000000a: IRQL_NOT_LESS_OR_EQUAL (Memory access violation)
9bdd89a7-57a3-435f-8854-c350170bcd70 0x0000003b: SYSTEM_SERVICE_EXCEPTION (General system error)
e5b56d98-e475-4976-8c5c-692334d986c1 0x0000000a: IRQL_NOT_LESS_OR_EQUAL (Memory access violation)
bf75bc45-200e-407e-8611-02f55d16a4db 0x0000001e: KMODE_EXCEPTION_NOT_HANDLED (Kernel mode error)
3aaf6864-d78e-4c3f-9ad6-001b6e2552c6 0x0000000a: IRQL_NOT_LESS_OR_EQUAL (Memory access violation)
f1dee8f5-52da-4899-a782-56e1aac43847 0x0000000a: IRQL_NOT_LESS_OR_EQUAL (Memory access violation)
c96ec495-4215-4def-903b-48262b4ef468 0x0000003b: SYSTEM_SERVICE_EXCEPTION (General system error)
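For anyone triaging a similar event log, the mapping from the Message column to the stop-code names in the table can be sketched in a few lines of Python. This is only an illustration: the function names `parse_bugcheck_message` and `bugcheck_name` are made up for this example, and the lookup table covers only the three stop codes that appear in this issue.

```python
import re

# Only the three stop codes seen in this issue; anything else is reported as unknown.
BUGCHECK_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x3B: "SYSTEM_SERVICE_EXCEPTION",
}

# Matches strings shaped like "0x0000000a (0x..., 0x..., 0x..., 0x...)".
_MESSAGE_RE = re.compile(r"^\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)\s*$")


def parse_bugcheck_message(message):
    """Split a WER 1001 message into (stop_code, (param1, ..., paramN))."""
    match = _MESSAGE_RE.match(message)
    if match is None:
        raise ValueError("unrecognised bugcheck message: %r" % message)
    code = int(match.group(1), 16)
    params = tuple(int(p.strip(), 16) for p in match.group(2).split(","))
    return code, params


def bugcheck_name(code):
    """Return the symbolic name for a stop code, or a placeholder if unknown."""
    return BUGCHECK_NAMES.get(code, "UNKNOWN_0x%08X" % code)


if __name__ == "__main__":
    line = ("0x0000000a (0x000000000000004c, 0x0000000000000002, "
            "0x0000000000000001, 0xfffff8007f90b8ba)")
    code, params = parse_bugcheck_message(line)
    print(bugcheck_name(code), [hex(p) for p in params])
```

The input string in the usage section is copied from the first row of the event log above.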
STACK_TEXT:
ffff9009`8a89e368 fffff800`7fa33a69 : 00000000`0000000a 00000000`0000004c 00000000`00000002 00000000`00000001 : nt!KeBugCheckEx
ffff9009`8a89e370 fffff800`7fa2f24c : ffff9009`8a89e800 00000000`00000000 ffff940c`7c15e478 fffff800`7f86526c : nt!setjmpex+0x9269
ffff9009`8a89e4b0 fffff800`7f90b8ba : ffffa98e`5a058a20 fffff800`7f84faeb 00000000`00000000 00000000`00000024 : nt!setjmpex+0x4a4c
ffff9009`8a89e640 fffff800`7f96394a : 00000000`00000000 ffff9009`00000000 00000000`00000000 ffffa98e`5a058a80 : nt!ExTryAcquireSpinLockExclusiveAtDpcLevel+0x3a
ffff9009`8a89e670 fffff800`7f963897 : 00000000`00000000 00000000`00000008 ffffa98e`77664520 ffff9009`8a89ea90 : nt!FsRtlChangeBackingFileObject+0xca
ffff9009`8a89e6b0 fffff800`81ea8c95 : 00000000`00000000 00000000`00000000 ffffa98e`5a0581b0 ffff9009`8a89ea90 : nt!FsRtlChangeBackingFileObject+0x17
ffff9009`8a89e6e0 fffff800`81ea2792 : ffffa98e`757a4010 ffff9009`8a89ea90 ffffa98e`757a4010 00000000`00000000 : Ntfs+0xe8c95
ffff9009`8a89e980 fffff800`7f9031f5 : ffffa98e`5a058030 ffffa98e`757a4010 ffff9009`8a89ec00 ffffa98e`77664520 : Ntfs+0xe2792
ffff9009`8a89ec00 fffff800`7b7767df : ffffa98e`77664500 ffff9009`8a89ecf0 ffff9009`8a89ecf9 fffff800`7b775463 : nt!IofCallDriver+0x55
ffff9009`8a89ec40 fffff800`7b7a95e4 : ffff9009`8a89ecf0 ffffa98e`757a43f8 ffffa98e`59c7fd20 00000000`00000000 : FLTMGR!FltIsCallbackDataDirty+0x40f
ffff9009`8a89ecb0 fffff800`7f9031f5 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : FLTMGR!FltQueryInformationFile+0x9c4
ffff9009`8a89ed60 fffff800`7fcff276 : ffffa98e`757a4440 00000000`00000000 ffff9009`8a89f001 00000000`00001040 : nt!IofCallDriver+0x55
ffff9009`8a89eda0 fffff800`7fd98887 : 00000000`00000000 ffffa98e`5e933a20 a98e6562`4490d2bd ffffa98e`656244c0 : nt!SePrivilegeCheck+0x1a76
ffff9009`8a89ef60 fffff800`7fc5b215 : fffff800`7fd987c0 ffff9009`8a89f0d0 ffffa98e`551f9400 ffffa98e`656244c0 : nt!NtSetSecurityObject+0xab7
ffff9009`8a89efd0 fffff800`7fc5a6b1 : 00000000`00000000 ffff9009`8a89f200 00000000`00001040 ffffa98e`551f9400 : nt!ObOpenObjectByNameEx+0xd55
ffff9009`8a89f170 fffff800`7fcd0cc1 : 00000000`00000000 00000000`00000000 ffffa98e`5e933a20 000000c0`02299d58 : nt!ObOpenObjectByNameEx+0x1f1
ffff9009`8a89f2a0 fffff800`7fcd0469 : 000000c0`02299d10 00000000`00100080 000000c0`02299d58 000000c0`02299d20 : nt!NtCreateFile+0x8d1
ffff9009`8a89f360 fffff800`7fa33185 : 00000000`00000126 000000c0`00bc56c0 000000c0`00680000 00000000`00000000 : nt!NtCreateFile+0x79
ffff9009`8a89f3f0 00007ff8`5005ff14 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!setjmpex+0x8985
00000043`8c3ffaf8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007ff8`5005ff14

If you tell me how to invoke kd in a way that downloads symbols I can show the whole memory dump.

To Reproduce

This deployment will cause a blue screen:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: x
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: x
  template:
    metadata:
      labels:
        app: x
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
        - name: some-container
          image: mcr.microsoft.com/windows/servercore:10.0.20348.2340
          securityContext:
            windowsOptions:
              runAsUserName: ContainerAdministrator
          command:
            - "C:/Windows/System32/WindowsPowerShell/v1.0/powershell.exe"
            - "-Command"
          args:
            - |
              $test = "1"

on a Windows 10.0.20348.1726 node.

On Windows 2022 10.0.20348.2227 + Docker, this does not reproduce.

Expected behavior
It shouldn't crash.

Configuration:

  • Edition: Windows Server 2022 Data Center
  • Base Image being used: Windows Server Core
  • Container engine: containerd 1.7.16 (no impact compared to 1.7.0)
  • kubernetes 1.26.2

Additional context
I am using this version of Windows due to projectcalico/calico#8529 and cannot update until it is fixed.

@doctorpangloss doctorpangloss added bug Something isn't working triage New and needs attention labels May 24, 2024
@jsturtevant

Just pointing out that this appears to be an issue when using older WS host and new WS image layers. I missed that the first time I read through it.

@ntrappe-msft ntrappe-msft added the Windows on Kubernetes Windows Containers using Kubernetes label May 24, 2024
@avin3sh

avin3sh commented May 29, 2024

In our case we are seeing BSOD with Windows Server 2022 Worker Nodes on build 10.0.20348.2461 running a PowerShell image built from 10.0.20348.2322 (the tag below corresponds to sha256:fef9ce2b93ad3b09bd51f60bba3476fafd4d9dc46260de9aff6e5aff4bd142f5).

This implies that the problem also affects hosts whose WS build version is greater than the image's WS build version!

To reproduce this, on a worker node that isn't running any other containers (this somehow makes a BSOD more likely), create a pod with the following container spec:

      containers:
      - image: mcr.microsoft.com/powershell:lts-nanoserver-ltsc2022
        imagePullPolicy: IfNotPresent
        command: ["pwsh.exe"]
        args: ["-noprofile", "-noninteractive", "-executionpolicy", "bypass", "-command", "[System.Threading.Thread]::Sleep([System.Threading.Timeout]::Infinite)"]
        name: node-ds
        resources:
          limits:
            cpu: 1
            ephemeral-storage: 512M
            memory: 1G
          requests:
            cpu: 250m
            ephemeral-storage: 128M
            memory: 512M
        securityContext:
          runAsNonRoot: true

The worker node does not crash immediately but after a few days.

@ntrappe-msft ntrappe-msft removed the triage New and needs attention label Jun 11, 2024
@Howard-Haiyang-Hao
Contributor

@doctorpangloss and @avin3sh, I wonder if you could share the crash dump files with me. Thanks!

@doctorpangloss
Author

Please advise how to send the crash dumps. I appreciate you investigating this issue.

@Howard-Haiyang-Hao
Contributor

I have the following instructions for sharing a large file through Azure storage. Please let me know if this approach works for you:

How to Use Azure Blob Storage (1).docx

Thank you!

@doctorpangloss
Author

> I have the following instructions for sharing a large file through Azure storage. Please let me know if this approach works for you:
>
> How to Use Azure Blob Storage (1).docx
>
> Thank you!

Thanks for this. Memory dumps from this machine will contain sensitive information. I have an AWS presigned URL I can share. What's the best way to send it to you?

@Howard-Haiyang-Hao
Contributor

hhao@microsoft.com

@doctorpangloss
Author

> hhao@microsoft.com

sent

@Howard-Haiyang-Hao
Contributor

Thank you for sharing the dump file. I’ll keep you posted.

@Howard-Haiyang-Hao
Contributor

The bugcheck was raised by the instruction lock cmpxchg dword ptr [rbx],ecx, where rbx was set to 000000000000004c. Dereferencing [rbx] at that invalid address triggered the bugcheck. The session object appears to have been corrupted. It would be beneficial if you could test this scenario with a more recent build.

17: kd> vertarget
Windows 10 Kernel Version 20348 MP (32 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
Edition build lab: 20348.859.amd64fre.fe_release_svc_prod2.220707-1832
Kernel base = 0xfffff8007f600000 PsLoadedModuleList = 0xfffff80080233e00
Debug session time: Thu May 23 21:59:51.989 2024 (UTC - 7:00)
System Uptime: 0 days 1:17:02.015

17: kd> kc
 # Call Site
00 nt!KeBugCheckEx
01 nt!KiBugCheckDispatch
02 nt!KiPageFault
03 nt!ExTryAcquireSpinLockExclusiveAtDpcLevel
04 nt!MiTryAcquireSpinLockExclusiveAtDpc
05 nt!MiTryLockControlAreaExclusiveAtDpc
06 nt!MmChangeSectionBackingFile
07 nt!FsRtlChangeBackingFileObject
08 Ntfs!NtfsUpdateBackingFileObject
09 Ntfs!NtfsCommonCreate
0a Ntfs!NtfsFsdCreate
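The analysis above lines up with the four parameters logged for the first crash in the event log. As a rough sketch, and assuming the documented parameter layout for stop code 0xA (referenced address, IRQL, access type, faulting instruction address), they can be decoded as follows. The function name `describe_irql_not_less_or_equal` is invented for this example, and the IRQL table lists only the lowest x64 levels.

```python
# Lowest x64 IRQL values; higher levels are left as plain numbers.
IRQL_NAMES = {0: "PASSIVE_LEVEL", 1: "APC_LEVEL", 2: "DISPATCH_LEVEL"}


def describe_irql_not_less_or_equal(params):
    """Turn the four 0xA bugcheck parameters into a readable dict.

    Assumed layout: (referenced address, IRQL, access type, instruction address).
    """
    address, irql, access, instruction = params
    if access == 0:
        operation = "read"
    elif access == 1:
        operation = "write"
    else:
        # Values other than 0 and 1 are treated as execute in this sketch.
        operation = "execute"
    return {
        "referenced_address": hex(address),
        "irql": IRQL_NAMES.get(irql, str(irql)),
        "operation": operation,
        "instruction_address": hex(instruction),
    }


if __name__ == "__main__":
    # Parameters of the 10:00:38 PM crash from the event log in the issue description.
    info = describe_irql_not_less_or_equal(
        (0x4C, 0x2, 0x1, 0xFFFFF8007F90B8BA)
    )
    for key, value in info.items():
        print(key, "=", value)
```

Read this way, the first crash is a write to address 0x4c at DISPATCH_LEVEL, which matches a lock cmpxchg through rbx = 0x4c, and the instruction address matches the nt!ExTryAcquireSpinLockExclusiveAtDpcLevel+0x3a frame in the original stack text.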

@avin3sh

avin3sh commented Jul 2, 2024

> It would be beneficial if you could test this scenario with a more recent build.

Do you mean testing this on a host with the latest CU, or has a new nanoserver image been released that should address this? I see the latest nanoserver:lts2022-amd64 image was created on 2024-06-19, which appears to be out of band, since images generally seem to be created on the first Tuesday of the month. Could you confirm whether this image is intended to fix the problem observed here?

@avin3sh

avin3sh commented Jul 2, 2024

Never mind, the June 19/20 image seems to address something else: https://support.microsoft.com/en-au/topic/june-20-2024-kb5041054-os-build-20348-2529-out-of-band-b746ffbd-934e-42ac-9c66-ed0636edf7f1 (unless it is related to the problem described here).

I am still curious which particular version addresses the problem. My comment in #502 (comment) referred to a test done on a host with the latest CU and the latest image.

@Howard-Haiyang-Hao
Contributor

@avin3sh, could you please provide the crash dump if it's available? I'm interested in determining if it's the same issue that @doctorpangloss encountered.

@avin3sh

avin3sh commented Jul 3, 2024

I have updated my setup to use the latest windows/servercore@sha256:97bc51b0ec25220856ac6351d6f3f81983aaf8623141e0320ca417d6ef2ad89c image (published on June 20). Please give me a few days to get back to you with the minidump file.

If I don't see the nodes crashing then the issue would be with KB5036909 / build 20348.2402 or earlier images.

@doctorpangloss
Author

doctorpangloss commented Jul 10, 2024

I can confirm that on these hosts:

10.0.20348.1547
10.0.20348.1607
10.0.20348.1726
10.0.20348.1787
10.0.20348.1906
10.0.20348.1970
10.0.20348.2227

containers based on mcr.microsoft.com/windows/servercore:10.0.20348.2529 do not crash.

@avin3sh

avin3sh commented Jul 11, 2024

We saw a BSOD again today on a worker node whose host had the 10.0.20348.2527 CU applied. However, this is a different Kubernetes cluster with a mix of workloads, including a servercore workload whose image was built from KB5034770 / 20348.2322 and that was running on this particular host.

I am going to email you the crash minidump file, @Howard-Haiyang-Hao, at hhao@microsoft.com, as it is sensitive and I do not want to share it here.

However, it looks like without knowing which combination of host build and image build triggers this problem, there is no easy way to mitigate it in a production scenario. Forcing the image of every running workload to be rebuilt with the latest is not ideal.
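Until the affected combinations are known, one thing that can at least be automated is recording how each running image's build relates to the host build. The Python sketch below only classifies that relationship. It does not predict crashes, since the reports in this thread include both images newer than the host and images older than the host. The function names `parse_build` and `compare_host_and_image` are hypothetical helpers for this example.

```python
def parse_build(version):
    """Return (build, revision) from '10.0.20348.2340' or '20348.2340'."""
    numbers = tuple(int(part) for part in version.strip().split("."))
    if len(numbers) == 4:
        numbers = numbers[2:]  # drop the leading major.minor ("10.0")
    if len(numbers) != 2:
        raise ValueError("expected 2 or 4 dotted fields: %r" % version)
    return numbers


def compare_host_and_image(host_version, image_version):
    """Classify the image build relative to the host build."""
    host_build, host_rev = parse_build(host_version)
    image_build, image_rev = parse_build(image_version)
    if host_build != image_build:
        return "different-build"
    if image_rev > host_rev:
        return "image-newer"
    if image_rev < host_rev:
        return "image-older"
    return "same"


if __name__ == "__main__":
    # (host, image) pairs taken from reports in this thread.
    reported = [
        ("10.0.20348.1726", "10.0.20348.2340"),
        ("10.0.20348.2461", "10.0.20348.2322"),
        ("10.0.20348.2527", "20348.2322"),
    ]
    for host, image in reported:
        print(host, image, compare_host_and_image(host, image))
```

Logging this classification alongside each crash would make it easier to see whether the failures cluster around a particular host and image pairing.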

@avin3sh

avin3sh commented Jul 16, 2024

I saw another crash recently. The call site info is exactly the same as in the crash dump file I shared with you earlier:

4: kd> kc
 # Call Site
00 nt!KeBugCheckEx
01 nt!ExpReleaseResourceSharedForThreadLite
02 nt!ExReleaseResourceLite
03 Ntfs!NtfsReleaseForCreateSection
04 nt!FsRtlReleaseFile
05 nt!MiShareExistingControlArea
06 nt!MiCreateImageOrDataSection
07 nt!MiCreateSection
08 nt!MmCreateSpecialImageSection
09 nt!NtCreateUserProcess
0a nt!KiSystemServiceCopyEnd
0b 0x0

What's frightening is that the 20348.2527-based image that I thought might be the source of the problem was not running on this host. The only two images running were based on 20348.2227 (January CU) and 20348.2529 (June CU). @Howard-Haiyang-Hao, let me know if you need any additional details from me, or if there is a workaround you can share to prevent the Windows workers from crashing.

@avin3sh

avin3sh commented Aug 5, 2024

@Howard-Haiyang-Hao, did you get a chance to look further into this?

Because of a Windows gMSA issue in #405 (comment) we have been forced to run container images built with March 2024 CU or prior. Without having more details on reasons behind these mysterious crashes, or the original issue getting resolved, we run the risk of unpredictable crashes in our production cluster.

Contributor

This issue has been open for 30 days with no updates.
@Howard-Haiyang-Hao, please provide an update or close this issue.

@avin3sh

avin3sh commented Sep 10, 2024

Any update here, @Howard-Haiyang-Hao?

Contributor

This issue has been open for 30 days with no updates.
@Howard-Haiyang-Hao, please provide an update or close this issue.

3 similar comments
