Rootless container libpod/tmp/persist directories not cleaned up, fill up tmpfs #25291

Closed · taistejlaiho opened this issue Feb 11, 2025 · 3 comments · Fixed by #25297
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

taistejlaiho commented Feb 11, 2025

Issue Description

We're running Quadlet-based rootless Python/Django containers on separate test and production servers on Ubuntu 22.04, AMD64, with Podman and its dependencies built from source or downloaded as release binaries from GitHub, as applicable. Today, upon doing a CI deploy to the test server, the job failed with this error:

Error: writing to file "/run/user/1001/containers/auth.json": open /run/user/1001/containers/.tmp-auth.json3577526995: no space left on device

The part of the CI job that failed was a container registry login. I went to check on the server and saw that the tmpfs at /run/user/1001 (the UID under which the rootless containers run) was halfway full, occupying about 340 MB. I don't know why the error said that space had run out, but then I don't know the inner workings of tmpfs. The normal space usage should be in the kilobytes, not the hundreds of MBs, so there was clearly a problem.
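
A quick way to check whether the tmpfs is actually short on bytes or on inodes (tens of thousands of tiny directories and files can exhaust the inode limit long before the byte limit); the path assumes UID 1001 as above:

    # byte usage vs. inode usage of the user tmpfs
    df -h /run/user/1001
    df -i /run/user/1001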

Looking closer, the directory /run/user/1001/libpod/tmp/persist had tens of thousands of directories with 64-character hexadecimal names, corresponding to current or past container IDs of our application user. Nearly all of them contained a single 1-byte file called exit with the 0 character, and nothing else. Stopping containers, deleting the dirs and starting containers back up again worked, and the CI job was retried successfully.
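
For anyone checking for the same symptom, here is a rough sketch of the inspection and of the manual cleanup described above (paths assume UID 1001; the unit names are placeholders, and the cleanup should only be run with the containers stopped):

    # count the per-container directories under the persist dir
    ls /run/user/1001/libpod/tmp/persist | wc -l

    # sample a few of them; most contain only a 1-byte "exit" file
    find /run/user/1001/libpod/tmp/persist -maxdepth 2 -name exit | head

    # manual workaround used above: stop the container units, remove the leftovers, start again
    systemctl --user stop <container units>
    rm -rf /run/user/1001/libpod/tmp/persist/*
    systemctl --user start <container units>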

We normally run ten containers on the server 24/7. Upon starting up, all these containers gained a directory there corresponding to their ID. None of them had exit files, which made sense as none had yet exited. When I stopped a container, the exit file would appear, and the directory would stick around. Nothing seemed to be cleaning it up. No error was printed in the journalctl output of the container service regarding the inability to clean up the directory.

Worryingly, more directories and exit files kept being created at a constant rate without me restarting any containers.

Then I remembered that this server is running several containers that are started up via systemd timers to act as cron jobs. They all run successfully to completion based on their journalctl --user -u output, corresponding to 0 in the exit files. And it's these that are really filling up the tmpfs, as they get run hundreds of times every day.
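
Listing the user timers and counting how quickly new directories show up makes the correlation easy to see (again assuming UID 1001; directory mtimes are only a rough proxy for creation time):

    # timer-driven units for this user
    systemctl --user list-timers

    # roughly how many persist directories were touched in the last hour
    find /run/user/1001/libpod/tmp/persist -mindepth 1 -maxdepth 1 -type d -mmin -60 | wc -l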

This is only happening on the test server, not in production, even though both run the same containers (including the systemd-timer-driven ones), the same OS version, and an application user configured the same way.

The salient difference is that the test server runs more up-to-date versions of Podman and its dependencies, whereas production has older versions. So it seems some regression has been introduced in one of the components since Podman 4.8.3 was current.

Test server, exhibiting the issue:

  • Podman 5.2.1
  • conmon 2.1.12
  • netavark 1.12.2
  • aardvark-dns 1.12.1
  • crun 1.16.1

Production server, NOT exhibiting the issue:

  • Podman 4.8.3
  • conmon 2.1.10
  • netavark 1.9.0
  • aardvark-dns 1.9.0
  • crun 1.12

Steps to reproduce the issue

I don't know if this is universally reproducible outside our environment, but:

  1. Run a rootless Quadlet-based container as an unprivileged user with user-mode systemd on an Ubuntu 22.04 AMD64 server with Podman 5.2.1
  2. Watch as directories and files accumulate under /run/user/[uid]/libpod/tmp/persist, corresponding to the IDs of the containers even after they exit, eventually filling the tmpfs (a rough sanity-check sketch follows below).
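
A rough, non-Quadlet sanity check, assuming the leak affects any rootless container run on the listed versions (image name and paths are only illustrative):

    # run and immediately remove a throwaway rootless container
    podman run --rm docker.io/library/alpine:latest true

    # on affected versions, a 64-character hex directory for that container remains behind
    ls /run/user/$(id -u)/libpod/tmp/persist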

Describe the results you received

Containers have their /run/user/[uid]/libpod/tmp/persist/[container ID] tmpfs dirs left over after exiting successfully (exit code 0).

Describe the results you expected

Any directories and files created under /run/user/[uid]/libpod/tmp/persist should be cleaned up as containers exit.

podman info output

Note: this is from the test server. The production server didn't seem to have relevant differences outside component versions.

host:
  arch: amd64
  buildahVersion: 1.37.1
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: Unknown
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.1.12, commit: unknown'
  cpuUtilization:
    idlePercent: 92.85
    systemPercent: 1.15
    userPercent: 6
  cpus: 4
  databaseBackend: boltdb
  distribution:
    codename: jammy
    distribution: ubuntu
    version: "22.04"
  eventLogger: journald
  freeLocks: 2037
  hostname: <redacted>-staging
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
  kernel: 5.15.0-130-generic
  linkmode: dynamic
  logDriver: journald
  memFree: 1377988608
  memTotal: 8322985984
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: Unknown
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.12.1
    package: Unknown
    path: /usr/libexec/podman/netavark
    version: netavark 1.12.2
  ociRuntime:
    name: crun
    package: Unknown
    path: /usr/bin/crun
    version: |-
      crun version 1.16.1
      commit: afa829ca0122bd5e1d67f1f38e6cc348027e3c32
      rundir: /run/user/1001/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /usr/local/bin/pasta
    package: Unknown
    version: |
      pasta unknown version
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/user/1001/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: ""
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 312365056
  swapTotal: 536866816
  uptime: 1187h 1m 45.00s (Approximately 49.46 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  docker.io:
    Blocked: false
    Insecure: false
    Location: docker.io
    MirrorByDigestOnly: false
    Mirrors: null
    Prefix: docker.io
    PullFromMirror: ""
  search:
  - docker.io
store:
  configFile: /home/appuser/.config/containers/storage.conf
  containerStore:
    number: 10
    paused: 0
    running: 10
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/appuser/.local/share/containers/storage
  graphRootAllocated: 168488570880
  graphRootUsed: 37740863488
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 19
  runRoot: /tmp/containers-user-1001/containers
  transientStore: false
  volumePath: /home/appuser/.local/share/containers/storage/volumes
version:
  APIVersion: 5.2.1
  Built: 1724236924
  BuiltTime: Wed Aug 21 13:42:04 2024
  GitCommit: ""
  GoVersion: go1.22.5
  Os: linux
  OsArch: linux/amd64
  Version: 5.2.1

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

No

Additional environment details

Linode VPS.

Additional information

No response

taistejlaiho (Author) commented:

This discussion covered a similar, though not identical, issue in 2023 with no resolution, but just last week the user @gee456 posted in its comments that they had hit this exact issue of the same directory filling up.

mheon (Member) commented Feb 11, 2025

It looks like it's the exit directories from Conmon, which for some reason aren't being cleaned up. We should probably be removing them when we clean up the OCI runtime.

mheon (Member) commented Feb 11, 2025

#25297 to fix.

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/podman that referenced this issue Feb 11, 2025
This seems to have been added as part of the cleanup of our
handling of OOM files, but code was never added to remove it, so
we leaked a single directory with an exit file and OOM file per
container run. Apparently we have been doing this for a while - I'd
guess since March of '23 - so I'm surprised more people didn't
notice.

Fixes containers#25291

Signed-off-by: Matt Heon <mheon@redhat.com>
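
With a build that includes that change, the same check as in the reproduction sketch above should come back clean (assumption: the per-container directory is now removed as part of container cleanup):

    podman run --rm docker.io/library/alpine:latest true
    # expected after the fix: no leftover directory for the exited container
    ls /run/user/$(id -u)/libpod/tmp/persist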