Aggressive health checks giving false negatives #229
Replies: 4 comments
-
Can you post the Traefik logs? The new health check approach probably won't help here since that's just for startup.
-
Sure, let me know if you need anything else. I've attached a portion of the log that starts with the server healthy, followed by the failed health check and server restart, and ends with the server healthy again after the restart. The interesting line is copied below: the check failed because the "server closed idle connection". For other failures, the reason is sometimes "connection refused". The server usually runs for somewhere between 2 and 12 hours before one of these failures.
I'm curious whether it's possible to configure Traefik not to kill the container immediately, but instead to check whether it passes a subsequent health check a second later. I'm not sure how to do that, but it would be an interesting experiment.
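Something like Docker's standard health-check flags might be the relevant knob here, assuming it's the container-level check (rather than Traefik itself) that triggers the restart. This is only a sketch; the image name, port, and `/up` path are placeholders:

```sh
# Sketch: tolerate a few consecutive failures before the container is marked
# unhealthy. Assumes the container-level health check is what triggers the
# restart; image name, port, and /up path are placeholders.
docker run -d --name myapp \
  --health-cmd "curl -f http://localhost:3000/up || exit 1" \
  --health-interval 5s \
  --health-timeout 5s \
  --health-retries 3 \
  --health-start-period 10s \
  myapp:latest
```

With `--health-retries 3`, a single slow or dropped response wouldn't immediately flip the container to unhealthy; it would take three consecutive failures.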
-
I think one possibility could also be not to use Traefik at all on tiny installations. We only use Traefik to seal the gap on deploys. But maybe for installations that small, it's not necessary. And removing Traefik entirely, even if that means accepting a short stoppage on deploy, could be worth it.
-
Inspecting the logs a little more closely, I noticed an exit code of 137 when the container goes down, which indicates the container was killed (SIGKILL), typically by the OOM killer when memory runs out. That makes a lot more sense, and I guess I was flying a little too close to the sun on a 512 MB droplet. Even though the machine showed ~20% memory free, I'm guessing the container management process may be a little stricter with its allocations. Running only the Rails container without the Traefik container presumably freed up enough memory to avoid tripping the shutdown. So I apologize for the false alarm. I've bumped up to a 1 GB VM and will report back if that doesn't fix the problem.
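One way to confirm it was an OOM kill rather than the health check (the container name here is a placeholder):

```sh
# Docker records whether the kernel's OOM killer terminated the container.
# Note: a host-level OOM kill (as opposed to hitting a cgroup memory limit)
# can still show OOMKilled=false even with exit code 137.
docker inspect --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' myapp

# Snapshot of current per-container memory usage.
docker stats --no-stream
```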
-
I'm using an MRSK-deployed app on DigitalOcean with a low-spec droplet. Looking at the droplet metrics, I'm still well under the maximums on RAM/CPU/disk. I've noticed that my app container gets shut down by the Traefik health monitor every few hours due to a failure to connect to the uptime check, which runs every second. Looking at my app container logs, there is nothing suggesting a problem occurred within the Rails app or the container itself. I think it simply failed to respond to the uptime check within the timeframe Traefik expects and was immediately shut down. This leads to a secondary issue: the server.pid file still exists in the container, so it ends up in an endless restart loop. I've since modified the entrypoint to clean up the existing pid file if present.
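For reference, the entrypoint change is the usual Rails-in-Docker fix. This is a sketch; the pid path depends on the image's working directory:

```sh
#!/bin/sh
set -e

# Remove a stale PID file left over from an unclean shutdown; otherwise Rails
# refuses to boot with "A server is already running". Adjust the path to the
# image's WORKDIR.
rm -f /app/tmp/pids/server.pid

# Hand off to the container's CMD (e.g. the Rails server).
exec "$@"
```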
To check whether something problematic was actually happening inside the app container, I took Traefik out of the equation and ran the container by itself with its port exposed directly. The app has remained up and working with no interruptions or shutdowns for days. This leads me to believe the container was being shut down by Traefik even though it was probably fine, albeit perhaps too slow to respond to the uptime check every so often. This may just be a side effect of using low-powered VPSes on shared resources.
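For anyone who wants to reproduce the experiment, it was roughly this (image name, port, and env are placeholders):

```sh
# Run the app container directly with its port exposed: no Traefik in front
# and no health check, so nothing can restart it behind my back.
docker run -d --name myapp-direct \
  -p 80:3000 \
  -e RAILS_ENV=production \
  myapp:latest
```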
Recently, it looks like there is a new PR to change the health check process:
#219
Maybe this change will alleviate the issue I've described above. If not, we might want to look into making the health check configurable, to allow for slower response times or to tolerate the occasional failure on slow hardware.