[Envoy] Envoy proxy healthchecks #922

Open
klapkov opened this issue Mar 25, 2024 · 7 comments

@klapkov
Contributor

klapkov commented Mar 25, 2024

Envoy proxy healthchecks

Summary

We have observed cases where an application is running but does not accept any connections. When we looked into it, the app healthcheck was passing and the Envoy proxy was running as well, but no requests were reaching the app. This leads to the following loop:

  • Gorouter is unable to open a connection to the Diego cell.
  • Gorouter prunes the endpoint.
  • Since the app healthcheck passes, the endpoint gets re-registered.

This is why we started looking into ways to healthcheck the proxy. The best option we currently see is to modify the app healthcheck so that it also checks the proxy. Currently it uses only the app port. We can add a parallel check that does the same through the proxy port. The proxy will then forward the request to the app and we will receive a response. This means twice as many healthcheck requests to the app, but that should not have any significant impact.

This extra check could be enabled with a flag in the executor, so that it is used only when needed.

Please let me know what you think. I believe this topic has been discussed in the past, so maybe someone can give some context on why it was never implemented.

Diego repo

https://github.com/cloudfoundry/executor
https://github.com/cloudfoundry/healthcheck

@Viktor-Velkov

Viktor-Velkov commented Jan 29, 2025

New info:

We are adding an Envoy proxy liveness check. With this new functionality, when Envoy stops accepting TCP connections the health check fails and the app is restarted. The change is implemented in these two PRs:
cloudfoundry/executor#110
#985

The changes were tested in a test environment, where three Envoy TCP liveness healthchecks are visible:

Image

We tested on our environment with the newly implemented Envoy liveness check and an iptables rule on the container side that drops all traffic with destination port 61001 (Envoy), which causes a timeout on the gorouter side.

iptables -A INPUT -p tcp --dport 61001 -j DROP

After applying the iptables rule in the container to drop traffic to destination port 61001, we received the expected error message and the app was restarted, which shows that the newly implemented logic works:

Image

Feedback is highly appreciated.

@mariash
Member

mariash commented Feb 11, 2025

Hi @Viktor-Velkov,

Please provide more information:

  1. What kind of health check do you use for the app: process or port?
  2. Do you think the issue is with the app or with the Envoy proxy not accepting the connection? How do you know this?
  3. Is your desired behavior to restart the app when it fails to accept a connection?

If you use the port health check type, your application should be restarted when it fails to accept a connection on that port.

@Viktor-Velkov

Viktor-Velkov commented Feb 21, 2025

Hi @mariash,

Sorry for the late response; we had to test one more piece of functionality regarding communication between the app and the Envoy proxy.

  1. What kind of health check do you use for the app: process or port?
  • In our case we are focusing on a liveness check specifically for the Envoy proxy, and we are doing a port-based check.
  2. Do you think the issue is with the app or with the Envoy proxy not accepting the connection? How do you know this?
  • The app is in a running state and the app liveness checks (port or HTTP) are passing, but the proxy is not accepting any connections from outside the container, so the app is unreachable. In a very rare scenario the proxy goes into an unresponsive state for some reason. We checked the application logs and the Envoy logs: the app state was "running", but the proxy did not accept connections.
  3. Is your desired behaviour to restart the app when it fails to accept a connection?
  • The desired behaviour is to restart the app if the new liveness check fails (that is, if Envoy becomes unresponsive).
  4. If you use the port health check type, your application should be restarted when it fails to accept a connection on that port.
  • The current health check runs inside the container and bypasses the proxy. There is currently no way of knowing whether the proxy is responding to connections.

@PlamenDoychev

Hi @mariash, @ameowlia,
Do you have any other feedback or comments on this proposal?

@PlamenDoychev

BTW: @Viktor-Velkov I just noticed that we are adding a new spec property. Make sure you add this property to the Windows spec file as well.
I assume the PR as it stands will break the drift tests.

@Viktor-Velkov

I totally forgot about the rep_windows properties! I will add them and test the changes.
Once the tests pass locally, I will add the Windows changes to the PR.

@Viktor-Velkov

All done. The tests are passing, and the two changes regarding rep_windows are now in the PR linked in this issue.

4 participants