
[PROD] Performance issue - extremely slow loading of riskwatch data on GO - please investigate #2184

Closed
nanometrenat opened this issue Jun 26, 2024 · 9 comments

@nanometrenat
Contributor

Issue

Risk Watch API calls are taking much too long - around two whole minutes to load the "Countries by Risk" data.
For example, for Africa, I went to https://go.ifrc.org/regions/0/risk-watch/seasonal and the page itself loaded quickly, but "Countries by Risk" was stuck on loading. Looking at DevTools I can see that
https://go-risk.northeurope.cloudapp.azure.com/api/v1/seasonal/?region=0 and
https://go-risk.northeurope.cloudapp.azure.com/api/v1/risk-score/?region=0&limit=9999
each took about two minutes.
See screenshots from DevTools below.

Similarly, if I am on the Imminent events page (https://go.ifrc.org/regions/0/risk-watch/imminent) and select one of the countries' events, it takes > 8 seconds to load that one event via the call to https://go-risk.northeurope.cloudapp.azure.com/api/v1/pdc/99638/exposure/ (though in this case I can see it's queuing for a while before the request goes out, not sure what that means).
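For reference, the slow responses can be timed outside the browser as well. Below is a minimal sketch using Python's requests library; the URLs are the ones above, and the timeout value is just illustrative:

```python
# Minimal sketch: time the two slow Risk API calls reported above.
# The URLs come from this issue; the timeout is an arbitrary generous value.
import time

import requests

ENDPOINTS = [
    "https://go-risk.northeurope.cloudapp.azure.com/api/v1/seasonal/?region=0",
    "https://go-risk.northeurope.cloudapp.azure.com/api/v1/risk-score/?region=0&limit=9999",
]

for url in ENDPOINTS:
    start = time.monotonic()
    response = requests.get(url, timeout=300)
    elapsed = time.monotonic() - start
    print(f"{response.status_code}  {elapsed:.1f}s  {url}")
```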

I have been doing Teams calls etc. on this same internet connection, and also using other parts of GO fine, so not sure why this bit of GO is so slow.

Thanks for your help investigating!
cc @justinginnetti

Screenshots etc.

[DevTools screenshots attached]
I have attached my .har file in Teams if useful for investigating.

[additional screenshot attached]

Expected behaviour

Not sure of our SLA for API responses these days, but I think this is too long in any case!

Thanks loads

@tovari

tovari commented Jun 28, 2024

@thenav56, @szabozoltan69, I'm not sure if the recent disk storage issue could be the reason for this?

@nanometrenat, do you still experience such long response times?

@szabozoltan69
Contributor

szabozoltan69 commented Jun 28, 2024

@thenav56, @szabozoltan69, I'm not sure if the recent disk storage issue could be the reason for this?

It could be the reason. After freeing up more disk space, the two queries mentioned above run fast again.

@nanometrenat
Contributor Author

nanometrenat commented Jun 28, 2024

@nanometrenat, do you still experience such long response times?

Hi there, it seems fine today - loading quickly as I would expect. Does the timing of the ticket correlate with when there were storage space issues? If so, then that is presumably a valid explanation! Thanks

@szabozoltan69
Contributor

Does the timing of the ticket correlate with when there were storage space issues?

Yes, I think so. Though no ticket was created for that; it was only discussed with @thenav56.

@nanometrenat
Contributor Author

Great that the incident earlier this week was resolved swiftly!

@szabozoltan69 @thenav56 is the root cause also resolved, i.e. is monitoring in place so we get alerted and can fix it in advance next time? If so, I will happily close this ticket - thanks again!

@thenav56
Member

Hey @nanometrenat @szabozoltan69 @tovari,

We had some issues with the background tasks running on the same server as the API server. A memory leak in the background tasks affected the API server. We've added memory usage limits to the workers, which should fix the issue.
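For illustration, the kind of limit we mean looks roughly like the sketch below, assuming Celery-style background workers; the setting values are placeholders rather than the exact production configuration:

```python
# Sketch of a per-worker memory cap for the background tasks, assuming Celery.
# The concrete values are placeholders, not the production configuration.
from celery import Celery

app = Celery("risk_module")
app.conf.update(
    # Recycle a worker child process after it has handled this many tasks...
    worker_max_tasks_per_child=100,
    # ...or once its resident memory exceeds this limit (in KiB), so a leaking
    # task cannot starve the API server running on the same machine.
    worker_max_memory_per_child=512_000,  # ~500 MiB
)
```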

We also added swap, which impacted disk storage. This was fixed by using the temporary disk provided by Azure, as suggested by @szabozoltan69.

We've also been working on fixing the memory leak and are currently testing this in nightly. We've integrated Sentry profiling and cron monitoring, and we'll be pushing these changes to staging and production soon.
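Roughly, the Sentry side of that looks like the sketch below; the DSN, sample rates, and monitor slug are placeholders rather than the real GO Risk settings:

```python
# Sketch of Sentry performance profiling plus cron monitoring for a periodic task.
# DSN, sample rates, and the monitor slug are illustrative placeholders.
import sentry_sdk
from sentry_sdk.crons import monitor

sentry_sdk.init(
    dsn="https://<key>@sentry.example.org/<project>",
    traces_sample_rate=0.2,    # record a fraction of transactions for performance data
    profiles_sample_rate=0.2,  # profile a fraction of the sampled transactions
)


@monitor(monitor_slug="risk-score-import")  # hypothetical cron job name
def import_risk_scores():
    ...  # periodic task body; check-ins are reported to Sentry automatically
```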

Let's keep this ticket open for now. Once we've pushed the changes to production, we can revisit and close it 😄

@thenav56
Member

Update:
We have pushed the memory leak fix and integrated Sentry monitoring and performance tracing into the Risk module.

We can now use Sentry to track and fix performance issues.
[Sentry performance screenshot attached]

Also, we added a health-check endpoint to track the running instances' state:
https://go-risk.northeurope.cloudapp.azure.com/health-check/
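The endpoint itself isn't shown in this thread, but conceptually it is a small view along these lines (a Django-style sketch; the actual implementation may differ):

```python
# Minimal Django-style health-check sketch: report whether this instance
# can reach its database. Not necessarily the actual GO Risk implementation.
from django.db import connection
from django.http import JsonResponse


def health_check(request):
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
        return JsonResponse({"status": "ok"})
    except Exception as exc:
        return JsonResponse({"status": "error", "detail": str(exc)}, status=503)
```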

@szabozoltan69
Contributor

https://go-risk.northeurope.cloudapp.azure.com/health-check/

Amazing...
Huge appreciation, @thenav56 !

@nanometrenat
Contributor Author

Thanks @thenav56 - brilliant news!

Closing this ticket on the basis that the underlying issue has been resolved and monitoring has been added. Thanks once again to all!
