
[PROD] Performance issue - extremely slow loading of riskwatch data on GO - please investigate #2184

Closed
nanometrenat opened this issue Jun 26, 2024 · 9 comments

@nanometrenat
Contributor

Issue

Risk Watch API calls are taking much too long - around two whole minutes to load the "Countries by Risk" data.
For example, for Africa, I went to https://go.ifrc.org/regions/0/risk-watch/seasonal and the page itself loaded quickly, but "Countries by Risk" was stuck on loading. Looking at DevTools I can see that
https://go-risk.northeurope.cloudapp.azure.com/api/v1/seasonal/?region=0 and
https://go-risk.northeurope.cloudapp.azure.com/api/v1/risk-score/?region=0&limit=9999
each took about two minutes.
See screenshots from DevTools below.

Similarly, if I am on the Imminent events page (https://go.ifrc.org/regions/0/risk-watch/imminent) and select one of the countries' events, it takes > 8 seconds to load that one event via the call to https://go-risk.northeurope.cloudapp.azure.com/api/v1/pdc/99638/exposure/ (though in this case I can see it's queuing for a while before the request goes out, not sure what that means).
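For reference, the slow responses can be timed outside the browser as well. Below is a minimal sketch using Python's requests library; the URLs are the ones above, and the timeout value is just illustrative:

```python
# Minimal sketch: time the two slow Risk API calls reported above.
# The URLs come from this issue; the timeout is an arbitrary generous value.
import time

import requests

ENDPOINTS = [
    "https://go-risk.northeurope.cloudapp.azure.com/api/v1/seasonal/?region=0",
    "https://go-risk.northeurope.cloudapp.azure.com/api/v1/risk-score/?region=0&limit=9999",
]

for url in ENDPOINTS:
    start = time.monotonic()
    response = requests.get(url, timeout=300)
    elapsed = time.monotonic() - start
    print(f"{response.status_code}  {elapsed:.1f}s  {url}")
```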

I have been doing Teams calls etc. on this same internet connection, and also using other parts of GO fine, so not sure why this bit of GO is so slow.

Thanks for your help investigating!
cc @justinginnetti

Screenshots etc.

[DevTools screenshots attached]
I have attached my .har file in Teams if useful for investigating.

[additional screenshot attached]

Expected behaviour

Not sure of our SLA for API responses these days, but I think this is too long in any case!

Thanks loads

@tovari

tovari commented Jun 28, 2024

@thenav56, @szabozoltan69, I'm not sure if the recent disk storage issue could be the reason for this?

@nanometrenat, do you still experience such long response times?

@szabozoltan69
Contributor

szabozoltan69 commented Jun 28, 2024

@thenav56, @szabozoltan69, I'm not sure if the recent disk storage issue could be the reason for this?

It could be the reason. After freeing up more disk space, the two queries mentioned above run fast again.

@nanometrenat
Contributor Author

nanometrenat commented Jun 28, 2024

@nanometrenat, do you still experience such long response times?

Hi there, it seems fine today - loading quickly as I would expect. Does the timing of the ticket correlate with when there were storage space issues? If so, then that is presumably a valid explanation! Thanks

@szabozoltan69
Contributor

Does the timing of the ticket correlate with when there were storage space issues?

Yes, I think so. Though no ticket was created for that; it was only discussed with @thenav56.

@nanometrenat
Contributor Author

Great that the incident earlier this week was resolved swiftly!

@szabozoltan69 @thenav56 is the root cause also resolved, i.e. is monitoring in place so we get alerted and can fix it in advance next time? If so, I will happily close this ticket - thanks again!

@thenav56
Member

Hey @nanometrenat @szabozoltan69 @tovari,

We had some issues with the background tasks running on the same server as the API server. A memory leak in the background tasks affected the API server. We've added memory usage limits to the workers, which should fix the issue.
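For illustration, the kind of limit we mean looks roughly like the sketch below, assuming Celery-style background workers; the setting values are placeholders rather than the exact production configuration:

```python
# Sketch of a per-worker memory cap for the background tasks, assuming Celery.
# The concrete values are placeholders, not the production configuration.
from celery import Celery

app = Celery("risk_module")
app.conf.update(
    # Recycle a worker child process after it has handled this many tasks...
    worker_max_tasks_per_child=100,
    # ...or once its resident memory exceeds this limit (in KiB), so a leaking
    # task cannot starve the API server running on the same machine.
    worker_max_memory_per_child=512_000,  # ~500 MiB
)
```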

We also added swap, which impacted disk storage. This was fixed by using the temporary disk provided by Azure, as suggested by @szabozoltan69.

We've also been working on fixing the memory leak and are currently testing this in nightly. We've integrated Sentry profiling and cron monitoring, and we'll be pushing these changes to staging and production soon.
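Roughly, the Sentry side of that looks like the sketch below; the DSN, sample rates, and monitor slug are placeholders rather than the real GO Risk settings:

```python
# Sketch of Sentry performance profiling plus cron monitoring for a periodic task.
# DSN, sample rates, and the monitor slug are illustrative placeholders.
import sentry_sdk
from sentry_sdk.crons import monitor

sentry_sdk.init(
    dsn="https://<key>@sentry.example.org/<project>",
    traces_sample_rate=0.2,    # record a fraction of transactions for performance data
    profiles_sample_rate=0.2,  # profile a fraction of the sampled transactions
)


@monitor(monitor_slug="risk-score-import")  # hypothetical cron job name
def import_risk_scores():
    ...  # periodic task body; check-ins are reported to Sentry automatically
```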

Let's keep this ticket open for now. Once we've pushed the changes to production, we can revisit and close it 😄

@thenav56
Member

Update:
We have pushed the memory leak fix and integrated Sentry monitoring and performance tracing into the Risk module.

We can now use Sentry to track and fix performance issues.
[Sentry performance screenshot attached]

Also, we added a health-check endpoint to track the running instances' state:
https://go-risk.northeurope.cloudapp.azure.com/health-check/
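The endpoint itself isn't shown in this thread, but conceptually it is a small view along these lines (a Django-style sketch; the actual implementation may differ):

```python
# Minimal Django-style health-check sketch: report whether this instance
# can reach its database. Not necessarily the actual GO Risk implementation.
from django.db import connection
from django.http import JsonResponse


def health_check(request):
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
        return JsonResponse({"status": "ok"})
    except Exception as exc:
        return JsonResponse({"status": "error", "detail": str(exc)}, status=503)
```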

@szabozoltan69
Contributor

https://go-risk.northeurope.cloudapp.azure.com/health-check/

Amazing...
Huge appreciation, @thenav56 !

@nanometrenat
Contributor Author

Thanks @thenav56 - brilliant news!

Closing this ticket on the basis that the underlying issue has been resolved and monitoring has been added. Thanks once again to all!
