Pass GOMEMLIMIT as env variable #56

Merged Jul 1, 2024 (2 commits)

Contributor

@sam6134 sam6134 commented Jun 28, 2024

Issue #, if available: https://sim.amazon.com/issues/CWQS-1421

Description of changes:
Neuron monitor pods are being OOM-killed (exit code 137). The mitigation is to update the neuron-monitor DaemonSet spec to pass an additional environment variable, GOMEMLIMIT, with the value 160MB.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

TESTING

  • Enabled GC logs in neuron monitor with the memory limit at 160MiB: GC cycles are infrequent because the limit is far above the heap size.
kubectl logs -n amazon-cloudwatch neuron-monitor-75xsl
gc 1 @0.015s 7%: 1.6+7.9+0.006 ms clock, 13+0.20/0.41/0.003+0.055 ms cpu, 4->4->3 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 2 @10.026s 0%: 0.020+0.29+0.012 ms clock, 0.16+0.19/0.31/0.12+0.096 ms cpu, 7->7->4 MB, 7 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 3 @40.026s 0%: 0.025+0.27+0.019 ms clock, 0.20+0.096/0.35/0.26+0.15 ms cpu, 9->9->4 MB, 9 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 4 @69.713s 0%: 0.029+0.25+0.004 ms clock, 0.23+0.10/0.30/0.38+0.032 ms cpu, 9->9->4 MB, 9 MB goal, 0 MB stacks, 0 MB globals, 8 P
  • With the memory limit at 5MiB: GC cycles are far too frequent because the limit is very strict.
gc 1 @0.000s 3%: 0.010+0.12+0.002 ms clock, 0.082+0.054/0.11/0+0.020 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 2 @0.000s 9%: 0.046+0.13+0.002 ms clock, 0.37+0.066/0.15/0+0.017 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 3 @0.001s 10%: 0.008+0.12+0.002 ms clock, 0.066+0.069/0.14/0.013+0.016 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 4 @0.001s 12%: 0.048+0.15+0.001 ms clock, 0.38+0.064/0.15/0.003+0.015 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 5 @0.001s 15%: 0.049+0.095+0.003 ms clock, 0.39+0.059/0.12/0.019+0.026 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 6 @0.001s 16%: 0.061+0.17+0.003 ms clock, 0.49+0.071/0.071/0.020+0.026 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 7 @0.002s 18%: 0.061+0.11+0.003 ms clock, 0.49+0.048/0.10/0.005+0.025 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 8 @0.002s 19%: 0.059+0.12+0.002 ms clock, 0.47+0.061/0.11/0+0.019 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 9 @0.002s 19%: 0.008+0.11+0.002 ms clock, 0.064+0.063/0.12/0.054+0.017 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 10 @0.002s 20%: 0.069+0.13+0.002 ms clock, 0.55+0.050/0.11/0.033+0.018 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 11 @0.003s 21%: 0.052+0.11+0.003 ms clock, 0.42+0.060/0.10/0.050+0.030 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 12 @0.003s 20%: 0.006+0.22+0.001 ms clock, 0.048+0.076/0.074/0+0.013 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 13 @0.003s 21%: 0.050+0.11+0.002 ms clock, 0.40+0.047/0.10/0.021+0.020 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 14 @0.003s 22%: 0.050+0.12+0.028 ms clock, 0.40+0.028/0.084/0.021+0.22 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 15 @0.004s 22%: 0.035+0.10+0.002 ms clock, 0.28+0.052/0.053/0.040+0.023 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 16 @0.004s 22%: 0.057+0.084+0.002 ms clock, 0.46+0.044/0.093/0.007+0.020 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 17 @0.004s 23%: 0.057+0.081+0.003 ms clock, 0.46+0.041/0.098/0.009+0.027 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 18 @0.004s 23%: 0.050+0.12+0.002 ms clock, 0.40+0.047/0.085/0+0.022 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 19 @0.005s 24%: 0.048+0.15+0.027 ms clock, 0.38+0.033/0.11/0.047+0.21 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 20 @0.005s 24%: 0.047+0.13+0.002 ms clock, 0.38+0.059/0.14/0.037+0.017 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 21 @0.005s 24%: 0.062+0.17+0.002 ms clock, 0.50+0.055/0.16/0.025+0.019 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 22 @0.005s 25%: 0.10+0.11+0.002 ms clock, 0.86+0.042/0.14/0.017+0.019 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 23 @0.006s 25%: 0.059+0.16+0.004 ms clock, 0.47+0.047/0.11/0.026+0.035 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 24 @0.006s 25%: 0.032+0.14+0.003 ms clock, 0.26+0.071/0.15/0.027+0.027 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 25 @0.006s 26%: 0.059+0.081+0.002 ms clock, 0.47+0.033/0.10/0.044+0.022 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 26 @0.006s 26%: 0.034+0.11+0.002 ms clock, 0.27+0.055/0.13/0.036+0.019 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 27 @0.007s 26%: 0.010+0.16+0.002 ms clock, 0.082+0.053/0.18/0.035+0.018 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 28 @0.007s 26%: 0.057+0.13+0.002 ms clock, 0.45+0.050/0.095/0.053+0.016 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 29 @0.007s 26%: 0.061+0.11+0.002 ms clock, 0.49+0.092/0.062/0.008+0.021 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 30 @0.007s 26%: 0.027+0.13+0.003 ms clock, 0.22+0.062/0.095/0.15+0.027 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 31 @0.008s 26%: 0.065+0.15+0.002 ms clock, 0.52+0.050/0.22/0.037+0.017 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 32 @0.008s 26%: 0.078+0.15+0.002 ms clock, 0.63+0.046/0.12/0.038+0.017 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 33 @0.008s 26%: 0.066+0.16+0.003 ms clock, 0.53+0.042/0.095/0.028+0.030 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 34 @0.009s 26%: 0.054+0.15+0.002 ms clock, 0.43+0.045/0.15/0.012+0.016 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 35 @0.009s 27%: 0.061+0.11+0.002 ms clock, 0.49+0.071/0.12/0.018+0.021 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P
gc 36 @0.009s 27%: 0.054+0.11+0.001 ms clock, 0.43+0.046/0.14/0.026+0.014 ms cpu, 0->0->0 MB, 0 MB goal, 0 MB stacks, 0 MB globals, 8 P

@sam6134 sam6134 requested review from movence and aditya-purang June 28, 2024 11:05
@@ -35,6 +35,8 @@ spec:
       fieldPath: spec.nodeName
 - name: PATH
   value: /usr/local/bin:/usr/bin:/bin:/opt/aws/neuron/bin
+- name: GOMEMLIMIT
+  value: 160MB
Contributor

@movence movence Jun 28, 2024

Should this be an integer value, such as 160*1024*1024?

Contributor Author

@sam6134 sam6134 Jun 28, 2024

Nope. It's a Go runtime variable, and the runtime understands unit suffixes. Supported suffixes include B, KiB, MiB, GiB, and TiB: https://pkg.go.dev/runtime#hdr-Environment_Variables.

Contributor

How did you verify that this env var is being applied to neuron monitor?

Contributor Author

Added a TESTING section to the description.

@sam6134 sam6134 requested a review from movence June 28, 2024 13:42
@sam6134 sam6134 merged commit 2322e71 into aws-observability:main Jul 1, 2024
0 of 3 checks passed
3 participants