[Nvidia/GPU] Introduce Nvidia GPU Integration #12768
Todo: Add k8s container, pod, and namespace info from labels.
Should also include labels / mappings for the Kubernetes container, pod, and namespace, e.g.:
container => kubernetes.container.name
Perhaps with corresponding dashboard elements.
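A rough sketch of the mapping proposed in this todo, written in Python for illustration only. Only the `container => kubernetes.container.name` entry comes from the comment; the `pod` and `namespace` targets are assumed by analogy, and `map_k8s_labels` is a hypothetical helper, not part of the integration.

```python
# Hypothetical label-to-field mapping. Only the "container" entry is taken
# from the todo; the other two targets are assumptions made by analogy.
LABEL_TO_FIELD = {
    "container": "kubernetes.container.name",
    "pod": "kubernetes.pod.name",
    "namespace": "kubernetes.namespace",
}


def map_k8s_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return the Kubernetes fields derived from a metric's labels.

    Labels that are not in the mapping, or whose value is empty, are skipped.
    """
    return {
        LABEL_TO_FIELD[name]: value
        for name, value in labels.items()
        if name in LABEL_TO_FIELD and value
    }
```

Skipping empty values is a design choice in this sketch, so that a GPU with no pod attached would not produce empty `kubernetes.*` fields.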
Thanks for the contribution.
Sharing an initial set of comments from the first review.
- {{this}}
{{/each}}
period: {{period}}
use_types: true
Don't we want these options to be configurable?
The ingest pipeline and mappings expect a specific document format from Metricbeat's Prometheus input. I believe changing these options would change that format, and the ingest pipeline and mappings would no longer be valid.
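To illustrate the concern, here is a simplified Python stand-in for a `rename` step with `ignore_missing: true`. It is not Elasticsearch's implementation. The typed document shape follows the `prometheus.DCGM_FI_DEV_GPU_UTIL.value` field used in this PR; the untyped shape (`prometheus.metrics.<name>`) is my assumption about what the input emits without `use_types`, and the target field `gpu.example.utilization` is made up for the example.

```python
def rename(doc: dict, field: str, target_field: str, ignore_missing: bool = True) -> bool:
    """Move a dotted field to a new dotted path. Return True if it was moved."""
    *parents, leaf = field.split(".")
    node = doc
    # Walk down to the parent object of the source field.
    for key in parents:
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            break
    if not isinstance(node, dict) or leaf not in node:
        if ignore_missing:
            return False  # silently skipped, nothing is renamed
        raise KeyError(field)
    value = node.pop(leaf)
    # Create the target path as nested objects and store the value.
    *target_parents, target_leaf = target_field.split(".")
    target = doc
    for key in target_parents:
        target = target.setdefault(key, {})
    target[target_leaf] = value
    return True


# Shape the pipeline expects (use_types: true).
typed_doc = {"prometheus": {"DCGM_FI_DEV_GPU_UTIL": {"value": 42.0}}}
# Assumed shape if use_types were turned off.
untyped_doc = {"prometheus": {"metrics": {"DCGM_FI_DEV_GPU_UTIL": 42.0}}}

SOURCE = "prometheus.DCGM_FI_DEV_GPU_UTIL.value"
TARGET = "gpu.example.utilization"  # hypothetical target field

typed_moved = rename(typed_doc, SOURCE, TARGET)
untyped_moved = rename(untyped_doc, SOURCE, TARGET)
```

Under these assumptions the typed document gets its field renamed, while the untyped one passes through unchanged and without an error, so the mappings and dashboards would quietly see no data.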
target_field: gpu.nvlink.bandwidth.total
ignore_missing: true
- rename:
field: prometheus.DCGM_FI_DEV_GPU_UTIL.value
While going through this issue, I noticed it mentions:
- high resource utilisation
- deprecation of a few metrics, including DCGM_FI_DEV_GPU_UTIL.
Was this observed during integration testing? If any metrics are deprecated or resource intensive to collect, it would be best not to use them for dashboard visualisations.
Will remove this metric and consider replacing it with the referenced ones.
Added @daniela-elastic as the reviewer for the dashboard.
💔 Build Failed
Proposed commit message
Introduce NVIDIA GPU Monitoring Integration
Checklist
changelog.yml file.
Author's Checklist
How to test this PR locally
Deploy NVIDIA DCGM on a device with an NVIDIA GPU to get a Prometheus metrics endpoint that you can provide to the integration.
If you have Docker, this just requires:
Configure the integration to point at the host running the container and GPU
http://nvidiahost:9400/metrics
Some metrics are not enabled by default in the container; enabling all metrics requires some extra steps.
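To see which metrics a given endpoint actually exposes before pointing the integration at it, a small Python check along these lines may help. It is a sketch: `metric_names` and `fetch_metric_names` are hypothetical helpers, the parsing covers only the basic Prometheus text format, and the sample payload is made up for illustration rather than captured from a real exporter.

```python
import re
from urllib.request import urlopen


def metric_names(exposition: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The name ends at the first "{" (labels) or whitespace (value).
        names.add(re.split(r"[{\s]", line, maxsplit=1)[0])
    return names


def fetch_metric_names(url: str, timeout: float = 5.0) -> set[str]:
    """Fetch a metrics endpoint and return the metric names it exposes."""
    with urlopen(url, timeout=timeout) as response:
        return metric_names(response.read().decode("utf-8"))


# Made-up sample payload, used only to show the parsing.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature.
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 3
"""

sample_names = metric_names(SAMPLE)
```

Calling `fetch_metric_names("http://nvidiahost:9400/metrics")` against the host above should return the set of metrics currently enabled, which can then be compared with the fields the ingest pipeline renames.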
Related issues
Fixes #11930
Screenshots
WIP: