[Nvidia/GPU] Introduce Nvidia GPU Integration #12768

strawgate · 2025-02-13T14:57:52Z

Proposed commit message

Introduce NVIDIA GPU Monitoring Integration

Checklist

I have reviewed tips for building integrations and this pull request is aligned with them.
I have verified that all data streams collect metrics or logs.
I have added an entry to my package's changelog.yml file.
I have verified that Kibana version constraints are current according to guidelines.
I have verified that any added dashboard complies with Kibana's Dashboard good practices

Author's Checklist

How to test this PR locally

Deploy NVIDIA DCGM on a device with an NVIDIA GPU to get a Prometheus metrics endpoint that you can provide to the integration.

If you have Docker, this only requires:

docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
curl localhost:9400/metrics

Configure the integration to point at the host running the container and GPU, for example http://nvidiahost:9400/metrics

Some metrics are not enabled by default in the container; enabling all metrics requires some extra steps.
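
Before configuring the integration, it can help to confirm which DCGM metrics the endpoint actually exposes. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format. The sample payload, its values and labels, and the helper name `parse_metric_names` are illustrative and not part of this PR.

```python
import re

# Hypothetical sample of what `curl localhost:9400/metrics` might return;
# the HELP text, label values, and readings below are made up for illustration.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="nvidiahost"} 45
DCGM_FI_DEV_POWER_USAGE{gpu="0",Hostname="nvidiahost"} 61.5
'''

# A sample line starts with the metric name, optionally followed by {labels}.
_NAME = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)')


def parse_metric_names(text):
    """Return the set of metric names found in Prometheus text output."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = _NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


names = parse_metric_names(SAMPLE)
print(sorted(names))
```

In practice the text saved from the curl command would replace `SAMPLE`, and the resulting names could be compared against the metrics the ingest pipeline expects.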

Related issues

Fixes #11930

Screenshots

WIP:

strawgate · 2025-02-19T00:33:46Z

Todo: Add k8s container, pod, and namespace info from labels

The integration should also include labels and mappings for the Kubernetes container, pod, and namespace:

container="dcgmproftester11",namespace="default",pod="dcgmproftester"

container => kubernetes.container.name
namespace => kubernetes.namespace
pod => kubernetes.pod.name

- name: kubernetes
  type: group
  fields:
    - name: pod.name
      type: keyword
      description: >
        Kubernetes pod name
    - name: container.name
      type: keyword
      description: >
        Kubernetes container name
    - name: namespace
      type: keyword
      description: >
        Kubernetes namespace

- rename:
    field: prometheus.labels.container
    target_field: kubernetes.container.name
    ignore_missing: true
- rename:
    field: prometheus.labels.namespace
    target_field: kubernetes.namespace
    ignore_missing: true
- rename:
    field: prometheus.labels.pod
    target_field: kubernetes.pod.name
    ignore_missing: true

Perhaps with corresponding dashboard elements.
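
The effect of the proposed rename processors can be sketched outside Elasticsearch. This is a simplified model, not the ingest pipeline itself: it only mirrors the move-if-present behaviour implied by `ignore_missing: true` and does not reproduce every rule of the real rename processor. The helper names and the extra `gpu` label in the sample event are hypothetical.

```python
# Label-to-field mapping proposed in the comment above.
RENAMES = [
    ("prometheus.labels.container", "kubernetes.container.name"),
    ("prometheus.labels.namespace", "kubernetes.namespace"),
    ("prometheus.labels.pod", "kubernetes.pod.name"),
]


def pop_path(doc, path):
    """Remove and return the value at a dotted path, or None if absent."""
    *parents, leaf = path.split(".")
    node = doc
    for key in parents:
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            return None
    return node.pop(leaf, None) if isinstance(node, dict) else None


def set_path(doc, path, value):
    """Set a value at a dotted path, creating intermediate objects."""
    *parents, leaf = path.split(".")
    node = doc
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def apply_renames(doc):
    """Move each source field to its target; skip fields that are missing."""
    for source, target in RENAMES:
        value = pop_path(doc, source)
        if value is not None:  # models ignore_missing: true
            set_path(doc, target, value)
    return doc


# Sample event built from the labels quoted above; "gpu" is a made-up extra label.
event = {
    "prometheus": {
        "labels": {
            "container": "dcgmproftester11",
            "namespace": "default",
            "pod": "dcgmproftester",
            "gpu": "0",
        }
    }
}
print(apply_renames(event))
```

Events scraped from a non-Kubernetes host carry none of these labels, so the sketch leaves them unchanged, which is the reason for `ignore_missing: true` in the processors.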

ishleenk17

Thanks for the contribution.
I have shared an initial set of comments from the first review.

packages/nvidia_gpu/manifest.yml

packages/nvidia_gpu/changelog.yml

ishleenk17 · 2025-02-19T06:49:04Z

packages/nvidia_gpu/data_stream/stats/agent/stream/stream.yml.hbs

+  - {{this}}
+{{/each}}
+period: {{period}}
+use_types: true


Do we not want these options to be configurable?

The ingest pipeline and mappings expect a certain format coming from the Prometheus input of Metricbeat. I believe changing these options would change that format, and the ingest pipeline and mappings would no longer be valid.
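
The dependency described in this reply can be illustrated with a small sketch. It assumes that with `use_types: true` a gauge lands under `prometheus.<metric>.value` (the shape the pipeline in this PR reads), and that without it the value lands under `prometheus.metrics.<metric>`. The second shape is an assumption about the Metricbeat Prometheus collector that should be checked against its documentation. Both helper functions and the sample value are hypothetical.

```python
def gauge_event(name, value, use_types):
    """Build a simplified event for one gauge sample (illustrative only)."""
    if use_types:
        # Typed layout: one object per metric with a "value" field.
        return {"prometheus": {name: {"value": value}}}
    # Assumed untyped layout: all metrics grouped under "metrics".
    return {"prometheus": {"metrics": {name: value}}}


def pipeline_source_present(event, name):
    """Check for the prometheus.<name>.value field the rename processors read."""
    return "value" in event.get("prometheus", {}).get(name, {})


typed = gauge_event("DCGM_FI_DEV_GPU_TEMP", 45, use_types=True)
untyped = gauge_event("DCGM_FI_DEV_GPU_TEMP", 45, use_types=False)

print(pipeline_source_present(typed, "DCGM_FI_DEV_GPU_TEMP"))
print(pipeline_source_present(untyped, "DCGM_FI_DEV_GPU_TEMP"))
```

Under these assumptions the rename processors find their source field only in the typed layout, which is why the option is fixed in the stream template rather than exposed to users.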

packages/nvidia_gpu/data_stream/stats/fields/fields.yml

packages/nvidia_gpu/data_stream/stats/manifest.yml

packages/nvidia_gpu/data_stream/stats/sample_event.json

packages/nvidia_gpu/_dev/build/docs/README.md

agithomas · 2025-02-19T08:07:53Z

packages/nvidia_gpu/data_stream/stats/elasticsearch/ingest_pipeline/default.yml

+    target_field: gpu.nvlink.bandwidth.total
+    ignore_missing: true
+- rename:
+    field: prometheus.DCGM_FI_DEV_GPU_UTIL.value


While going through this issue, I found mentions of high resource utilisation and of the deprecation of a few metrics, including DCGM_FI_DEV_GPU_UTIL.

Was this observed during integration testing? If any metrics are deprecated or resource-intensive to collect, it would be best not to use them in dashboard visualisations.

I will remove this metric and consider replacing it with the referenced ones.

packages/nvidia_gpu/data_stream/stats/fields/fields.yml

packages/nvidia_gpu/manifest.yml

agithomas · 2025-02-19T08:45:01Z

Added @daniela-elastic as the reviewer for the dashboard.

elasticmachine · 2025-02-19T16:44:05Z

💔 Build Failed

Buildkite Build
Commit: 11d5263

Failed CI Steps

Check integrations nvidia_gpu

History

💔 Build #22526 failed 58dae3b
💔 Build #22483 failed 4aaef17
💔 Build #22255 failed ec8bb4a
💔 Build #22241 failed 0bce5f9

New branch for Nvidia GPU Integration

0bce5f9

This was referenced Feb 13, 2025

[NVIDIA GPU] Introduce Monitoring Integration #12581

Closed

[NVIDIA GPU] Introduce Monitoring Integration #11931

Closed

add codeowners

ec8bb4a

Set owner and add support for k8s-related labels

4aaef17

lalit-satapathy requested review from agithomas and ishleenk17 February 19, 2025 06:43

ishleenk17 reviewed Feb 19, 2025

agithomas reviewed Feb 19, 2025

packages/nvidia_gpu/_dev/build/docs/README.md

agithomas reviewed Feb 19, 2025

packages/nvidia_gpu/data_stream/stats/fields/fields.yml

agithomas reviewed Feb 19, 2025

packages/nvidia_gpu/manifest.yml

agithomas requested a review from daniela-elastic February 19, 2025 08:44

strawgate added 2 commits February 19, 2025 09:39

Updates from PR Feedback

58dae3b

Small formatting updates

11d5263
