[Nvidia/GPU] Introduce Nvidia GPU Integration #12768

Open. Wants to merge 5 commits into base: main. Showing changes from 3 commits.

1 change: 1 addition & 0 deletions .github/CODEOWNERS
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -305,6 +305,7 @@
/packages/nginx @elastic/obs-infraobs-integrations
/packages/nginx_ingress_controller @elastic/obs-cloudnative-monitoring
/packages/nginx_ingress_controller_otel @elastic/obs-infraobs-integrations
/packages/nvidia_gpu @elastic/obs-infraobs-integrations
/packages/o365 @elastic/security-service-integrations
/packages/okta @elastic/security-service-integrations
/packages/openai @elastic/obs-infraobs-integrations
93 changes: 93 additions & 0 deletions packages/nvidia_gpu/LICENSE.txt
@@ -0,0 +1,93 @@
Elastic License 2.0

URL: https://www.elastic.co/licensing/elastic-license

## Acceptance

By using the software, you agree to all of the terms and conditions below.

## Copyright License

The licensor grants you a non-exclusive, royalty-free, worldwide,
non-sublicensable, non-transferable license to use, copy, distribute, make
available, and prepare derivative works of the software, in each case subject to
the limitations and conditions below.

## Limitations

You may not provide the software to third parties as a hosted or managed
service, where the service provides users with access to any substantial set of
the features or functionality of the software.

You may not move, change, disable, or circumvent the license key functionality
in the software, and you may not remove or obscure any functionality in the
software that is protected by the license key.

You may not alter, remove, or obscure any licensing, copyright, or other notices
of the licensor in the software. Any use of the licensor’s trademarks is subject
to applicable law.

## Patents

The licensor grants you a license, under any patent claims the licensor can
license, or becomes able to license, to make, have made, use, sell, offer for
sale, import and have imported the software, in each case subject to the
limitations and conditions in this license. This license does not cover any
patent claims that you cause to be infringed by modifications or additions to
the software. If you or your company make any written claim that the software
infringes or contributes to infringement of any patent, your patent license for
the software granted under these terms ends immediately. If your company makes
such a claim, your patent license ends immediately for work on behalf of your
company.

## Notices

You must ensure that anyone who gets a copy of any part of the software from you
also gets a copy of these terms.

If you modify the software, you must include in any modified copies of the
software prominent notices stating that you have modified the software.

## No Other Rights

These terms do not imply any licenses other than those expressly granted in
these terms.

## Termination

If you use the software in violation of these terms, such use is not licensed,
and your licenses will automatically terminate. If the licensor provides you
with a notice of your violation, and you cease all violation of this license no
later than 30 days after you receive that notice, your licenses will be
reinstated retroactively. However, if you violate these terms after such
reinstatement, any additional violation of these terms will cause your licenses
to terminate automatically and permanently.

## No Liability

*As far as the law allows, the software comes as is, without any warranty or
condition, and the licensor will not be liable to you for any damages arising
out of these terms or the use or nature of the software, under any kind of
legal claim.*

## Definitions

The **licensor** is the entity offering these terms, and the **software** is the
software the licensor makes available under these terms, including any portion
of it.

**you** refers to the individual or entity agreeing to these terms.

**your company** is any legal entity, sole proprietorship, or other kind of
organization that you work for, plus all organizations that have control over,
are under the control of, or are under common control with that
organization. **control** means ownership of substantially all the assets of an
entity, or the power to direct its management and policies by vote, contract, or
otherwise. Control can be direct or indirect.

**your licenses** are all the licenses granted to you for the software under
these terms.

**use** means anything you do with the software requiring one of your licenses.

**trademark** means trademarks, service marks, and similar rights.
26 changes: 26 additions & 0 deletions packages/nvidia_gpu/_dev/build/docs/README.md
@@ -0,0 +1,26 @@
# NVIDIA GPU Monitoring

Use the NVIDIA GPU Monitoring integration to monitor the health and performance of your NVIDIA GPUs. The integration collects metrics from the NVIDIA Data Center GPU Manager and sends them to Elasticsearch.

## Data streams

**stats** gives you insight into the state of the NVIDIA GPUs.
Metric data streams collected by the NVIDIA GPU Monitoring integration include `stats`. See the [Metrics reference](#metrics-reference) for more details.

## Requirements

You need Elasticsearch for storing and searching your data and Kibana for visualizing and managing it.
You can use our hosted Elasticsearch Service on Elastic Cloud, which is recommended, or self-manage the Elastic Stack on your own hardware.

You need the NVIDIA Data Center GPU Manager (DCGM) installed on your system (or exposed via a Docker container with the GPU device mounted) to collect metrics from the NVIDIA GPUs. You can download DCGM from the [NVIDIA website](https://developer.nvidia.com/dcgm). By default, the DCGM exporter does not expose all available metrics.

## Setup

For step-by-step instructions on how to set up an integration, see the
[Getting started](https://www.elastic.co/guide/en/welcome-to-elastic/current/getting-started-observability.html) guide.

When running on Kubernetes, you can use `${env.NODE_NAME}` to get the node name for use in the `hosts` field. For example: `hosts: http://${env.NODE_NAME}:9400/metrics`.
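
For `${env.NODE_NAME}` to resolve, the agent container needs a `NODE_NAME` environment variable. The fragment below is a minimal sketch of one way to provide it, using the Kubernetes downward API in the agent's container spec. It assumes the DCGM exporter is reachable on the node's own address at port 9400; adjust it if your manifest already defines `NODE_NAME` or if the exporter is exposed differently.

```yaml
# Illustrative fragment of the agent container spec (not a complete manifest).
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        # Downward API: resolves to the name of the node running the pod.
        fieldPath: spec.nodeName
```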


{{event "stats"}}
{{fields "stats"}}
6 changes: 6 additions & 0 deletions packages/nvidia_gpu/changelog.yml
@@ -0,0 +1,6 @@
# newer versions go on top
- version: "0.1.0"
  changes:
    - description: Initial introduction of Nvidia GPU Monitoring
      type: enhancement
      link: https://github.com/elastic/integrations/pull/11931
27 changes: 27 additions & 0 deletions packages/nvidia_gpu/data_stream/stats/agent/stream/stream.yml.hbs
@@ -0,0 +1,27 @@
hosts:
{{#each hosts}}
  - {{this}}
{{/each}}
period: {{period}}
use_types: true

Contributor:

Don't we want these options to be configurable?

Contributor Author:

The ingest pipeline and mappings expect a specific format from Metricbeat's Prometheus input. I believe changing these options would change that format, and the ingest pipeline and mappings would no longer be valid.
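
To make the format dependency concrete, here is a rough sketch of the event shape that the pipeline's `rename` processors look for. The metric and label names are taken from the pipeline in this PR. The values and label contents are made-up placeholders, and the suffix behaviour (`.value` for gauges, `.counter` for counters) is an assumption about how the Prometheus collector emits typed metrics with `use_types: true` and `rate_counters: false`.

```yaml
# Hypothetical event fragment; all values and label contents are placeholders.
prometheus:
  labels:
    gpu: "0"
    device: nvidia0
    Hostname: gpu-node-1
  DCGM_FI_DEV_GPU_TEMP:
    value: 45          # gauge, read by the pipeline as ...GPU_TEMP.value
  DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION:
    counter: 123456    # counter, read as ...TOTAL_ENERGY_CONSUMPTION.counter
```

If these options were changed, the source paths would presumably no longer match, and because every `rename` sets `ignore_missing: true`, the fields would be dropped from the `gpu.*` namespace silently rather than raising an error.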

rate_counters: false
username: {{username}}
password: {{password}}
metrics_filters.exclude:
{{#each metrics_filters.exclude}}
  - {{this}}
{{/each}}
metrics_filters.include:
{{#each metrics_filters.include}}
  - {{this}}
{{/each}}
{{#if ssl.certificate_authorities}}
ssl.certificate_authorities:
{{#each ssl.certificate_authorities}}
  - {{this}}
{{/each}}
{{/if}}
{{#if processors}}
processors:
{{processors}}
{{/if}}
@@ -0,0 +1,201 @@
---
description: Pipeline for NVIDIA GPU Metrics
processors:
  - rename:
      field: prometheus.DCGM_FI_DEV_MEM_COPY_UTIL.value
      target_field: gpu.memory.copy_utilization
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_FB_USED.value
      target_field: gpu.framebuffer.size.used
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_FB_FREE.value
      target_field: gpu.framebuffer.size.free
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_POWER_USAGE.value
      target_field: gpu.power.usage
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_SM_CLOCK.value
      target_field: gpu.streaming_multiprocessor.frequency
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_ENC_UTIL.value
      target_field: gpu.encoder.utilization
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_DEC_UTIL.value
      target_field: gpu.decoder.utilization
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_GPU_TEMP.value
      target_field: gpu.temperature
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_VGPU_LICENSE_STATUS.value
      target_field: gpu.license.vgpu
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_MEM_CLOCK.value
      target_field: gpu.memory.frequency
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION.counter
      target_field: gpu.energy.total
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL.counter
      target_field: gpu.nvlink.bandwidth.total
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_GPU_UTIL.value

Contributor:

While going through this issue, I saw mentions of

  • high resource utilisation
  • deprecation of a few metrics, including DCGM_FI_DEV_GPU_UTIL.

Was this observed during integration testing? If any metrics are deprecated or resource-intensive to collect, it would be best not to use them for dashboard visualisations.

Contributor Author:

Will remove this metric and consider replacing it with the referenced ones.

      target_field: gpu.utilization
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_MEMORY_TEMP.value
      target_field: gpu.memory.temperature
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_PCIE_REPLAY_COUNTER.rate
      target_field: gpu.pcie.replay
      ignore_missing: true
  - rename:
      field: prometheus.labels.modelName
      target_field: gpu.device.model
      ignore_missing: true
  - rename:
      field: prometheus.labels.instance
      target_field: prometheus.node.name
      ignore_missing: true
  - rename:
      field: prometheus.labels.pci_bus_id
      target_field: gpu.pci.bus.id
      ignore_missing: true
  - rename:
      field: prometheus.labels.Hostname
      target_field: prometheus.node.hostname
      ignore_missing: true
  - rename:
      field: prometheus.labels.job
      target_field: prometheus.node.job
      ignore_missing: true
  - rename:
      field: prometheus.labels.DCGM_FI_DRIVER_VERSION
      target_field: gpu.driver.version
      ignore_missing: true
  - rename:
      field: prometheus.labels.UUID
      target_field: gpu.device.uuid
      ignore_missing: true
  - rename:
      field: prometheus.labels.device
      target_field: gpu.device.name
      ignore_missing: true
  - rename:
      field: prometheus.labels.gpu
      target_field: gpu.device.id
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_XID_ERRORS.value
      target_field: gpu.error.xid
      ignore_missing: true
  - rename:
      field: prometheus.labels.err_code
      target_field: gpu.error.code
      ignore_missing: true
  - rename:
      field: prometheus.labels.err_msg
      target_field: gpu.error.message
      ignore_missing: true
  - rename:
      field: prometheus.labels.container
      target_field: kubernetes.container.name
      ignore_missing: true
  - rename:
      field: prometheus.labels.namespace
      target_field: kubernetes.namespace
      ignore_missing: true
  - rename:
      field: prometheus.labels.pod
      target_field: kubernetes.pod.name
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_ECC_DBE_AGG_TOTAL.rate
      target_field: gpu.memory.errors.double_bit_persistent
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_ECC_SBE_AGG_TOTAL.rate
      target_field: gpu.memory.errors.single_bit_persistent
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_ECC_DBE_VOL_TOTAL.rate
      target_field: gpu.memory.errors.double_bit_volatile
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_ECC_SBE_VOL_TOTAL.rate
      target_field: gpu.memory.errors.single_bit_volatile
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_SYNC_BOOST_VIOLATION.rate
      target_field: gpu.throttling.sync_boost
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_THERMAL_VIOLATION.rate
      target_field: gpu.throttling.thermal
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_LOW_UTIL_VIOLATION.rate
      target_field: gpu.throttling.low_utilization
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_BOARD_LIMIT_VIOLATION.rate
      target_field: gpu.throttling.board_limit
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_POWER_VIOLATION.rate
      target_field: gpu.throttling.power
      ignore_missing: true
  - rename:
      field: prometheus.DCGM_FI_DEV_RELIABILITY_VIOLATION.rate
      target_field: gpu.throttling.reliability
      ignore_missing: true
  - rename:
      field: prometheus.labels.DCGM_FI_NVML_VERSION
      target_field: gpu.driver.nvml_version
      ignore_missing: true
  - rename:
      field: prometheus.labels.DCGM_FI_DEV_OEM_INFOROM_VER
      target_field: gpu.device.info_rom.oem_version
      ignore_missing: true
  - rename:
      field: prometheus.labels.DCGM_FI_DEV_INFOROM_IMAGE_VER
      target_field: gpu.device.info_rom.version
      ignore_missing: true
  - rename:
      field: prometheus.labels.DCGM_FI_DEV_VBIOS_VERSION
      target_field: gpu.device.vbios.version
      ignore_missing: true
  - rename:
      field: prometheus.labels.DCGM_FI_DEV_BRAND
      target_field: gpu.device.brand
      ignore_missing: true

  - remove:
      field:
        - "prometheus.DCGM_FI_DEV_PCIE_REPLAY_COUNTER"
        - "prometheus.DCGM_FI_DEV_ECC_DBE_AGG_TOTAL"
        - "prometheus.DCGM_FI_DEV_ECC_SBE_AGG_TOTAL"
        - "prometheus.DCGM_FI_DEV_ECC_DBE_VOL_TOTAL"
        - "prometheus.DCGM_FI_DEV_ECC_SBE_VOL_TOTAL"
        - "prometheus.DCGM_FI_DEV_SYNC_BOOST_VIOLATION"
        - "prometheus.DCGM_FI_DEV_THERMAL_VIOLATION"
        - "prometheus.DCGM_FI_DEV_LOW_UTIL_VIOLATION"
        - "prometheus.DCGM_FI_DEV_BOARD_LIMIT_VIOLATION"
        - "prometheus.DCGM_FI_DEV_POWER_VIOLATION"
        - "prometheus.DCGM_FI_DEV_RELIABILITY_VIOLATION"
        - "prometheus.DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION"
        - "prometheus.DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL"
      ignore_missing: true
      ignore_failure: true
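
As a rough illustration of what the pipeline above produces, the sketch below shows the renamed fields for a hypothetical event that carried a GPU temperature gauge, a total energy counter, and the `gpu`, `device`, and `Hostname` labels. Every value is a placeholder, and any leftover `prometheus.*` keys are not shown because their exact shape after the renames is not confirmed here.

```yaml
# Illustrative document after the renames; all values are placeholders.
gpu:
  temperature: 45          # from prometheus.DCGM_FI_DEV_GPU_TEMP.value
  energy:
    total: 123456          # from prometheus.DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION.counter
  device:
    id: "0"                # from prometheus.labels.gpu
    name: nvidia0          # from prometheus.labels.device
prometheus:
  node:
    hostname: gpu-node-1   # from prometheus.labels.Hostname
```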
12 changes: 12 additions & 0 deletions packages/nvidia_gpu/data_stream/stats/fields/base-fields.yml
@@ -0,0 +1,12 @@
- name: data_stream.type
  type: constant_keyword
  description: Data stream type.
- name: data_stream.dataset
  type: constant_keyword
  description: Data stream dataset.
- name: data_stream.namespace
  type: constant_keyword
  description: Data stream namespace.
- name: '@timestamp'
  type: date
  description: Event timestamp.