[NVIDIA GPU] Introduce Monitoring Integration #11931

Closed
wants to merge 5,693 commits into from

Conversation

strawgate
Contributor

@strawgate strawgate commented Nov 30, 2024

Proposed commit message

Introduce NVIDIA GPU Monitoring Integration

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

Author's Checklist

How to test this PR locally

Deploy the NVIDIA DCGM exporter on a device with an NVIDIA GPU to get a Prometheus metrics endpoint that you can provide to the integration.

If you have Docker, this only requires:

docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
curl localhost:9400/metrics

Configure the integration to point at the metrics endpoint on the host that has the GPU and runs the container, for example http://nvidiahost:9400/metrics

Some metrics are not enabled by default in the container; enabling all metrics requires some extra steps.
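
To sanity-check the endpoint before configuring the integration, a small script can fetch and parse the exporter output. The following is a minimal sketch, not part of the integration: the function names and the sample values are illustrative assumptions, and the parser is deliberately simplified (it assumes no commas inside label values and no trailing timestamps).

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples.

    Simplified: assumes label values contain no commas and that samples
    carry no trailing timestamp.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The sample value is the last space-separated token.
        series, _, value = line.rpartition(" ")
        name, _, rest = series.partition("{")
        labels = {}
        if rest:
            for pair in rest.rstrip("}").split(","):
                if not pair:
                    continue
                key, _, val = pair.partition("=")
                labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://localhost:9400/metrics"):
    """Fetch and parse the exporter endpoint (URL taken from the steps above)."""
    with urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative sample only; real output depends on the GPU and exporter config.
SAMPLE = """# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="Example GPU"} 37
"""
```

Calling fetch_metrics() against the running container should return one tuple per exported sample; parse_metrics(SAMPLE) shows the shape without needing a GPU.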

Related issues

Fixes #11930

Screenshots

WIP:
Screenshot 2024-11-30 at 3 35 33 PM
Screenshot 2024-11-30 at 3 35 44 PM
Screenshot 2024-11-30 at 3 35 56 PM
Screenshot 2024-11-30 at 3 36 03 PM

efd6 and others added 30 commits October 15, 2024 07:55
The documentation for the deprecation of fields indicates the following
correspondences:

old                                   new
is_synthetic_quarantine_disposition   pattern_disposition* to identify quarantined files
has_script_or_module_ioc              ioc_context
ioc_values                            ioc_value

However, there is no other information relating to how these correspond
with each other.

By inspection of documents from an alerts stream, we can see that
pattern_disposition_details contains a quarantine_file boolean. This,
with the text in the deprecation notice, hints that we can use this
field to get the is_synthetic_quarantine_disposition. The ioc_context
field contains an array of objects with a type property which in the
examples I have available include (only) "module", hinting that this can
be used to detect the state corresponding to has_script_or_module_ioc.
Finally, ioc_value fields are sprinkled around the documents, so collect them
into ioc_values.
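
The inference described above can be sketched in Python as follows. This is an illustration only, not the integration's ingest pipeline: the function name is hypothetical, the document layout is assumed, and treating "script" as a possible type value is an assumption (only "module" was observed).

```python
def derive_deprecated_fields(doc, ioc_types=("script", "module")):
    """Rebuild the three deprecated fields from their documented replacements."""
    # quarantine_file inside pattern_disposition_details stands in for
    # is_synthetic_quarantine_disposition.
    details = doc.get("pattern_disposition_details") or {}
    quarantined = bool(details.get("quarantine_file", False))

    # Any ioc_context entry with a matching type implies has_script_or_module_ioc.
    contexts = doc.get("ioc_context") or []
    has_ioc = any(
        isinstance(c, dict) and c.get("type") in ioc_types for c in contexts
    )

    # Collect every ioc_value string found anywhere in the document,
    # keeping first-seen order and dropping duplicates.
    values = []

    def collect(node):
        if isinstance(node, dict):
            for key, val in node.items():
                if key == "ioc_value" and isinstance(val, str):
                    if val not in values:
                        values.append(val)
                else:
                    collect(val)
        elif isinstance(node, list):
            for item in node:
                collect(item)

    collect(doc)
    return {
        "is_synthetic_quarantine_disposition": quarantined,
        "has_script_or_module_ioc": has_ioc,
        "ioc_values": values,
    }
```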

The test case is derived from the first case, but with deprecated fields
removed.
…ls to index (elastic#11372)

The event.action field is an implementation detail that has an unfortunate name
that could mislead users; the values held for entities do not relate to security
details, but only to internal accounting. So remove them.
…stic#11397)

* Release aws package with  (and add missing data)

* Update changelog PR link
…tic#11400)

* Bump github.com/elastic/elastic-package from 0.104.0 to 0.105.0

Bumps [github.com/elastic/elastic-package](https://github.com/elastic/elastic-package) from 0.104.0 to 0.105.0.
- [Release notes](https://github.com/elastic/elastic-package/releases)
- [Changelog](https://github.com/elastic/elastic-package/blob/main/.goreleaser.yml)
- [Commits](elastic/elastic-package@v0.104.0...v0.105.0)

---
updated-dependencies:
- dependency-name: github.com/elastic/elastic-package
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Remove white line in vsphere README

* Added deployment_mode and properties

* Added owner info in Elastic Connector

* review fixes

* remove duplicated changelog entries and update missed team handle

* [elastic_connectors] Update manifest version

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
Co-authored-by: Sean Rathier <sean.rathier@elastic.co>
Co-authored-by: Maxim Kholod <maxim.kholod@elastic.co>
Co-authored-by: Mario Rodriguez Molins <mario.rodriguez@elastic.co>
…line errors (elastic#11112)

Also fix instances of incorrect yaml for script processors.
…logo (elastic#11407)

* add a link to the onboarding flow, fix the package logo

Signed-off-by: Tetiana Kravchenko <tetiana.kravchenko@elastic.co>

* fix pr link

Signed-off-by: Tetiana Kravchenko <tetiana.kravchenko@elastic.co>

* dashboard: add filter to statefulset visualization

Signed-off-by: Tetiana Kravchenko <tetiana.kravchenko@elastic.co>

---------

Signed-off-by: Tetiana Kravchenko <tetiana.kravchenko@elastic.co>
* add extended space metrics

* update changelog

* update readme

* address review comments.

* address review comments

* update dashboards

---------

Co-authored-by: Niraj Rathod <niraj.rathod@crestdatasys.com>
* reverting session_data toggle

* updating PR changelog

* fixing change type

* reverting kibana changes
Remove the handlebars templating into the CEL code and celfmt.
* Fix AWS Bedrock documentation and dashboard issues
Bumps [github.com/cli/go-gh/v2](https://github.com/cli/go-gh) from 2.10.0 to 2.11.0.
- [Release notes](https://github.com/cli/go-gh/releases)
- [Commits](cli/go-gh@v2.10.0...v2.11.0)

---
updated-dependencies:
- dependency-name: github.com/cli/go-gh/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Move pricing info to the top of the table

* Update kinesis

* Update apigateway

* Update billing

* Add reference to pricing info

* Update aws package

* Update aws-bedrock package

* Update changelog for AWS package

* Update changelog for AWS_Bedrock package

* Fix broken link

* Update manifest

* Update packages/aws/manifest.yml

Co-authored-by: Ishleen Kaur <102962586+ishleenk17@users.noreply.github.com>

* Fix manifest version

* Update manifest after conflicts fix

---------

Co-authored-by: Ishleen Kaur <102962586+ishleenk17@users.noreply.github.com>
* update readme

* update manifest and changelog
…et handling (elastic#11422)

We cannot guarantee the shape of body.message and body.message_detail, so just
include the known type resp.Body which should be short in the case of a
non-200 response.

The next_offset value is documented to be an array of two elements which must
be used to construct the parameter by concatenation with a separating comma[1].

[1] https://duo.com/docs/adminapi#authentication-logs
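
As a rough illustration of the next_offset handling, the two-element array is joined with a comma to form the request parameter. The helper name below is hypothetical, and the real change is implemented in the integration's CEL program rather than in Python.

```python
def next_offset_param(next_offset):
    """Join the two-element next_offset array into a single query parameter."""
    if len(next_offset) != 2:
        raise ValueError("next_offset is expected to have exactly two elements")
    return ",".join(str(part) for part in next_offset)
```

For example, next_offset_param(["1532951895000", "af0ba235-0b33-23c8-bc23-a31aa0231de8"]) gives a single string with the two parts separated by a comma.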
…tic#11439)

Bumps [github.com/elastic/elastic-package](https://github.com/elastic/elastic-package) from 0.105.0 to 0.106.0.
- [Release notes](https://github.com/elastic/elastic-package/releases)
- [Changelog](https://github.com/elastic/elastic-package/blob/main/.goreleaser.yml)
- [Commits](elastic/elastic-package@v0.105.0...v0.106.0)

---
updated-dependencies:
- dependency-name: github.com/elastic/elastic-package
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…1435)

* Update package version

* Update package version and changelog
In elastic#11240 we attempted to fix preparation of endTime where it could be more than
24 hours after the startTime parameter, a case that is invalid for the API. To
do this, we just always added 24 hours to the startTime on the basis that the
API would crop at the present moment, and — notably — under the assumption that
the returned document would be used to construct the next startTime. Clearly
this is not the case, since we use the actual endTime parameter to construct the
next startTime; the API does not provide an HTTPJSON-easy way to get the last
timestamp. So set the endTime query parameter to the earliest of now and 24
hours after startTime.

Queries with an end time of now may not get all documents up to now. Since we
are using the query parameter to define the end of the range of documents that
have definitively been retrieved, this means that we will never try to get these
missed documents. We could use the maximum timestamp of the collected documents
to define this range, but that is less easy to do here, so just set the end of the
time range to be some short period before now. Ten seconds is chosen as a
conservative value (the report claimed milliseconds of documents were lost).
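
The resulting window calculation can be sketched as below. This is a hedged illustration in Python, not the HTTPJSON configuration itself; the names are hypothetical, and clamping the result to startTime when the window would otherwise be negative is an added assumption.

```python
from datetime import datetime, timedelta, timezone

MAX_WINDOW = timedelta(hours=24)       # the API rejects ranges longer than this
SAFETY_MARGIN = timedelta(seconds=10)  # stay slightly behind "now"


def end_time(start, now):
    """Return the earliest of start + 24h and now - 10s, never before start."""
    candidate = min(start + MAX_WINDOW, now - SAFETY_MARGIN)
    # Assumption: avoid an end time earlier than the start time.
    return max(start, candidate)


start = datetime(2024, 10, 1, tzinfo=timezone.utc)
catching_up = end_time(start, start + timedelta(days=2))   # capped at 24 hours
near_now = end_time(start, start + timedelta(hours=1))     # 10 seconds before now
```

Because the returned value is also what the next startTime is built from, keeping it behind the present moment means documents that arrive late are still inside a future window.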
The Duo Admin API has rate limiting. It doesn't return rate limit
headers, but it does enforce limits with HTTP 429 responses.

For some endpoints, the API documentation specifies "a rate limit of 50
calls per minute". This has also been observed on the authentication
logs endpoint.

This change sets a limit of 0.5 calls/second or 30 calls per minute
for all data streams and inputs.

HTTP 429 responses continue to be treated as errors.

API documentation: https://duo.com/docs/adminapi
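
To make the arithmetic concrete: 0.5 calls/second means at least 2 seconds between requests, which is 30 calls per minute and stays under the documented 50. The class below is a minimal client-side sketch of that spacing, assuming a single caller; it is not how the integration's inputs implement their limit, and the class and parameter names are hypothetical.

```python
import time


class RateLimiter:
    """Space calls at least 1 / calls_per_second seconds apart."""

    def __init__(self, calls_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / calls_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # no call has been made yet

    def wait(self):
        """Block until the next call is allowed; return the delay applied."""
        now = self.clock()
        delay = 0.0
        if self.next_allowed is not None and now < self.next_allowed:
            delay = self.next_allowed - now
            self.sleep(delay)
            now = self.next_allowed
        self.next_allowed = now + self.interval
        return delay
```

A caller would create RateLimiter(0.5) once and call wait() before each request. The clock and sleep parameters exist only so the behaviour can be checked without real waiting.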

---------

Co-authored-by: Dan Kortschak <dan.kortschak@elastic.co>
…>=8.16.0 (elastic#11413)

* make Asset Inventory compatible going forward

* update manifest and changelog

* Update changelog.yml

* Update manifest.yml
* Bump up version

* update changelog

* update manifest version
…ic#11437)

* lowercase host.name for cloud_security_posture

* add PR link to changelog
In the auth CEL program, the error `type conversion error from 'string'
to 'int'` seems to have been happening because of
`cursor.last_published` being set to a value of the form
`1532951895000,af0ba235-0b33-23c8-bc23-a31aa0231de8`, which can't be
parsed as an int.

Now `cursor.last_published` is replaced with `cursor.last_timestamp_ms`,
which is taken from the last result, and so will be available if the
last page of a sequence has results but no value in
`response.metadata.next_offset`.

Also, the date is no longer shared across requests, the request
building is simplified, and redundant overrides of state are removed.

Related documentation: https://duo.com/docs/adminapi#authentication-logs
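
The failure mode and the fix can be illustrated outside CEL with a short Python sketch. The helper names and the "timestamp_ms" key are hypothetical stand-ins; the real program reads the timestamp from the fields of the last API result.

```python
def parses_as_int(value):
    """Report whether a cursor value can be converted to an integer."""
    try:
        int(value)
    except ValueError:
        return False
    return True


def last_timestamp_ms(results, previous=None):
    """Take the cursor from the last result, keeping the old cursor if empty."""
    if not results:
        return previous
    # "timestamp_ms" is an assumed field name for illustration.
    return int(results[-1]["timestamp_ms"])


# The combined "timestamp,uuid" form is what broke the int conversion.
bad_cursor = "1532951895000,af0ba235-0b33-23c8-bc23-a31aa0231de8"
cursor_is_usable = parses_as_int(bad_cursor)
```

Deriving the cursor from the results rather than from next_offset means a value is available even on the final page, where next_offset is absent.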
SimonKoetting and others added 8 commits November 29, 2024 07:03
* Add required permissions for AWS custom logs

* Update changelog and manifest
…stic#11922)

Made with ❤️️ by updatecli

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…stic#11918)

* Added support for enterprise audit logs in the audit log data stream.
* Fix broken link to sql module example

* Update changelog and manifest
* Replace 8.15 with 8.13

* Replace 8.15 with 8.13

* Update changelog and manifest
The HTTP Headers field (`panw.panos.http_headers`) of the incoming data
is incorrectly escaped. This will be fixed if necessary before CSV
parsing.

Map the file name value in the URL/Filename (`panw.panos.misc`) field
for the `wildfire` and `wildfire-virus` sub-types.
@strawgate strawgate added enhancement New feature or request New Integration Issue or pull request for creating a new integration package. labels Nov 30, 2024
@botelastic

botelastic bot commented Jan 1, 2025

Hi! We just realized that we haven't looked into this PR in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Jan 1, 2025
@botelastic

botelastic bot commented Jan 31, 2025

Hi! This PR has been stale for a while and we're going to close it as part of our cleanup procedure. We appreciate your contribution and would like to apologize if we have not been able to review it, due to the current heavy load of the team. Feel free to re-open this PR if you think it should stay open and is worth rebasing. Thank you for your contribution!

@botelastic botelastic bot closed this Jan 31, 2025
@strawgate strawgate reopened this Feb 4, 2025
@botelastic botelastic bot removed the Stalled label Feb 4, 2025
@elasticmachine

💔 Build Failed

Failed CI Steps

History

@strawgate strawgate closed this Feb 4, 2025
@strawgate strawgate deleted the nvidia_gpu branch February 7, 2025 15:25
@strawgate
Contributor Author

Replaced by #12768

Labels
enhancement New feature or request Integration:abnormal_security Abnormal Security Integration:1password 1Password New Integration Issue or pull request for creating a new integration package.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Nvidia GPU] New Integration for Nvidia GPU Monitoring