
[Security Solution][Siem migrations] Implement rate limit backoff #211469

Merged: 16 commits merged into elastic:main on Feb 21, 2025

Conversation

@semd (Contributor) commented Feb 17, 2025

Summary

Implements an exponential backoff retry strategy when the LLM API throws rate limit (429) errors.

Backoff implementation

  • The run method from the RuleMigrationsTaskClient has been moved to the new RuleMigrationTaskRunner class.
  • The settings for the backoff are defined in this class with:
```ts
/** Exponential backoff configuration to handle rate limit errors */
const RETRY_CONFIG = {
  initialRetryDelaySeconds: 1,
  backoffMultiplier: 2,
  maxRetries: 8,
  // max total waiting time 4m15s (1 + 2 + 4 + … + 128 = 255s)
} as const;
```
  • Only one rule is retried at a time; the rest of the concurrent rule translations blocked by the rate limit wait for the API to recover before attempting the translation again.
```ts
/** Executor sleep configuration
 * A sleep applied at the beginning of each single rule translation in the execution pool.
 * The objective of this sleep is to spread the load of concurrent translations and prevent hitting the rate limit repeatedly.
 * The sleep time applied is a random number between [0-value]. Every time we hit the rate limit, the value is increased by the multiplier, up to the limit.
 */
const EXECUTOR_SLEEP = {
  initialValueSeconds: 3,
  multiplier: 2,
  limitSeconds: 96, // 1m36s (5 increases)
} as const;
```
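As an illustration of the retry strategy described above, here is a minimal sketch of a backoff loop driven by a config shaped like `RETRY_CONFIG`. The `withBackoffRetry` wrapper and the `isRateLimitError` check are hypothetical names for illustration, not the actual Kibana implementation:

```ts
interface RetryConfig {
  initialRetryDelaySeconds: number;
  backoffMultiplier: number;
  maxRetries: number;
}

const sleepSeconds = (s: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, s * 1000));

// Hypothetical check: the real code inspects the error thrown by the LLM API
const isRateLimitError = (err: unknown): boolean =>
  err instanceof Error && err.message.includes('429');

/** Retries `task` with exponentially growing delays while it throws 429 errors */
async function withBackoffRetry<T>(
  task: () => Promise<T>,
  config: RetryConfig
): Promise<T> {
  let delaySeconds = config.initialRetryDelaySeconds;
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Give up on non-429 errors, or once the retry budget is exhausted
      if (!isRateLimitError(err) || attempt >= config.maxRetries) {
        throw err;
      }
      await sleepSeconds(delaySeconds);
      delaySeconds *= config.backoffMultiplier;
    }
  }
}
```

With the PR's settings (1s initial delay, multiplier 2, 8 retries) the delays grow 1, 2, 4, …, 128 seconds, for a worst-case total wait of 255s.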

Migration batching changes

```ts
/** Number of concurrent rule translations in the pool */
const TASK_CONCURRENCY = 10 as const;
/** Number of rules loaded in memory to be translated in the pool */
const TASK_BATCH_SIZE = 100 as const;
```
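The pool semantics can be illustrated with a minimal bounded-concurrency executor. `runPool` is a hypothetical helper sketching the behaviour (each worker claims the next unstarted task, so up to `concurrency` tasks are always in flight); the actual Kibana implementation may differ:

```ts
/** Runs `tasks` with at most `concurrency` promises in flight at a time */
async function runPool<T>(
  tasks: Array<() => Promise<T>>,
  concurrency: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  // Each worker repeatedly claims the next unstarted task until none remain,
  // so while work is available there are always `concurrency` tasks executing.
  const worker = async () => {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(concurrency, tasks.length) }, worker)
  );
  return results;
}
```

Unlike a `Promise.all` over fixed batches, a slow task here only occupies one worker slot instead of stalling the whole batch.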

Before

  • Batches of 15 rules were retrieved and executed in a Promise.all, requiring all of them to be completed before proceeding to the next batch.
  • A "batch sleep" of 10s was executed at the end of each iteration.

In this PR

  • Batches of 100 rules are retrieved and kept in memory. The execution is performed in a task pool with a concurrency of 10 rules. This ensures there are always 10 rules executing at a time.
  • The "batch sleep" has been removed in favour of an "execution sleep" of rand[1-3]s at the start of each single rule migration. This individual sleep serves two goals:
    • Spread the load when the migration is first launched.
    • Prevent hitting the rate limit consistently: The sleep duration is increased every time we hit a rate limit.

@semd added labels: release_note:skip (Skip the PR/issue when compiling release notes), v9.0.0, Team:Threat Hunting (Security Solution Threat Hunting Team), backport:version (Backport to applied version labels), v8.18.0, v9.1.0, v8.19.0 on Feb 17, 2025
@semd semd self-assigned this Feb 17, 2025
@semd semd marked this pull request as ready for review February 19, 2025 19:39
@semd semd requested a review from a team as a code owner February 19, 2025 19:39
@elasticmachine (Contributor): Pinging @elastic/security-threat-hunting (Team:Threat Hunting)

@P1llus (Member) left a comment:

From an initial review this LGTM. It's easier to test more thoroughly once it's merged, and we might stumble over something, but from what I can see it looks great!
I was mostly looking at the telemetry, graph execution, error detection, etc. Testing with a larger batch of rules can better be done once it's merged, considering there are also unit tests and manual tests performed by you.

Gj!

@logeekal (Contributor) left a comment:

Changes look great 🚀 . Just posted some small nits and questions.

@semd semd enabled auto-merge (squash) February 21, 2025 18:11
@semd semd merged commit 64426b2 into elastic:main Feb 21, 2025
9 checks passed
@kibanamachine (Contributor):

Starting backport for target branches: 8.18, 8.x, 9.0

https://github.com/elastic/kibana/actions/runs/13464210510

@elasticmachine (Contributor):

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

cc @semd

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Feb 21, 2025
…astic#211469)

(cherry picked from commit 64426b2)
@kibanamachine (Contributor):

💔 Some backports could not be created

- 8.18: backport created
- 8.x: backport failed because of merge conflicts
- 9.0: backport failed because of merge conflicts

You might need to backport the following PR to 8.x and 9.0 first:
- [Rule Migration] Telemetry improvements (#210275)

Note: Successful backport PRs will be merged automatically after passing CI.

Manual backport

To create the backport manually run:

`node scripts/backport --pr 211469`

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Feb 21, 2025
…off (#211469) (#212154)

# Backport

This will backport the following commits from `main` to `8.18`:
- [[Security Solution][Siem migrations] Implement rate limit backoff
(#211469)](#211469)

<!--- Backport version: 9.6.6 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sorenlouv/backport)


Co-authored-by: Sergi Massaneda <sergi.massaneda@elastic.co>
@semd (Contributor, Author) commented Feb 22, 2025

💚 All backports created successfully

- 9.0: backport created
- 8.x: backport created

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

semd added a commit to semd/kibana that referenced this pull request Feb 22, 2025
…astic#211469)

(cherry picked from commit 64426b2)
semd added a commit to semd/kibana that referenced this pull request Feb 22, 2025
…astic#211469)

(cherry picked from commit 64426b2)
semd added a commit that referenced this pull request Feb 24, 2025
…ff (#211469) (#212177)

# Backport

This will backport the following commits from `main` to `9.0`:
- [[Security Solution][Siem migrations] Implement rate limit backoff
(#211469)](#211469)

<!--- Backport version: 9.6.6 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sorenlouv/backport)


Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
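For readers who want to see the retry strategy from the summary in isolation, here is a minimal TypeScript sketch of exponential backoff on rate limit (`429`) errors. It is an illustration only, not the Kibana code: `withBackoff`, `sleep`, and `isRateLimitError` are hypothetical helpers; only the `RETRY_CONFIG` values mirror the ones quoted in the PR description.

```typescript
interface RetryConfig {
  initialRetryDelaySeconds: number;
  backoffMultiplier: number;
  maxRetries: number;
}

// Values taken from the PR description; max waiting time 1 * 2^8 = 256s total.
const RETRY_CONFIG: RetryConfig = {
  initialRetryDelaySeconds: 1,
  backoffMultiplier: 2,
  maxRetries: 8,
};

const sleep = (seconds: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, seconds * 1000));

// Hypothetical detector: treats any error mentioning 429 as a rate limit error.
const isRateLimitError = (err: unknown): boolean =>
  err instanceof Error && err.message.includes('429');

async function withBackoff<T>(
  task: () => Promise<T>,
  config: RetryConfig = RETRY_CONFIG
): Promise<T> {
  let delaySeconds = config.initialRetryDelaySeconds;
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Give up on non-rate-limit errors or once the retry budget is spent.
      if (!isRateLimitError(err) || attempt >= config.maxRetries) {
        throw err;
      }
      await sleep(delaySeconds);
      delaySeconds *= config.backoffMultiplier; // exponential growth
    }
  }
}
```

Each retry doubles the wait, so transient rate limiting is absorbed without hammering the API, while persistent failures still surface after `maxRetries` attempts.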
semd added a commit that referenced this pull request Feb 24, 2025
…ff (#211469) (#212178)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Security Solution][Siem migrations] Implement rate limit backoff
(#211469)](#211469)

<!--- Backport version: 9.6.6 -->

### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sorenlouv/backport)

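The batching change — replacing batch-wide `Promise.all` over 15 rules with a pool that keeps up to `TASK_CONCURRENCY` (10) translations in flight at once — can be illustrated with a generic worker pool. This is a sketch of the pattern only, not the `RuleMigrationTaskRunner` implementation; `runPool` is a hypothetical helper.

```typescript
// Generic concurrency pool: up to `concurrency` workers pull items from a
// shared cursor, so a single slow item no longer stalls the whole batch the
// way a batch-wide Promise.all does.
async function runPool<T, R>(
  items: T[],
  concurrency: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS lanes only yield at `await`
  const lanes = Array.from(
    { length: Math.min(concurrency, items.length) },
    async () => {
      while (next < items.length) {
        const index = next++;
        results[index] = await worker(items[index]);
      }
    }
  );
  await Promise.all(lanes);
  return results;
}
```

With this shape, a lane that finishes one rule immediately picks up the next, which is why the PR can guarantee "always 10 rules executing at a time" while the batch of 100 is in memory.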
Labels
backport:version Backport to applied version labels release_note:skip Skip the PR/issue when compiling release notes Team:Threat Hunting Security Solution Threat Hunting Team v8.18.0 v8.19.0 v9.0.0 v9.1.0