Performance analysis of cascade operations #660

3 tasks done
Tracked by #27434
AlexRuiz7 opened this issue Jan 29, 2025 · 7 comments
level/task Task issue mvp Minimum Viable Product type/research Research issue



AlexRuiz7 commented Jan 29, 2025


As a preliminary step towards migrating Wazuh's RBAC from the Server to the Indexer, we need to understand how the Indexer performs on cascade operations that involve changing agents' groups.

A single agent can generate hundreds to thousands of events that end up in indices. These documents (events) are tied to a single agent, forming a one-to-many relationship: an agent owns many documents, but a document in an index can belong to only one agent. To represent this relationship in the indices, every document contains the agent's ID as a primary key that allows these entities to be correlated. Every document also has the field agent.groups to:

  • allow groups-based queries over the documents.
  • allow filtering the document based on RBAC (roles that can read specific groups of agents only).

The main drawback of this design is that when any agent changes its groups, all the data belonging to that agent until that moment needs to be updated with the new groups of the agent.

To better understand the problem, let's imagine an environment with 50K agents and 20K documents per agent.

n_docs = 50,000 * 20,000 = 1,000,000,000 documents


  • For Windows agents: 50K * 25K
  • For Linux and macOS: 50K * 15k

Over a month, such an environment would have 1,000 million documents. In a hypothetical, but possible, update of every agent's groups, the Indexer would need to perform 1,000 million update operations as a result.
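The estimate above can be checked with a few lines of Python. This is only a sketch: it assumes the per-agent figures are 25K documents for Windows and 15K for Linux and macOS, and it combines them with the OS distribution listed under "Environment details". The variable and function names are illustrative.

```python
# Figures taken from the issue description; names are illustrative.
AGENTS = 50_000
OS_SHARE = {"windows": 0.50, "macos": 0.15, "linux": 0.35}
DOCS_PER_AGENT = {"windows": 25_000, "macos": 15_000, "linux": 15_000}


def documents_per_os(agents=AGENTS):
    """Return the estimated number of documents per OS family."""
    return {
        os_name: round(agents * share) * DOCS_PER_AGENT[os_name]
        for os_name, share in OS_SHARE.items()
    }


per_os = documents_per_os()
total = sum(per_os.values())
for os_name, count in per_os.items():
    print(f"{os_name}: {count:,} documents")
print(f"total: {total:,} documents")
```

With these assumptions the per-OS counts add up to the 1,000 million documents used as the worst case, since the 50/50 split between 25K and 15K averages to 20K documents per agent.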

Environment details

  • Composed of 50K agents.
  • Agents distribution by OS:
    • 50% Windows
    • 15% macOS
    • 35% Linux
  • Number of groups per agent: 128


  • Simulate the environment.
  • Simulate an update on every agent's group (worst case scenario).
  • Measure performance of the Indexer during the update of the documents.
@AlexRuiz7 AlexRuiz7 added level/task Task issue type/research Research issue labels Jan 29, 2025
@AlexRuiz7 AlexRuiz7 changed the title Performance analysis for cascade operations on data heavy environments Performance analysis of cascade operations Jan 29, 2025
@wazuhci wazuhci moved this to Backlog in XDR+SIEM/Release 5.0.0 Jan 29, 2025
@AlexRuiz7 AlexRuiz7 added the mvp Minimum Viable Product label Jan 29, 2025
@wazuhci wazuhci moved this from Backlog to In progress in XDR+SIEM/Release 5.0.0 Jan 30, 2025

QU3B1M commented Jan 30, 2025

For the environment configuration, I'm using a modified version of the agents index event generator utility (ecs/agent/event-generator/ to ensure the OS and groups distribution.

Group generation is capped at 500 groups, from which each agent is assigned 128 distinct groups.

# Generate 500 unique group names
unique_groups = [f'group{i}' for i in range(500)]

The OS distribution is calculated from the number of events to generate; in this case we will use 50K events.

def generate_random_data(number):
    data = []
    num_windows = int(0.5 * number)
    num_macos = int(0.15 * number)
    num_linux = number - num_windows - num_macos

Some other modifications were made to meet these requirements; the complete script is shared below.

event generator script

import datetime
import json
import logging
import random
import requests
import urllib3

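# Constants and Configuration
LOG_FILE = 'generate_data.log'
GENERATED_DATA_FILE = 'generatedData.json'
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

# Default values
INDEX_NAME = "wazuh-agents"
# NOTE: the DEFAULT_* definitions were lost from this copy of the script.
# The credentials and port match the curl calls in this thread; the IP is an
# assumed placeholder.
DEFAULT_IP = "127.0.0.1"
DEFAULT_PORT = "9200"
DEFAULT_USERNAME = "admin"
DEFAULT_PASSWORD = "admin"

# Configure logging
logger = logging.getLogger("DataGenerator")
logger.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')

file_handler = logging.FileHandler(LOG_FILE)
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)

# Suppress warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Generate 500 unique group names
unique_groups = [f'group{i}' for i in range(500)]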

def generate_random_date():
    """Generate a random date within the last 10 days."""
    start_date = datetime.datetime.now()
    end_date = start_date - datetime.timedelta(days=10)
    random_date = start_date + (end_date - start_date) * random.random()
    return random_date.strftime(DATE_FORMAT)

def generate_random_groups():
    """Return a list of randomly sampled groups."""
    return random.sample(unique_groups, 128)

def generate_random_host(agent_type):
    """Generate a random host configuration."""
    os_families = {
        'linux': ['debian', 'ubuntu', 'centos', 'redhat'],
        'windows': ['windows'],
        'macos': ['macos', 'ios']
    }
    family = random.choice(os_families[agent_type])
    version = f'{random.randint(0, 99)}.{random.randint(0, 99)}'

    return {
        'architecture': random.choice(['x86_64', 'arm64']),
        'boot': {'id': f'boot{random.randint(0, 9999)}'},
        'cpu': {'usage': random.uniform(0, 100)},
        'disk': {
            'read': {'bytes': random.randint(0, 1_000_000)},
            'write': {'bytes': random.randint(0, 1_000_000)}
        },
        'domain': f'domain{random.randint(0, 999)}',
        'geo': {
            'city_name': random.choice(['San Francisco', 'New York', 'Berlin', 'Tokyo']),
            'continent_code': random.choice(['NA', 'EU', 'AS']),
            'continent_name': random.choice(['North America', 'Europe', 'Asia']),
            'country_iso_code': random.choice(['US', 'DE', 'JP']),
            'country_name': random.choice(['United States', 'Germany', 'Japan']),
            'location': {
                'lat': round(random.uniform(-90.0, 90.0), 6),
                'lon': round(random.uniform(-180.0, 180.0), 6)
            },
            'postal_code': f'{random.randint(10000, 99999)}',
            'region_name': f'Region {random.randint(0, 999)}',
            'timezone': random.choice(['PST', 'EST', 'CET', 'JST'])
        },
        'hostname': f'host{random.randint(0, 9999)}',
        'id': f'hostid{random.randint(0, 9999)}',
        'ip': ".".join(str(random.randint(1, 255)) for _ in range(4)),
        'mac': ":".join(f'{random.randint(0, 255):02x}' for _ in range(6)),
        'os': {
            'family': family,
            'full': f'{family} {version}',
            'kernel': f'kernel{random.randint(0, 999)}',
            'name': family,
            'platform': agent_type,
            'type': agent_type,
            'version': version
        },
        'uptime': random.randint(0, 1_000_000)
    }

def generate_random_agent(agent_type):
    """Generate a random agent configuration."""
    agent_id = random.randint(0, 99999)
    return {
        'id': f'agent{agent_id}',
        'name': f'Agent{agent_id}',
        'type': agent_type,
        'version': f'v{random.randint(0, 9)}-stable',
        'status': random.choice(['active', 'inactive']),
        'last_login': generate_random_date(),
        'groups': generate_random_groups(),
        'key': f'key{agent_id}',
        'host': generate_random_host(agent_type)
    }

def generate_random_data(number):
    """Generate a list of random agent events."""
    data = []
    num_windows = int(0.5 * number)
    num_macos = int(0.15 * number)
    num_linux = number - num_windows - num_macos

    for _ in range(num_windows):
        data.append({'agent': generate_random_agent('windows')})
    for _ in range(num_macos):
        data.append({'agent': generate_random_agent('macos')})
    for _ in range(num_linux):
        data.append({'agent': generate_random_agent('linux')})

    return data

def inject_events(cluster_url, username, password, data):
    """Send generated data to the indexer."""
    url = f'{cluster_url}/{INDEX_NAME}/_doc'
    session = requests.Session()
    session.auth = (username, password)
    session.verify = False
    headers = {'Content-Type': 'application/json'}

    try:
        for event_data in data:
            response = session.post(url, json=event_data, headers=headers)
            if response.status_code != 201:
                logger.error(f'Failed to inject event. Status Code: {response.status_code}')
                break
        else:
            # Runs only when the loop finished without hitting `break`.
            logger.info('Data injection completed successfully.')
    except requests.RequestException as e:
        logger.error(f'Error during data injection: {e}')

def save_generated_data(data):
    """Save generated data to a file."""
    try:
        with open(GENERATED_DATA_FILE, 'w') as outfile:
            json.dump(data, outfile, indent=2)
        logger.info("Generated data saved successfully.")
    except IOError as e:
        logger.error(f"Error saving data to file: {e}")

def get_user_input(prompt, default):
    """Get user input with a default fallback."""
    return input(f"{prompt} (default: '{default}'): ") or default

def main():
    """Main function to generate and inject data."""
    try:
        number = int(input("How many events do you want to generate? "))
    except ValueError:
        logger.error("Invalid input. Please enter a valid number.")
        return

    ip = get_user_input("Enter the IP of your Indexer", DEFAULT_IP)
    port = get_user_input("Enter the port of your Indexer", DEFAULT_PORT)
    username = get_user_input("Username", DEFAULT_USERNAME)
    password = get_user_input("Password", DEFAULT_PASSWORD)

    logger.info(f"Generating {number} events...")
    cluster_url = f"http://{ip}:{port}"

    data = generate_random_data(number)
    inject_events(cluster_url, username, password, data)

if __name__ == "__main__":
    main()

To generate the documents, we can use the commands endpoint and post a number of commands for each agent; these commands will be indexed as documents with the agents mapping on the wazuh-commands index. We can then use these events to test the performance.

The documents are generated with a modified version of the states-inventory-packages script.

Script to use for the events generation
import datetime
import json
import uuid
import requests
import logging

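# Configuration
INDEX_AGENTS = "wazuh-agents"
INDEX_DOCUMENT = "wazuh-states-inventory-packages/_bulk"
# NOTE: the DEFAULT_* definitions were lost from this copy of the script.
# The credentials and port match the curl calls in this thread; the IP is an
# assumed placeholder.
DEFAULT_IP = "127.0.0.1"
DEFAULT_PORT = "9200"
DEFAULT_USERNAME = "admin"
DEFAULT_PASSWORD = "admin"

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.FileHandler(
        "agent_documents.log"), logging.StreamHandler()]
)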

def get_agents_ids(cluster_url, user, password) -> list:
    """Fetch agent IDs from the Wazuh index."""
    agents_url = f"{cluster_url}/{INDEX_AGENTS}/_search"
    try:
        response = requests.get(agents_url, auth=(user, password))
        agents_data = response.json()
        agents = [hit['_source']['agent'] for hit in agents_data.get('hits', {}).get('hits', [])]
        logging.info(f"Retrieved {len(agents)} agents.")
        return agents
    except requests.exceptions.RequestException as e:
        logging.error(f"Error fetching agents: {e}")
        return []

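def prepare_bulk_payload(documents):
    """Serialize the documents as newline-delimited JSON for the _bulk API."""
    payload = ""
    for doc in documents:
        # The target index comes from the request URL (INDEX_DOCUMENT already
        # ends in /_bulk), so only a unique _id is set here. Reusing agent.id
        # as _id would make each agent's documents overwrite one another.
        metadata = {"index": {"_id": str(uuid.uuid4())}}
        payload += json.dumps(metadata) + "\n"
        payload += json.dumps(doc) + "\n"
    return payload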

def send_documents(cluster_url, user, password, agents, num_documents):
    """Send restart documents to a list of agent IDs."""
    documents_url = f"{cluster_url}/{INDEX_DOCUMENT}"
    documents = [
        {
            "agent": {
                "id": agent.get('id'),
                "name": agent.get('name'),
                "groups": agent.get('groups'),
                "type": agent.get('type'),
                "version": agent.get('version'),
                "host": {
                    "architecture": agent.get('host', {}).get('architecture'),
                    "hostname": agent.get('host', {}).get('hostname'),
                    "ip": agent.get('host', {}).get('ip'),
                    "os": {
                        "name": agent.get('host', {}).get('os', {}).get('name'),
                        "type": agent.get('host', {}).get('os', {}).get('type'),
                        "version": agent.get('host', {}).get('os', {}).get('version')
                    }
                }
            },
            "package": {
                "architecture": agent.get('host', {}).get('architecture'),
                "description": (
                    "tmux is a \"terminal multiplexer.\"  It enables a number of terminals (or "
                    "windows) to be accessed and controlled from a single terminal.  tmux is "
                    "intended to be a simple, modern, BSD-licensed alternative to programs such "
                    "as GNU Screen."
                ),
                "installed": "1738151465",
                "name": "tmux",
                "path": " ",
                "size": 1166902,
                "type": "rpm",
                "version": "3.2a-5.el9"
            }
        }
        for agent in agents
        for _ in range(num_documents)
    ]

    if not documents:
        logging.warning("No documents generated to send.")
        return documents

    headers = {'Content-Type': 'application/json'}
    try:
        # Reconstructed call: the original arguments were lost in this copy.
        response = requests.post(documents_url,
                                 data=prepare_bulk_payload(documents),
                                 headers=headers,
                                 auth=(user, password))
        response.raise_for_status()
        logging.info(f"Successfully sent {len(documents)} documents.")
    except requests.exceptions.RequestException as e:
        logging.error(f"Error sending documents: {e}")
        # `response` may be unbound if the request itself failed.
        if e.response is not None:
            logging.error("response: %s", e.response.text)
    return documents

def main():
    """Main function to retrieve agents and send documents."""
    ip = input(f"Enter the IP of your Indexer (default: '{DEFAULT_IP}'): ") or DEFAULT_IP
    port = input(f"Enter the port of your Indexer (default: '{DEFAULT_PORT}'): ") or DEFAULT_PORT
    username = input(f"Enter username (default: '{DEFAULT_USERNAME}'): ") or DEFAULT_USERNAME
    password = input(f"Enter password (default: '{DEFAULT_PASSWORD}'): ") or DEFAULT_PASSWORD

    cluster_url = f"http://{ip}:{port}"

    try:
        num_documents = int(
            input("Enter the number of documents to generate for each agent: "))
    except ValueError:
        logging.error("Invalid input. Please enter a valid number.")
        return

    agent_ids = get_agents_ids(cluster_url, username, password)
    if agent_ids:
        send_documents(cluster_url, username, password, agent_ids, num_documents)
    else:
        logging.warning("No agents found to send documents to.")

if __name__ == "__main__":
    main()


QU3B1M commented Feb 3, 2025

I'm currently facing issues when running the update_by_query action: it raises a "no handler found" error on any index. This seems to be caused by some configuration (research in progress).

% curl -XPOST
{"error":"no handler found for uri [/wazuh-agents/_update_by_query] and method [POST]"}

Member Author

@QU3B1M the request needs a body

POST test-index1/_update_by_query
{
  "query": {
    "term": {
      "oldValue": 10
    }
  },
  "script" : {
    "source": "ctx._source.oldValue += params.newValue",
    "lang": "painless",
    "params" : {
      "newValue" : 20
    }
  }
}


QU3B1M commented Feb 5, 2025

Due to the large number of documents required, I've tried different approaches to generate that amount of data without crashing the system. The approach currently in use is to index the documents in batches, although it takes some extra time.
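As an illustration of the batched approach, here is a minimal sketch that keeps only one batch in memory at a time and serializes it for the _bulk API. It is not the code used for the tests: `iter_batches`, `to_bulk_body` and `fake_documents` are hypothetical helpers, and the empty `{"index": {}}` action line assumes the request is sent to an index-specific `/<index>/_bulk` URL.

```python
import json
from itertools import islice


def iter_batches(documents, batch_size):
    """Yield lists of at most `batch_size` documents from any iterable."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def to_bulk_body(batch):
    """Serialize one batch as newline-delimited JSON for the _bulk API."""
    lines = []
    for doc in batch:
        # Empty action metadata: index taken from the URL, ID auto-generated.
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a _bulk body must end with a newline


def fake_documents(count):
    """Hypothetical lazy generator standing in for the real event generator."""
    for i in range(count):
        yield {"agent": {"id": f"agent{i % 3}"}, "package": {"name": "tmux"}}


bodies = [to_bulk_body(batch) for batch in iter_batches(fake_documents(10), 4)]
print(f"{len(bodies)} bulk requests prepared")
```

Because the documents come from a generator, the full data set is never held in memory; each body could be posted and discarded before the next batch is built.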

Generating 300,000 documents for each agent (10,000 * 30) is problematic for the testing environment: if we generate the data locally and push it (even in smaller amounts, multiple times), it takes up too much space, and if we generate the data and index it in batches, it takes too long (the last execution ran for 24 hours and had reached only 30% of the total events).

So, the number of generated events was reduced, and for the generation we are using this code that lets us generate events in parallel.

@wazuhci wazuhci moved this from In progress to Blocked in XDR+SIEM/Release 5.0.0 Feb 11, 2025
@wazuhci wazuhci moved this from Blocked to In progress in XDR+SIEM/Release 5.0.0 Feb 12, 2025

QU3B1M commented Feb 12, 2025

Requested a cloud environment to perform the tests; currently working on executing the performance tests.
It was not possible to run the tests locally because of the large amount of disk space taken by the generated documents and the resources consumed on the system.


QU3B1M commented Feb 13, 2025

Test results on cloud environment

First execution with 3,000 documents per agent

Indexer documents setup

  • Agents documents: 50,000
  • States Packages documents: 150,000,000

Status of the wazuh-states-inventory-packages index pre-update

% curl  "" -u admin:admin -k 
green open wazuh-states-inventory-packages 94segLMyQRKBRs521pSW9Q 1 0 150000000 0 71.8gb 71.8gb

Documents update

Each group was updated with a variation of the following request.

% curl -X POST "https://<CLUSTER_IP>:9200/wazuh-states-inventory-packages/_update_by_query?conflicts=proceed&slices=auto&wait_for_active_shards=all" -u admin:admin -k -H "Content-Type: application/json" -d '{
    "profile": true,
    "timeout": "60m",
    "query": {
        "match": {
            "agent.groups": "XXX"
        }
    },
    "script": {
        "source": "ctx._source.agent.groups = params.newValue",
        "lang": "painless",
        "params": {
            "newValue": "new_XXX"
        }
    }
}'
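The request above can also be assembled programmatically, which helps when one variation per group is needed. The sketch below only builds the URL and body; `build_update_by_query` is a hypothetical helper, and the `profile` and `timeout` body fields from the curl call are left out for brevity.

```python
import json
from urllib.parse import urlencode

INDEX = "wazuh-states-inventory-packages"


def build_update_by_query(cluster_url, old_group, new_group):
    """Return (url, body) mirroring the update-by-query request shown above."""
    params = urlencode({
        "conflicts": "proceed",
        "slices": "auto",
        "wait_for_active_shards": "all",
    })
    url = f"{cluster_url}/{INDEX}/_update_by_query?{params}"
    body = {
        "query": {"match": {"agent.groups": old_group}},
        "script": {
            # Assigns a single string to agent.groups, as in the original request.
            "source": "ctx._source.agent.groups = params.newValue",
            "lang": "painless",
            "params": {"newValue": new_group},
        },
    }
    return url, body


url, body = build_update_by_query("https://<CLUSTER_IP>:9200", "macos", "new_macos")
print(url)
print(json.dumps(body, indent=2))
```

The pair can then be sent with `requests.post(url, json=body, auth=(user, password), verify=False)`. Note that, as written, the script replaces the whole agent.groups value with one string rather than swapping a single entry of the list.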

All the updates were executed at the same time; the Windows and Linux updates ran for more than 12 hours without finishing.

Windows documents update

  • System crashed before completion

Linux documents update

  • System crashed before completion

macOS documents update

  • Time taken: 7.49 hours (26,959,128 ms)

  • Updated documents: 22,500,000

  • Errors: 0

  • Complete request
    % curl -X POST "" -u admin:admin -k -H "Content-Type: application/json" -d '{"profile": true,"timeout": "60m","query": {"match": {"agent.groups": "macos"}},"script": {"source": "ctx._source.agent.groups = params.newValue","lang": "painless","params": {"newValue": "new_macos"}}}'
    {
        "took": 26959128,
        "timed_out": false,
        "total": 22500000,
        "updated": 22500000,
        "deleted": 0,
        "batches": 22500,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
            "bulk": 0,
            "search": 0
        },
        "throttled_millis": 0,
        "requests_per_second": -1.0,
        "throttled_until_millis": 0,
        "failures": []
    }

Cluster status during the update_by_query execution

The size of the wazuh-states-inventory-packages index kept growing after the update started.

  • Initial status
    curl  "" -u admin:admin -k 
    green open wazuh-states-inventory-packages 94segLMyQRKBRs521pSW9Q 1 0 150000000 0 71.8gb 71.8gb
  • A few hours after starting the execution
    curl  "" -u admin:admin -k 
    green open wazuh-states-inventory-packages 94segLMyQRKBRs521pSW9Q 1 0 150000000 17141069 88gb 88gb
  • The day after the execution starts
    curl  "" -u admin:admin -k 
    green open wazuh-states-inventory-packages 94segLMyQRKBRs521pSW9Q 1 0 150000000 28311008 132.3gb 132.3gb
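The growth is visible in the docs.deleted and store.size columns. In Lucene-based engines an update writes a new version of the document and only marks the old one as deleted, so the space is reclaimed later when segments are merged. The following sketch parses the three snapshots above, assuming the default `_cat/indices` column order; the helper names are illustrative.

```python
from collections import namedtuple

CatIndex = namedtuple(
    "CatIndex",
    "health status index uuid pri rep docs_count docs_deleted store_size pri_store_size",
)


def size_in_gb(text):
    """Convert a size such as '71.8gb' to a float (only the 'gb' suffix is handled)."""
    if not text.endswith("gb"):
        raise ValueError(f"unsupported size unit: {text}")
    return float(text[:-2])


def parse_cat_indices_line(line):
    """Parse one line of default `_cat/indices` output into a CatIndex."""
    f = line.split()
    return CatIndex(f[0], f[1], f[2], f[3], int(f[4]), int(f[5]),
                    int(f[6]), int(f[7]), size_in_gb(f[8]), size_in_gb(f[9]))


SNAPSHOTS = [
    "green open wazuh-states-inventory-packages 94segLMyQRKBRs521pSW9Q 1 0 150000000 0 71.8gb 71.8gb",
    "green open wazuh-states-inventory-packages 94segLMyQRKBRs521pSW9Q 1 0 150000000 17141069 88gb 88gb",
    "green open wazuh-states-inventory-packages 94segLMyQRKBRs521pSW9Q 1 0 150000000 28311008 132.3gb 132.3gb",
]

rows = [parse_cat_indices_line(line) for line in SNAPSHOTS]
baseline = rows[0]
for row in rows:
    growth = row.store_size / baseline.store_size - 1
    print(f"docs={row.docs_count:,} deleted={row.docs_deleted:,} "
          f"size={row.store_size}gb growth={growth:+.0%}")
```

The document count stays at 150,000,000 throughout, while the deleted count and the on-disk size keep rising, which matches the storage pressure reported for this run.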


The number of documents was reduced by more than 10x compared with the previous execution to avoid reproducing the system crashes experienced before.


3 Wazuh Indexer 5.0 nodes running on AWS EC2 instances with identical specs:

  • OS: Ubuntu 22.04
  • CPU: 8x
  • RAM: 16GB
  • Storage: 230GB

Indexer documents setup

  • Agents documents: 50,000
  • Packages per agent: 200
  • States Packages documents: 10,000,000


% curl -k -u admin:admin
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role node.roles                                        cluster_manager name
                       37          75   0    0.00    0.00     0.00 dimr      cluster_manager,data,ingest,remote_cluster_client -               node-3
                       21          75   0    0.00    0.02     0.01 dimr      cluster_manager,data,ingest,remote_cluster_client *               node-2
                       52          73  21    0.30    1.17     1.77 dimr      cluster_manager,data,ingest,remote_cluster_client -               node-1

Packages index before _update_by_query

% curl -k -u admin:admin
green open wazuh-states-inventory-packages bM3m6eeGRJmXhiND6H2F4w 1 0 10000000 0 5.3gb 5.3gb

Documents update

To update the documents' groups, we used the Update by query API with this simple body, which matches the documents by group and updates them with the new value.

{
    "profile": true,
    "query": {
        "match": {
            "agent.groups": "[windows|linux|macos]"
        }
    },
    "script": {
        "source": "ctx._source.agent.groups = params.newValue",
        "lang": "painless",
        "params": {
            "newValue": "[new_group]"
        }
    }
}


  • Windows
    • Total documents: 5,000,000

    • Updated documents: 5,000,000

    • Time taken: 25.3 Minutes

    • Errors: None

    • Complete request
      % curl -XPOST -k -u admin:admin "" -H 'Content-Type: application/json' -d '{"profile": true,"query": {"match": {"agent.groups": "windows"}},"script" : {"source": "ctx._source.agent.groups = params.newValue","lang": "painless","params": {"newValue": "new_windows"}}}'
      {
          "took": 1518549,
          "timed_out": false,
          "total": 5000000,
          "updated": 5000000,
          "deleted": 0,
          "batches": 5000,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
              "bulk": 0,
              "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1.0,
          "throttled_until_millis": 0,
          "failures": []
      }
  • Linux
    • Total documents: 3,500,000

    • Updated documents: 3,500,000

    • Time taken: 14.3 Minutes

    • Errors: None

    • Complete request
      % curl -XPOST -k -u admin:admin "" -H 'Content-Type: application/json' -d '{"profile": true,"query": {"match": {"agent.groups": "linux"}},"script" : {"source": "ctx._source.agent.groups = params.newValue","lang": "painless","params": {"newValue": "new_linux"}}}'
      {
          "took": 860136,
          "timed_out": false,
          "total": 3500000,
          "updated": 3500000,
          "deleted": 0,
          "batches": 3500,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
              "bulk": 0,
              "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1.0,
          "throttled_until_millis": 0,
          "failures": []
      }
  • macOS
    • Total documents: 1,500,000

    • Updated documents: 1,500,000

    • Time taken: 4.5 Minutes

    • Errors: None

    • Complete request
      % curl -XPOST -k -u admin:admin "" -H 'Content-Type: application/json' -d '{"profile": true,"query": {"match": {"agent.groups": "macos"}},"script" : {"source": "ctx._source.agent.groups = params.newValue","lang": "painless","params": {"newValue": "new_macos"}}}'
      {
          "took": 269954,
          "timed_out": false,
          "total": 1500000,
          "updated": 1500000,
          "deleted": 0,
          "batches": 1500,
          "version_conflicts": 0,
          "noops": 0,
          "retries": {
              "bulk": 0,
              "search": 0
          },
          "throttled_millis": 0,
          "requests_per_second": -1.0,
          "throttled_until_millis": 0,
          "failures": []
      }
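To compare the runs, the `took` (milliseconds) and `updated` fields of each response can be turned into a throughput figure. The sketch below also extrapolates to the 1,000 million documents of the worst case; that projection assumes throughput stays constant as the index grows, which is an optimistic simplification and not something the tests measured.

```python
# `took` (ms) and `updated` values copied from the responses in this thread.
RUNS = {
    "first run, macOS": {"took_ms": 26_959_128, "updated": 22_500_000},
    "second run, Windows": {"took_ms": 1_518_549, "updated": 5_000_000},
    "second run, Linux": {"took_ms": 860_136, "updated": 3_500_000},
    "second run, macOS": {"took_ms": 269_954, "updated": 1_500_000},
}

WORST_CASE_DOCS = 1_000_000_000


def docs_per_second(took_ms, updated):
    """Average number of documents updated per second."""
    return updated / (took_ms / 1000)


def hours_for(total_docs, rate):
    """Hours needed to update `total_docs` at a constant `rate` (docs/s)."""
    return total_docs / rate / 3600


for name, run in RUNS.items():
    rate = docs_per_second(**run)
    print(f"{name}: {rate:,.0f} docs/s, "
          f"~{hours_for(WORST_CASE_DOCS, rate):,.0f} h for the worst case")
```

Even at the best rate observed here, a full cascade update of the worst-case environment would take on the order of days under this linear assumption.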


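All the queries were executed with the parameters slices=auto and wait_for_active_shards=all, aiming to parallelize the update process across the cluster's three nodes. Note that the index has a single primary shard and no replicas (the "1 0" columns in the index listing above), and slices=auto uses one slice per shard, so the update effectively ran as a single slice.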

@wazuhci wazuhci moved this from In progress to Blocked in XDR+SIEM/Release 5.0.0 Feb 14, 2025
@wazuhci wazuhci moved this from Blocked to In progress in XDR+SIEM/Release 5.0.0 Feb 14, 2025
@wazuhci wazuhci moved this from In progress to Pending review in XDR+SIEM/Release 5.0.0 Feb 15, 2025
@wazuhci wazuhci moved this from Pending review to Blocked in XDR+SIEM/Release 5.0.0 Feb 17, 2025
@wazuhci wazuhci moved this from Blocked to On hold in XDR+SIEM/Release 5.0.0 Feb 17, 2025
Member Author

AlexRuiz7 commented Feb 18, 2025

We consider this issue complete as the performance analysis has concluded. The poor performance results lead us to explore other alternatives, such as Index Transforms.

See #694.

@wazuhci wazuhci moved this from On hold to Done in XDR+SIEM/Release 5.0.0 Feb 18, 2025