Random failed tests on routeros devices #121

Closed
hart323 opened this issue Sep 26, 2024 · 7 comments

hart323 commented Sep 26, 2024

Validity Version

3.0.3

NetBox Version

4.1.1

Python Version

3.12.3

Steps to Reproduce

As the number of routeros tests and routeros devices grew (8 tests/33 devices as of now), I noticed that tests fail at random. For example, if I run 8 tests against 1 device, all tests pass, but if I run 8 tests against 33 devices, only around 80% of the tests pass and the rest fail.
This is how a particular test for a particular device may look after the scheduled 8/33 run has finished:
[screenshot]

The failures seem to be caused by netmiko, or by validity collecting results from netmiko, because the command result file may contain the following lines:

POLLING ERROR
AssertionError: 

or trimmed output like:

             enabled: yes
         primary-ntp: 10.10.10.10
       secondary-ntp: 10.10.10.11
    server-dns-names: 
                mode: unicast
       poll-interval: 15m
       active-server: 10.10.10.10
    last-update-from: 10.10.10.10
  las

or some other strange lines which look like lines from the mikrotik config but are unrelated to the command sent to the device. I will attach an example when I find one.

I added the line session_log: /tmp/netmiko_session.log to the routeros poller config. This log file contains all command outputs from the device, yet the device result file does not contain the data.

Note: I don't see the same behavior on cisco.ios devices.

Traceback

No tracebacks

amyasnikov (Owner) commented

Let's break it down:

  1. There are 2 separate processes inside Validity: test execution and device polling (they can be run together for convenience, but in general they have nothing to do with each other). So the former (test execution) seems to be completely unrelated to the issue.
  2. As far as I can see, there is some problem with netmiko (or with the network itself, network_issue.jpg). AssertionError likely means the output received from a device is empty. As you may know, SSH/telnet are not that reliable for network automation.
  3. The (seemingly correct) answer being present in the logs but absent from the polling result might mean netmiko could not find the trailing prompt ([MyMikrotik] > ). I'm not sure here, it's just a guess.

Hence:

  1. I can't fix it unless it becomes reproducible (e.g. you describe the exact steps to reproduce it with a clean netbox/validity setup and some generic mikrotik).
  2. Try to play with timeouts (via poller params).
  3. I'm not sure whether netmiko disables colors, so you can try to do it yourself (use myuser+ct as your username).
  4. You can try to reproduce it without Validity, using netmiko directly from a python shell.
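
The last three suggestions can be tried together outside Validity. Below is a minimal sketch, assuming netmiko 4.x is installed. The helper names build_device_params and find_empty_outputs, the address 192.0.2.1, and the credentials are illustrative and not part of Validity or netmiko. It appends +ct to the username, sets a session log and a longer read timeout, and reports which commands came back empty.

```python
def build_device_params(host, username, password, *, disable_colors=True,
                        session_log=None, read_timeout=30.0):
    """Build keyword arguments for netmiko.ConnectHandler (RouterOS device).

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    # "+ct" is the RouterOS login suffix suggested in the thread above.
    if disable_colors and "+" not in username:
        username = f"{username}+ct"
    params = {
        "device_type": "mikrotik_routeros",
        "host": host,
        "username": username,
        "password": password,
        # Overrides the default read_timeout of send_command/send_command_timing.
        "read_timeout_override": read_timeout,
    }
    if session_log:
        params["session_log"] = session_log
    return params


def find_empty_outputs(params, commands, repeat=1):
    """Run each command `repeat` times and return (attempt, command) pairs
    whose output was empty."""
    # Imported lazily so the helper above stays usable without netmiko.
    from netmiko import ConnectHandler

    empty = []
    with ConnectHandler(**params) as conn:
        for attempt in range(repeat):
            for command in commands:
                output = conn.send_command(command)
                if not output.strip():
                    empty.append((attempt, command))
    return empty


# Example call (placeholder address and credentials):
# params = build_device_params("192.0.2.1", "myuser", "secret",
#                              session_log="/tmp/netmiko_session.log")
# print(find_empty_outputs(params, ["/system ntp client print"], repeat=20))
```

Repeating the same command list many times against one device is one way to see whether the empty responses depend on load or also show up on a single quiet session.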

hart323 commented Sep 27, 2024

It seems like this is solely a mikrotik-netmiko issue. I found some similar issues:
ktbyers/netmiko#2880
ktbyers/netmiko#2512
I have to figure out what to do about this.
Closing the issue.

hart323 closed this as completed Sep 27, 2024
hart323 reopened this Oct 3, 2024

hart323 commented Oct 3, 2024

I did a little investigation and found the following.
A plain netmiko script that pushes a list of commands through send_command indeed does not keep up with the input/output of ROS, and I got empty responses on random commands.
I found that netmiko has another, delay-based mechanism for sending commands, send_command_timing, which is generally used for show commands
(https://ktbyers.github.io/netmiko/docs/netmiko/index.html#netmiko.BaseConnection.send_command_timing).

So I replaced

return driver.send_command(command.parameters["cli_command"])

with

return driver.send_command_timing(command.parameters["cli_command"], last_read=0.1, read_timeout=30, cmd_verify=True)

and for the first time all my tests succeeded.

Maybe you could extend the validity poller so that the method and its arguments can be specified, or just replace send_command with send_command_timing?
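
To make the request concrete, here is a rough sketch of choosing the netmiko method per platform. Everything in it is an assumption for illustration: METHOD_BY_PLATFORM, run_cli_command and FakeDriver are hypothetical names, not Validity or netmiko APIs. Only the method names send_command and send_command_timing and the keyword arguments come from the snippet above.

```python
# Hypothetical per-platform table: method name plus keyword arguments.
METHOD_BY_PLATFORM = {
    "mikrotik_routeros": (
        "send_command_timing",
        {"last_read": 0.1, "read_timeout": 30, "cmd_verify": True},
    ),
}
# Platforms not listed above keep the pattern-based default.
DEFAULT_METHOD = ("send_command", {})


def run_cli_command(driver, platform, cli_command):
    """Call the configured netmiko method for this platform on `driver`."""
    method_name, kwargs = METHOD_BY_PLATFORM.get(platform, DEFAULT_METHOD)
    return getattr(driver, method_name)(cli_command, **kwargs)


class FakeDriver:
    """Stand-in for a netmiko connection so the sketch runs without a device."""

    def send_command(self, command_string, **kwargs):
        return f"pattern:{command_string}"

    def send_command_timing(self, command_string, **kwargs):
        return f"timing:{command_string}:{kwargs['last_read']}"


driver = FakeDriver()
print(run_cli_command(driver, "mikrotik_routeros", "/system ntp client print"))
print(run_cli_command(driver, "cisco_ios", "show ntp status"))
```

Keeping the choice per platform would let routeros devices use the delay-based method while other platforms stay on send_command.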

hart323 commented Oct 3, 2024

Oops, Mikrotik tests succeed, but now Cisco tests fail at random :(

hart323 commented Oct 3, 2024

I reverted the code to the original send_command(), but added 2 options to the poller settings:

{
  "disable_lf_normalization": "True",
  "global_cmd_verify": "True"
}

Short tests on routeros and ios now return valid results. Let's see how things go further.
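
One detail worth noting about the settings above: the values are the string "True", not JSON booleans. In Python any non-empty string is truthy, so "False" written the same way would also count as true wherever the value is only checked for truthiness. How Validity and netmiko treat these values is not verified here. The sketch below, with the hypothetical helper normalize_bool_params, shows one way to turn such strings into real booleans before passing them on.

```python
import json


def normalize_bool_params(params):
    """Return a copy of `params` with "true"/"false" strings (any case)
    replaced by real booleans. Other values are left untouched."""
    mapping = {"true": True, "false": False}
    result = {}
    for key, value in params.items():
        if isinstance(value, str) and value.strip().lower() in mapping:
            result[key] = mapping[value.strip().lower()]
        else:
            result[key] = value
    return result


# The poller settings exactly as posted above.
raw = json.loads('{"disable_lf_normalization": "True", "global_cmd_verify": "True"}')
connect_kwargs = normalize_bool_params(raw)
print(connect_kwargs)
```

Writing the settings with JSON booleans (true without quotes) would avoid the ambiguity, provided the poller accepts them.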

amyasnikov (Owner) commented

> Maybe you could extend the validity poller so that the method and its arguments can be specified, or just replace send_command with send_command_timing?

I can consider adding custom user-defined pollers via plugin settings.

amyasnikov (Owner) commented

I've implemented a mechanism for custom user-defined pollers. So, in the next version you'll be able to define your own poller (e.g. based on the current netmiko poller) with parameters suitable for your case.
