Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve rpki validation #62

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

jurraca
Copy link
Contributor

@jurraca jurraca commented Feb 18, 2025

TLDR: by passing multiple ROA files to the -f files, it's about 2x faster. This PR removes parallelization via ThreadPoolExecutor() which may result in even more gains.

Putting this up for concept ACK (@fjahr), while the impl works and reproduces existing validation, need to run more sanity checks as the output of rpki_raw.json is fairly different.

Impl

The rpki-client '-f' flag can handle multiple ROA files 1. Instead of passing files one by one, which means invoking the rpki-client via a subprocess call, we pass them in batches of 250.
This means we get back a binary blob of concatenated JSON, without delimiters. We match on the separator between items b"\n}\n{\n\t" (which is unique to defining the end of one item and the beginning of another), and insert a JSON-valid separator.
Since now we just have JSON, we can write to a JSON file for each batch without building up results in memory.
I removed the parallelism via ThreadPoolExecutor since it seemed to be mostly a feature to speed up the previous implementation. I couldn't get it to work with the new setup but didn't look too hard, it may be worth exploring to get some more performance improvements. On my machine, this implementation is a little more than 2x faster than the current. On my machine, the non-parallelized version is about 2x faster, and the parallelized version another 7 or 8x faster.

Testing

I noticed some significant differences in the intermediate rpki_raw.json file, but these didn't seem to impact the rpki_final.txt file. i.e. the RPKI validation results in the same output. Notably, this approach seems to reduce duplicates in the rpki_raw.json.
To test, you want to have an existing run with the old version (or run one yourself), and then run a reproduction run with this branch. I got different hashes for the rpki_final.txt but the content was identical. I'm not sure why this is, and I'm still checking through this. I ran runs and got the same rpki_final.txt hashes across different reproductions runs. I did get different final_result.txt hashes but I don't think it's related to this change (to be confirmed).

This is rebased on #61 (bugfix).


json_results = [result.decode() for result in results if result]
total = len(files)
batch_size = 250
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should test more batch size values. 250 worked out better than 100 and 500, so I stuck with it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already a great win, it can be optimized further later. Let's first make sure it's robust and get it in asap :)

@jurraca jurraca marked this pull request as draft February 18, 2025 16:56
@jurraca jurraca force-pushed the improve-rpki-validation branch from c6e995a to 1f5eaa5 Compare February 20, 2025 13:26
normalized = result.replace(b"\n}\n{\n\t", b"\n},\n{\n").decode('utf-8').strip()
res_file.write(normalized)
completed += 1
if completed < total_batches:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keeping track of completed vs total, so that we write a comma between each batch except after the last batch, which would result in invalid JSON.

@jurraca
Copy link
Contributor Author

jurraca commented Feb 20, 2025

Did a fresh run to compare both master and this branch:

Run from master --- Start Kartograf ---

Kartograf version: 0.4.8
Using rpki-client version 9.4 (recommended).
The epoch for this run is: 1740067130 (2025-02-20 15:58:50 UTC, local: 2025-02-20 15:58:50 WET)

--- Fetching RPKI ---

Downloaded TAL for AFRINIC to /home/base/code/asmap/kartograf/data/1740067130/rpki/tals/afrinic.tal, file hash: 2838ef30ea27ce5705abf5f5adb131d8c35b1f50858338a2f3c84bb207c2fa35
Downloaded TAL for APNIC to /home/base/code/asmap/kartograf/data/1740067130/rpki/tals/apnic.tal, file hash: 472e551f7c551c2e999e582b7c9437d3bee4900fe53afff62aeb28d4940ade94
Downloaded TAL for ARIN to /home/base/code/asmap/kartograf/data/1740067130/rpki/tals/arin.tal, file hash: 1f8bdb03bcc30a3b8e11fd9a87102fba250c22137a3c8baa9c81b139cb412639
Downloaded TAL for LACNIC to /home/base/code/asmap/kartograf/data/1740067130/rpki/tals/lacnic.tal, file hash: d44bb9394ab009c8b53e5efebf2a1c9450bab61a27efe00de5a3e4587a3a2f6a
Downloaded TAL for RIPE to /home/base/code/asmap/kartograf/data/1740067130/rpki/tals/ripe.tal, file hash: 59ca27ef93f23682749fcefe7c6d70fbc723343549ff9e4d3996acaff79817fb
Downloading RPKI Data, this may take a while.
`Downloaded RPKI Data, hash sum: b6d2609960abb686921b76522ad0681180568b2913f3a0946603bd222dd65b5d
...finished in 0:04:35.294664

--- Fetching IRR ---

Downloading afrinic.db.gz
Downloaded afrinic.db.gz, file hash: d6999d6a7f59dfcedb2439b849598ad7acc663d5eb10a7589ce10a9e5344cc26
Downloading apnic.db.route.gz
Downloaded apnic.db.route.gz, file hash: 5b54735f0a70db6d12bcf13ee31ff5614598d31d9d287d2c6908cd73483e85ed
Downloading apnic.db.route6.gz
Downloaded apnic.db.route6.gz, file hash: 019e93bc9f64d5a99fa21de0dd30db117fae1481741f714bcf8f59b2a83e9a50
Downloading arin.db.gz
Downloaded arin.db.gz, file hash: 8b695d0330ed234c22e780eb11855ead2c384e992bd180a68f5be3e06baef1ff
Downloading lacnic.db.gz
Downloaded lacnic.db.gz, file hash: 7638b455d19e4c282d4eb671facb0ba3159e744de96153d80a3527288807098f
Downloading ripe.db.route.gz
Downloaded ripe.db.route.gz, file hash: e861359efe1bb6fe3a83142656607ab66f93b750dccde8293f58c6a695805fbe
Downloading ripe.db.route6.gz
Downloaded ripe.db.route6.gz, file hash: e332c2110d027ef2aa3d0f7bd9518536f0829a1a4c3abe5df8c2e7ae3dbe8788
...finished in 0:01:46.593931

--- Fetching Routeviews pfx2as ---

Downloading from https://publicdata.caida.org/datasets/routing/routeviews-prefix2as/2025/02/routeviews-rv2-20250218-1200.pfx2as.gz
Downloaded /home/base/code/asmap/kartograf/data/1740067130/collectors/routeviews_pfx2asn_ip4.txt.gz, file hash: 26705935a427b0cfd7d3013d33a7b8805be12a501c05847a25cb76946e1ff5b6
Downloading from https://publicdata.caida.org/datasets/routing/routeviews6-prefix2as/2025/02/routeviews-rv6-20250219-1200.pfx2as.gz
Downloaded /home/base/code/asmap/kartograf/data/1740067130/collectors/routeviews_pfx2asn_ip6.txt.gz, file hash: e349950a8c87cdfe0ed4306244f8173ad96b821dc74fa18f189c7cb0c09ce0ea
...finished in 0:00:17.911408
--- Validating RPKI ---

Validating RPKI ROAs
295452 raw RKPI ROA files found.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 295452/295452 [30:04<00:00, 163.74it/s]
295452 RKPI ROAs validated and saved to /home/base/code/asmap/kartograf/out/1740067130/rpki/rpki_raw.json, file hash: 9cbd36d34765eb549d329fc81f0b2464623d98816574b30d8e4a5468244b0680
...finished in 0:30:22.694699

--- Parsing RPKI ---

Parsing 295452 ROAs
Result entries written: 589958
Duplicates found: 79894
Invalids found: 2122
Incompletes: 0
Non-ROA files: 0
...finished in 0:00:30.251524

--- Parsing IRR ---

Extracting afrinic.db.gz
Extracting apnic.db.route.gz
Extracting apnic.db.route6.gz
Extracting arin.db.gz
Extracting lacnic.db.gz
Extracting ripe.db.route.gz
Extracting ripe.db.route6.gz
Parsing /home/base/code/asmap/kartograf/out/1740067130/irr/afrinic.db
Parsing /home/base/code/asmap/kartograf/out/1740067130/irr/arin.db
Parsing /home/base/code/asmap/kartograf/out/1740067130/irr/ripe.db.route6
Parsing /home/base/code/asmap/kartograf/out/1740067130/irr/apnic.db.route6
Parsing /home/base/code/asmap/kartograf/out/1740067130/irr/lacnic.db
Parsing /home/base/code/asmap/kartograf/out/1740067130/irr/ripe.db.route
Parsing /home/base/code/asmap/kartograf/out/1740067130/irr/apnic.db.route
Found valid, unique entries: 1895494
...finished in 0:02:22.680409

--- Merging RPKI and IRR data ---

Parse base file to dictionary
Parse extra file to Pandas DataFrame
Merging extra prefixes that were not included in the base file.
Finished merging extra prefixes.
Finished filtering! Originally 1895494 entries filtered down to 375502
Merging base file with filtered extra file
...finished in 0:08:53.697674

--- Parsing Routeviews pfx2as ---

Unzipping /home/base/code/asmap/kartograf/data/1740067130/collectors/routeviews_pfx2asn_ip4.txt.gz
Formatting /home/base/code/asmap/kartograf/out/1740067130/collectors/routeviews_pfx2asn_ip4.txt
Unzipping /home/base/code/asmap/kartograf/data/1740067130/collectors/routeviews_pfx2asn_ip6.txt.gz
Formatting /home/base/code/asmap/kartograf/out/1740067130/collectors/routeviews_pfx2asn_ip6.txt
Cleaning /home/base/code/asmap/kartograf/out/1740067130/collectors/pfx2asn.txt
...finished in 0:00:48.958645

--- Merging Routeviews and base data ---

Parse base file to dictionary
Parse extra file to Pandas DataFrame
Merging extra prefixes that were not included in the base file.
Finished merging extra prefixes.
Finished filtering! Originally 1253905 entries filtered down to 377495
Merging base file with filtered extra file
...finished in 0:07:57.339243

--- Sorting results ---

...finished in 0:00:18.698617

--- Finishing Kartograf ---

The SHA-256 hash of the result file is: d0571e1d3b0cbcef40f0c10ef1790a1528c4fb1fc1ec31e6bbb6b3008a121fc5
Total runtime: 0:57:56.251828

Reproduction run from this branch Kartograf version: 0.4.8 Using rpki-client version 9.4 (recommended). The epoch for this run is: 1740067130 (2025-02-20 15:58:50 UTC, local: 2025-02-20 15:58:50 WET) This is a reproduction run based on the data in data/1740067130/

--- Validating RPKI ---

Validating RPKI ROAs
295452 raw RKPI ROA files found.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1182/1182 [02:15<00:00, 8.73it/s]
RKPI ROAs validated and saved to /home/base/code/asmap/kartograf/out/r1740067130/rpki/rpki_raw.json, file hash: 3fc67007c91556f6de2c3540c5a237f8970900e32b0db7c80a734460194a51ff
...finished in 0:02:41.804824

--- Parsing RPKI ---

Parsing 295452 ROAs
Result entries written: 589958
Duplicates found: 79894
Invalids found: 2122
Incompletes: 0
Non-ROA files: 0
...finished in 0:00:32.170870

--- Parsing IRR ---

Extracting afrinic.db.gz
Extracting apnic.db.route.gz
Extracting apnic.db.route6.gz
Extracting arin.db.gz
Extracting lacnic.db.gz
Extracting ripe.db.route.gz
Extracting ripe.db.route6.gz
Parsing /home/base/code/asmap/kartograf/out/r1740067130/irr/afrinic.db
Parsing /home/base/code/asmap/kartograf/out/r1740067130/irr/arin.db
Parsing /home/base/code/asmap/kartograf/out/r1740067130/irr/ripe.db.route6
Parsing /home/base/code/asmap/kartograf/out/r1740067130/irr/apnic.db.route6
Parsing /home/base/code/asmap/kartograf/out/r1740067130/irr/lacnic.db
Parsing /home/base/code/asmap/kartograf/out/r1740067130/irr/ripe.db.route
Parsing /home/base/code/asmap/kartograf/out/r1740067130/irr/apnic.db.route
Found valid, unique entries: 1895494
...finished in 0:02:23.321883

--- Merging RPKI and IRR data ---

Parse base file to dictionary
Parse extra file to Pandas DataFrame
Merging extra prefixes that were not included in the base file.
Finished merging extra prefixes.
Finished filtering! Originally 1895494 entries filtered down to 375502
Merging base file with filtered extra file
...finished in 0:09:02.695826

--- Parsing Routeviews pfx2as ---

Unzipping data/1740067130/collectors/routeviews_pfx2asn_ip4.txt.gz
Formatting /home/base/code/asmap/kartograf/out/r1740067130/collectors/routeviews_pfx2asn_ip4.txt
Unzipping data/1740067130/collectors/routeviews_pfx2asn_ip6.txt.gz
Formatting /home/base/code/asmap/kartograf/out/r1740067130/collectors/routeviews_pfx2asn_ip6.txt
Cleaning /home/base/code/asmap/kartograf/out/r1740067130/collectors/pfx2asn.txt
...finished in 0:00:48.949508

--- Merging Routeviews and base data ---

Parse base file to dictionary
Parse extra file to Pandas DataFrame
Merging extra prefixes that were not included in the base file.
Finished merging extra prefixes.
Finished filtering! Originally 1253905 entries filtered down to 377495
Merging base file with filtered extra file
...finished in 0:08:11.791731

--- Sorting results ---

...finished in 0:00:19.177818

--- Finishing Kartograf ---

The SHA-256 hash of the result file is: d0571e1d3b0cbcef40f0c10ef1790a1528c4fb1fc1ec31e6bbb6b3008a121fc5
Total runtime: 0:24:02.072061

The rpki_raw.json and rpki_final.txt files both get different hashes, however the final_result.txt matches correctly: d0571e1d3b0cbcef40f0c10ef1790a1528c4fb1fc1ec31e6bbb6b3008a121fc5. So I think we can be fairly confident the end data is the same, even as the intermediate aggregation files differ.

@jurraca jurraca marked this pull request as ready for review February 20, 2025 17:46
@jurraca jurraca mentioned this pull request Feb 25, 2025
4 tasks
@fjahr
Copy link
Collaborator

fjahr commented Mar 5, 2025

This looks awesome, great job!

This is rebased on #61 (bugfix).

Nit: it's better if you put that at the very top, otherwise I might overlook it ;)

The rpki_raw.json and rpki_final.txt files both get different hashes, however the final_result.txt matches correctly: d0571e1d3b0cbcef40f0c10ef1790a1528c4fb1fc1ec31e6bbb6b3008a121fc5. So I think we can be fairly confident the end data is the same, even as the intermediate aggregation files differ.

Sounds to me that there could be just a different order. Did you check if it's just that the results are ordered differently or is there really different content? If it's just the ordering that's great but then we should order the results before printing the hash of the PRKI result, so that we don't loose information there. I.e. if the result is order dependent and order is varying between participants, then printing the hash doesn't help us with debugging anymore.

Concept ACK on the performance improvement of course and the way it's achieved looks good to me as well. Would just like to understand better where these differences come from. I will do some testing myself after I am done with #61.

@@ -1,6 +1,7 @@
import subprocess
import sys

from concurrent.futures import ThreadPoolExecutor, as_completed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing in one commit and then re-adding in the next is a bit annoying to review. You could squash both commits and that would be a bit easier.

@fjahr
Copy link
Collaborator

fjahr commented Mar 5, 2025

May need a rebase after #61 was merged

@fjahr
Copy link
Collaborator

fjahr commented Mar 5, 2025

I ran a reproduction run of 1730210400 with this and observed a different hash rpki_raw.json but the same final hash.

I am also seeing that the rpki_raw.json is slightly smaller with the change here applied (379.3 MB vs 379.6 MB before). So that may already be the hint that it's not just a sorting issue. I would still really like to understand what the cause is before we merge. Could be that it's simply that rpki-client is omitting some duplication in the results, then it should be fine. If it's hiding some other data we should find out if we can be sure that it's never changing the outcome and we didn't just get lucky in these first tests.

@fjahr
Copy link
Collaborator

fjahr commented Mar 5, 2025

Same number of elements in both files, seems to be that either rpki-client is returning different data within the objects or some kind of formatting issue...

@fjahr
Copy link
Collaborator

fjahr commented Mar 5, 2025

Huh, now it looks like it's just different formatting. Not sure what I was doing before and why that didn't work initially but now I get the same file when I run each through jq 'sort_by(.hash_id)' rpki_raw.json > sorted.json. Can you confirm that? Probably some extra whitespace or so...

If you confirm that result then please add some canonical sorting and formatting of the raw file before we hash it. We should be doing that for any of the files we hash even it's just the intermittent results. This prevents such performance changes from throwing us off again in the future....

@jurraca
Copy link
Contributor Author

jurraca commented Mar 7, 2025

Did you check if it's just that the results are ordered differently or is there really different content?

Turns out it's the ordering, the content, and also formatting :).

I don't quite get the same files using the jq sorting, but only because the reproduction run adds the file value differently in the output JSON: in the initial run it shows the absolute path of the roa file, but the reproduction run adds a relative one. If I substitute them correctly, I get the exact same JSON content. This is addressed in bc417c8.

22c7ca8 adds sorting to the JSON result. It changes the implementation, since it's easier to accumulate the results in memory and write them once, than write the JSON string, read it, sort it, then write it again. The performance penalty is negligible on my machine, though it may be worth measuring this more precisely on memory-constrained devices.

It also dumps the result as a single string to the file (not pretty-printed), which is less human readable at a glance. If all we need is the data and a hash for the file, I'd prefer to avoid potential whitespace errors or formatting mismatches at the expense of readability, but lmk if you think otherwise.

I did a fresh run, and the reproduction run for it, and got the same hash for rpki_raw.json and the final_result. Might make sense for us to do a run together for extra sanity check?

jurraca added 4 commits March 7, 2025 10:49
The rpki-client '-f' flag can handle multiple ROA files. Instead of
passing files one by one, pass them in batches of 250.
Instead of building a list of `json_results` in memory, write to
the result file as they return in batches.
This requires adding a delimiter to the resulting JSON stream, which
does not have any by default.
Finally, this removes the parallelization via ThreadPoolExecutor, and is
about 2x faster without it. May want to bring this back.
this commit brings back the ThreadPoolExecutor ommitted in the previous
commit, and results in significant performance gains (~7x). The output
rpki_raw.json gets a different hash (expected) and the final_result.txt
file gets the same hash as running with the previous commit and linear
processing.
this stores the JSON results in memory, then sorts and writes them to
disk, avoiding the need to handle concatenating JSON strings.
@jurraca jurraca force-pushed the improve-rpki-validation branch from 8179a08 to 22c7ca8 Compare March 7, 2025 10:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants