Improve rpki validation #62
base: master
Conversation
```python
json_results = [result.decode() for result in results if result]
total = len(files)
batch_size = 250
```
I should test more batch size values. 250 worked out better than 100 and 500, so I stuck with it.
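For reference, a minimal sketch of the batching approach under discussion; the helper name is hypothetical and most rpki-client flags are omitted, so this is an illustration rather than the PR's actual code:

```python
# Sketch only: helper name and flag set are assumptions, not the PR's code.
import subprocess

def validate_in_batches(files, batch_size=250):
    outputs = []
    for i in range(0, len(files), batch_size):
        batch = files[i:i + batch_size]
        # rpki-client's -f flag accepts multiple file arguments, so a
        # single subprocess call validates the whole batch; -j requests
        # JSON output (other flags omitted for brevity).
        proc = subprocess.run(
            ["rpki-client", "-j", "-f", *batch],
            capture_output=True,
        )
        outputs.append(proc.stdout)  # concatenated JSON objects, as bytes
    return outputs
```

Tuning `batch_size` trades subprocess startup overhead against command-line length and per-call memory, which is why a few values (100, 250, 500) are worth benchmarking.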
This is already a great win; it can be optimized further later. Let's first make sure it's robust and get it in asap :)
Force-pushed from c6e995a to 1f5eaa5
kartograf/rpki/fetch.py (Outdated)
```python
normalized = result.replace(b"\n}\n{\n\t", b"\n},\n{\n").decode('utf-8').strip()
res_file.write(normalized)
completed += 1
if completed < total_batches:
```
Keeping track of completed vs. total batches so that we write a comma between batches but not after the last one; a trailing comma would result in invalid JSON.
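To make the surrounding logic concrete, here is a sketch of how the pieces could fit together; the enclosing `[`/`]` writes and the variable names are my assumptions, since the diff above only shows the middle of the loop:

```python
# Sketch: wrap the batches in a JSON array, separating them with commas.
completed = 0
total_batches = len(batch_results)  # raw bytes returned per batch

with open("rpki_raw.json", "w", encoding="utf-8") as res_file:
    res_file.write("[")  # assumption: an enclosing JSON array
    for result in batch_results:
        # Insert commas between the concatenated objects within a batch.
        normalized = (
            result.replace(b"\n}\n{\n\t", b"\n},\n{\n").decode("utf-8").strip()
        )
        res_file.write(normalized)
        completed += 1
        if completed < total_batches:
            res_file.write(",")  # between batches, but not after the last
    res_file.write("]")
```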
Did a fresh run to compare both master and this branch:

Run from master

```
--- Start Kartograf ---
Kartograf version: 0.4.8
--- Fetching RPKI ---
Downloaded TAL for AFRINIC to /home/base/code/asmap/kartograf/data/1740067130/rpki/tals/afrinic.tal, file hash: 2838ef30ea27ce5705abf5f5adb131d8c35b1f50858338a2f3c84bb207c2fa35
--- Fetching IRR ---
Downloading afrinic.db.gz
--- Fetching Routeviews pfx2as ---
Downloading from https://publicdata.caida.org/datasets/routing/routeviews-prefix2as/2025/02/routeviews-rv2-20250218-1200.pfx2as.gz
Validating RPKI ROAs
--- Parsing RPKI ---
Parsing 295452 ROAs
--- Parsing IRR ---
Extracting afrinic.db.gz
--- Merging RPKI and IRR data ---
Parse base file to dictionary
--- Parsing Routeviews pfx2as ---
Unzipping /home/base/code/asmap/kartograf/data/1740067130/collectors/routeviews_pfx2asn_ip4.txt.gz
--- Merging Routeviews and base data ---
Parse base file to dictionary
--- Sorting results ---
...finished in 0:00:18.698617
--- Finishing Kartograf ---
The SHA-256 hash of the result file is: d0571e1d3b0cbcef40f0c10ef1790a1528c4fb1fc1ec31e6bbb6b3008a121fc5
```

Reproduction run from this branch

```
Kartograf version: 0.4.8
Using rpki-client version 9.4 (recommended).
The epoch for this run is: 1740067130 (2025-02-20 15:58:50 UTC, local: 2025-02-20 15:58:50 WET)
This is a reproduction run based on the data in data/1740067130/
--- Validating RPKI ---
Validating RPKI ROAs
--- Parsing RPKI ---
Parsing 295452 ROAs
--- Parsing IRR ---
Extracting afrinic.db.gz
--- Merging RPKI and IRR data ---
Parse base file to dictionary
--- Parsing Routeviews pfx2as ---
Unzipping data/1740067130/collectors/routeviews_pfx2asn_ip4.txt.gz
--- Merging Routeviews and base data ---
Parse base file to dictionary
--- Sorting results ---
...finished in 0:00:19.177818
--- Finishing Kartograf ---
The SHA-256 hash of the result file is: d0571e1d3b0cbcef40f0c10ef1790a1528c4fb1fc1ec31e6bbb6b3008a121fc5
```
This looks awesome, great job!
Nit: it's better if you put that at the very top, otherwise I might overlook it ;)
Sounds to me like it could just be a different ordering. Did you check whether the results are merely ordered differently, or is there really different content? If it's just the ordering, that's great, but then we should order the results before printing the hash of the RPKI result, so that we don't lose information there. I.e., if the result is order-dependent and the order varies between participants, then printing the hash doesn't help us with debugging anymore. Concept ACK on the performance improvement of course, and the way it's achieved looks good to me as well. Would just like to understand better where these differences come from. I will do some testing myself after I am done with #61.
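One way to answer the ordering-vs-content question is to compare the two raw files as sorted lists of entries; a small sketch (the file paths in the usage comment are hypothetical):

```python
import json

def canonical_key(entry):
    # Serialize each entry deterministically so dict key order is irrelevant.
    return json.dumps(entry, sort_keys=True)

def same_content_ignoring_order(path_a, path_b):
    """True if both JSON arrays hold the same entries, in any order."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    return sorted(a, key=canonical_key) == sorted(b, key=canonical_key)

# e.g. same_content_ignoring_order("master/rpki_raw.json", "branch/rpki_raw.json")
```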
```diff
@@ -1,6 +1,7 @@
 import subprocess
 import sys

+from concurrent.futures import ThreadPoolExecutor, as_completed
```
Removing in one commit and then re-adding in the next is a bit annoying to review. You could squash both commits and that would be a bit easier.
May need a rebase now that #61 has been merged.
I ran a reproduction run of … I am also seeing that the …
Same number of elements in both files; seems to be that either …
Huh, now it looks like it's just different formatting. Not sure what I was doing before and why that didn't work initially, but now I get the same file when I run each through …

If you confirm that result, then please add some canonical sorting and formatting of the raw file before we hash it. We should be doing that for any of the files we hash, even if it's just the intermediate results. This prevents such performance changes from throwing us off again in the future.
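A sketch of what such canonicalization before hashing could look like, assuming the raw file parses as a JSON array (function name illustrative):

```python
import hashlib
import json

def canonical_hash(path):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Sort entries on a deterministic serialization, then re-serialize
    # compactly so neither ordering nor whitespace can change the hash.
    data.sort(key=lambda e: json.dumps(e, sort_keys=True))
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```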
Turns out it's the ordering, the content, and also the formatting :). I don't quite get the same files using the …

22c7ca8 adds sorting to the JSON result. It changes the implementation, since it's easier to accumulate the results in memory and write them once than to write the JSON string, read it back, sort it, and write it again. The performance penalty is negligible on my machine, though it may be worth measuring this more precisely on memory-constrained devices.

It also dumps the result as a single string to the file (not pretty-printed), which is less human-readable at a glance. If all we need is the data and a hash for the file, I'd prefer to avoid potential whitespace errors or formatting mismatches at the expense of readability, but lmk if you think otherwise.

I did a fresh run, and the reproduction run for it, and got the same hash for …
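Roughly the shape of that approach, as a sketch rather than the actual 22c7ca8 diff (`batch_results` and the file name are illustrative):

```python
import json

all_results = []
for result in batch_results:  # bytes returned by each rpki-client batch
    text = result.replace(b"\n}\n{\n\t", b"\n},\n{\n").decode("utf-8").strip()
    all_results.extend(json.loads(f"[{text}]"))  # parse the batch as a list

# Deterministic order: sort on a canonical serialization of each entry.
all_results.sort(key=lambda e: json.dumps(e, sort_keys=True))
with open("rpki_raw.json", "w", encoding="utf-8") as res_file:
    # One compact dump: no pretty-printing, so the hash cannot be thrown
    # off by whitespace or formatting differences.
    res_file.write(json.dumps(all_results, separators=(",", ":")))
```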
The rpki-client '-f' flag can handle multiple ROA files. Instead of passing files one by one, pass them in batches of 250. Instead of building a list of `json_results` in memory, write to the result file as results return in batches. This requires adding a delimiter to the resulting JSON stream, which does not have one by default. Finally, this removes the parallelization via ThreadPoolExecutor; it is about 2x faster without it. May want to bring this back.
This commit brings back the ThreadPoolExecutor omitted in the previous commit, resulting in significant performance gains (~7x). The output rpki_raw.json gets a different hash (expected), and the final_result.txt file gets the same hash as running with the previous commit and linear processing.
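For illustration, the parallelized batching could look roughly like this; `run_rpki_client_batch` and the worker count are hypothetical, not taken from the commit:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def validate_all(batches, workers=8):
    outputs = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_rpki_client_batch, b) for b in batches]
        # Threads suffice here: the heavy work happens inside the
        # rpki-client subprocess, so Python's GIL is not the bottleneck.
        for future in as_completed(futures):
            outputs.append(future.result())
    return outputs  # note: completion order varies from run to run
```

The varying completion order is consistent with rpki_raw.json hashing differently between runs, which is why sorting before hashing matters.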
This stores the JSON results in memory, then sorts and writes them to disk, avoiding the need to handle concatenating JSON strings.
Force-pushed from 8179a08 to 22c7ca8
TLDR: by passing multiple ROA files to the `-f` flag instead of single files, it's about 2x faster. ~~This PR removes parallelization via `ThreadPoolExecutor()`, which may result in even more gains.~~

Putting this up for concept ACK (@fjahr); while the impl works and reproduces existing validation, I need to run more sanity checks, as the output of `rpki_raw.json` is fairly different.

Impl
The rpki-client `-f` flag can handle multiple ROA files [1]. Instead of passing files one by one, which means invoking rpki-client via a separate subprocess call for each, we pass them in batches of 250.

This means we get back a binary blob of concatenated JSON without delimiters. We match on the separator between items, `b"\n}\n{\n\t"` (which uniquely marks the end of one item and the beginning of the next), and insert a JSON-valid separator. Since we then have valid JSON, we can write each batch to the JSON file without building up results in memory.
~~I removed the parallelism via `ThreadPoolExecutor` since it seemed to be mostly a feature to speed up the previous implementation. I couldn't get it to work with the new setup but didn't look too hard; it may be worth exploring to get some more performance improvements. On my machine, this implementation is a little more than 2x faster than the current one.~~ On my machine, the non-parallelized version is about 2x faster, and the parallelized version another 7 or 8x faster.

Testing
I noticed some significant differences in the intermediate `rpki_raw.json` file, but these didn't seem to impact the `rpki_final.txt` file, i.e. the RPKI validation results in the same output. Notably, this approach seems to reduce duplicates in `rpki_raw.json`.

To test, you want to have an existing run with the old version (or run one yourself), and then do a reproduction run with this branch.
~~I got different hashes for the `rpki_final.txt` but the content was identical. I'm not sure why this is, and I'm still checking through this.~~ I ran runs and got the same `rpki_final.txt` hashes across different reproduction runs. I did get different `final_result.txt` hashes, but I don't think that's related to this change (to be confirmed).

This is rebased on #61 (bugfix).