Corrector optimization #106
Conversation
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No coverage information.
Hello @VitaliStupin! Thank you for the pull request, we will review it as soon as possible.
Performing sanitise_document and correct_structure in worker threads instead of the main thread. Computations in the main thread can use only one CPU core and therefore negatively impact corrector throughput.
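A minimal sketch of this offloading pattern using a standard `multiprocessing.Pool`; `sanitize_document` and `correct_structure` here are simplified stand-ins for the corrector's actual helpers, not its real implementation:

```python
from multiprocessing import Pool

def sanitize_document(doc):
    # Stand-in for the corrector's sanitisation step.
    return {k: v for k, v in doc.items() if v is not None}

def correct_structure(doc):
    # Stand-in for the corrector's structure-correction step.
    doc.setdefault("client", None)
    doc.setdefault("producer", None)
    return doc

def process_document(doc):
    # Runs entirely inside a worker process, so a batch can use
    # as many CPU cores as the pool has workers.
    return correct_structure(sanitize_document(doc))

if __name__ == "__main__":
    batch = [{"id": 1, "client": {"xRequestId": "abc"}, "noise": None}]
    with Pool(processes=4) as pool:
        print(pool.map(process_document, batch))
```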
Deleting duplicates in worker threads instead of returning a to_remove_queue and slowly processing it in the main thread (see the sketch after this list).

Deleting code that was broken and no longer used after the corrector started matching documents by xRequestId:
* Removing the check whether a document marked as duplicate exists in clean_data, because duplicates get deleted in any case
* Removing the special handling for duplicate documents without requestInTs

Removed the addition of the deleted raw documents count to the total number of documents processed, as these were already part of the processed batch.
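To illustrate the first point, a sketch of deleting duplicates directly inside the worker that found them, assuming pymongo; the `query_db`/`raw_messages` names and the connection URI are hypothetical. One bulk `delete_many` replaces the old pattern of queueing ids back to the main thread:

```python
from pymongo import MongoClient

def delete_duplicates(duplicate_ids, mongo_uri="mongodb://localhost:27017"):
    # Called from the worker that detected the duplicates, instead of
    # returning the ids to the main thread via to_remove_queue.
    collection = MongoClient(mongo_uri)["query_db"]["raw_messages"]
    if not duplicate_ids:
        return 0
    return collection.delete_many({"_id": {"$in": duplicate_ids}}).deleted_count
```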
The corrector did not check whether a processed document already exists in clean_data after it started matching documents by xRequestId. Adding duplicate detection.
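A sketch of the restored check, assuming pymongo; matching on `xRequestId` alone is a simplification of the corrector's actual matching logic:

```python
def is_duplicate(clean_data, document):
    # Look the document up in clean_data by xRequestId before inserting;
    # the projection keeps the round trip cheap (no full document fetched).
    x_request_id = document.get("xRequestId")
    if x_request_id is None:
        return False
    match = clean_data.find_one(
        {"xRequestId": x_request_id}, projection={"_id": 1}
    )
    return match is not None
```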
Using multiprocessing for updating the status of orphans that have reached their timeout. Fetching only document ids instead of full documents.
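A sketch of both ideas together, assuming pymongo; the connection details, collection name, and the `corrected`/`requestInTs` fields are assumptions rather than the corrector's actual schema:

```python
from multiprocessing import Pool
from pymongo import MongoClient

MONGO_URI = "mongodb://localhost:27017"  # assumption

def mark_timed_out(id_chunk):
    # Each worker opens its own client: pymongo clients must not be
    # shared across process forks.
    collection = MongoClient(MONGO_URI)["query_db"]["raw_messages"]
    collection.update_many(
        {"_id": {"$in": id_chunk}}, {"$set": {"corrected": True}}
    )

def process_timed_out_orphans(timeout_ts, workers=4, chunk_size=1000):
    collection = MongoClient(MONGO_URI)["query_db"]["raw_messages"]
    # Projection fetches only ids, not full documents.
    ids = [d["_id"] for d in collection.find(
        {"corrected": False, "requestInTs": {"$lt": timeout_ts}},
        projection={"_id": 1},
    )]
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    with Pool(processes=workers) as pool:
        pool.map(mark_timed_out, chunks)
```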
Using multiprocessing for processing documents without xRequestId. Adding documents without xRequestId to the total number of documents processed.
The corrector currently uses the slow and deprecated (mozilla/bleach#698) bleach library. Since X-Road metrics should not contain HTML, it is more beneficial to simply use Python's translate method and replace potentially dangerous HTML characters. translate does not parse HTML and is estimated to be 100 times faster than bleach. Using the translate method instead of bleach.clean. Renaming sanitise -> sanitize to be consistent with the rest of the code.
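A sketch of the `str.translate` approach; the exact replacement table is an assumption, the point being that dangerous characters are escaped in a single pass without any HTML parsing:

```python
# Single-pass character substitution; no HTML parsing involved.
_HTML_ESCAPES = str.maketrans({
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#x27;",
})

def sanitize(value):
    # Non-strings (numbers, None, nested structures) pass through untouched.
    return value.translate(_HTML_ESCAPES) if isinstance(value, str) else value

print(sanitize('<script>alert("x")</script>'))
# -> &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
```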
Quality Gate failed: failed conditions.
Thank you @VitaliStupin
The corrector currently uses multiprocessing only for raw data processing; many tasks are performed in a single main thread that can use only one CPU core, which negatively impacts corrector throughput.
This pull request optimizes the corrector to increase throughput. Additionally, a check whether a processed document already exists in clean_data is added (this check existed before the last corrector rewrite but has been broken since then).