
Corrector optimization #106

Merged

Conversation

VitaliStupin
Contributor

Corrector is currently using multiprocessing only for raw data processing; many tasks are performed in a single main thread that can use only one CPU core, which negatively impacts corrector throughput.

This pull request optimizes the corrector to increase throughput. Additionally, a check whether a processed document already exists in clean_data is added (this check existed before the last corrector rewrite, but has been broken since then).

  • Performing sanitise_document and correct_structure in worker threads instead of the main thread (see the sketch after this list).
  • Deleting duplicates in the worker threads instead of returning to_remove_queue and slowly processing it in the main thread.
  • Adding a check whether a processed document already exists in clean_data (this was broken after the "matching of documents by xRequestId" commit).
  • Using multiprocessing for updating the status of orphans that reached the timeout.
  • Using multiprocessing for processing documents without xRequestId.
  • Optimizing the document sanitizer (using the translate method instead of the slow and deprecated bleach.clean).
  • Fixing the calculation of the total number of documents processed by the corrector.
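
For illustration, a minimal sketch of the worker-pool idea behind the first two items. The function names sanitise_document and correct_structure come from this pull request; process_document, process_batch and the pool wiring are hypothetical, not the corrector's actual code:

```python
# Sketch only: moves per-document work out of the main thread so it can use
# more than one CPU core. Function bodies are placeholders, not corrector code.
from multiprocessing import Pool


def sanitise_document(doc):
    """Placeholder for the corrector's sanitiser."""
    return doc


def correct_structure(doc):
    """Placeholder for the corrector's structure correction."""
    return doc


def process_document(doc):
    # Work that previously ran in the main thread now runs in a worker process.
    return correct_structure(sanitise_document(doc))


def process_batch(documents, workers=4):
    # Each worker handles a share of the batch, so throughput scales with cores.
    with Pool(processes=workers) as pool:
        return pool.map(process_document, documents)
```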


Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No coverage information
0.0% Duplication

Contributor

raits commented Nov 30, 2023

Hello @VitaliStupin!

Thank you for the pull request, we will review it as soon as possible.

Performing sanitise_document and correct_structure in worker threads instead of the main thread. Computations in the main thread can use only one CPU core and therefore negatively impact corrector throughput.
Deleting duplicates in threads instead of returning to_remove_queue and slowly processing it in the main thread.

Deleting code that was broken and no longer used after the corrector started matching documents by xRequestId:
* Removing the check whether a document marked as duplicate exists in clean_data, because duplicates get deleted in any case
* Removing the special handling of duplicate documents without requestInTs

Removed the addition of the deleted raw documents count to the total number of documents processed, as these were already part of the processed batch.
The corrector did not check whether a processed document already exists in clean_data after it started matching documents by xRequestId.
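
For illustration only, a minimal pymongo sketch of such an existence check. The collection and field names (clean_data, xRequestId) come from this discussion, while the connection details and query shape are assumptions rather than the corrector's actual code:

```python
# Sketch only: check whether a document with the same xRequestId is already
# stored in clean_data before inserting it again.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # hypothetical connection
clean_data = client["query_db"]["clean_data"]       # hypothetical database name


def already_processed(x_request_id):
    # limit=1 stops the count as soon as one match is found.
    return clean_data.count_documents({"xRequestId": x_request_id}, limit=1) > 0
```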

Adding duplicate detection.
Using multiprocessing for updating the status of orphans that reached the timeout.

Fetching only document ids instead of full documents (see the sketch below).
Using multiprocessing for processing documents without xRequestId.
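
To illustrate the orphan handling described above (fetching only document ids and updating timed-out orphans in bulk), a minimal sketch; all field and status names here are assumptions, not the corrector's actual schema:

```python
# Sketch only: project only _id when finding timed-out orphans, then mark them
# in a single bulk update instead of loading and saving full documents.
import time


def mark_timed_out_orphans(raw_messages, timeout_seconds):
    # Assumes insertTime is a Unix timestamp in seconds (assumption).
    cutoff = time.time() - timeout_seconds
    orphan_ids = [
        doc["_id"]
        for doc in raw_messages.find(
            {"corrected": False, "insertTime": {"$lt": cutoff}},
            projection={"_id": 1},  # fetch only ids, not full documents
        )
    ]
    if not orphan_ids:
        return 0
    result = raw_messages.update_many(
        {"_id": {"$in": orphan_ids}},
        {"$set": {"corrected": True, "correctorStatus": "done"}},
    )
    return result.modified_count
```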

Adding documents without xRequestId to the total number of documents processed.
The corrector currently uses the slow and deprecated (mozilla/bleach#698) bleach library. Since X-Road metrics should not contain HTML, it is more beneficial to simply use Python's translate method and replace potentially dangerous HTML characters. translate does not parse HTML and is estimated to be about 100 times faster than bleach.

Using translate method instead of bleach.clean.
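
As a rough illustration of the translate-based approach, a minimal sketch; the exact set of characters escaped by the corrector is an assumption:

```python
# Sketch only: escape potentially dangerous HTML characters with str.translate
# instead of parsing the value with bleach.clean. translate makes a single pass
# over the string, which is why it is much faster than an HTML sanitiser.
_HTML_ESCAPES = str.maketrans({
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#x27;",
})


def sanitize_value(value):
    # Only strings need escaping; other types are returned unchanged.
    if isinstance(value, str):
        return value.translate(_HTML_ESCAPES)
    return value
```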

Renaming sanitise -> sanitize to be consistent with the rest of the code.
@melbeltagy force-pushed the corrector-optimization branch from 23a8d23 to 28637e1 on May 29, 2024 19:54

Quality Gate failed

Failed conditions:
3.4% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

Contributor

@melbeltagy left a comment


Thank you @VitaliStupin

@melbeltagy merged commit 9999439 into nordic-institute:develop on May 29, 2024
6 of 7 checks passed