-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue with only max 20 XML sitemap files generated on cron jobs #161
base: dev
Are you sure you want to change the base?
Conversation
…t tools (as curl)
…ty. All paths either return or exit() through class Tools
…tools for large shops in cron jobs
Just to clarify, the bugfix has been in production for several weeks now, although with the latest module release. I have now deployed the development branch containing the pull request to the server for testing, and it is still functioning as intended. 74 required sitemaps have been generated successfully with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dkarvounaris! LGTM!
I updated your PR to include the table required for our processes. Could you please write a short "How to test"? Something like "Have X products in your store, run CRON from CLI, you should have X sitemaps, but before PR you have 20 sitemaps"? You also forgot about translation domains.
@kpodemski Thank you! I added the requested info. If you require something else from me for the pull request to be complete to be merged, please let me know. |
Co-authored-by: Krystian Podemski <kpodemski@users.noreply.github.com>
@PrestaShop/qa-team it's quite a technical PR, I'd like to do the tests myself |
Hello @kpodemski Yes! you can do it. Good luck |
@dkarvounaris the tests are failing, could you please fix it? |
1_index_sitemap.xml
file. To run the test, execute the sitemap generation with the link for cron jobs given by the module on the CLI, as example:/usr/bin/curl -L --max-redirs 99 "http://shop.com/modules/gsitemap/gsitemap-cron.php?token=1234..."
and check that sitemap generation adds files past1_en_20_sitemap.xml
.Why this pull request?
When generating a sitemap using cron, a CLI tool like curl is typically employed to send an HTTP request to gsitemap-cron.php. For large sites with tens or hundreds of thousands of products, the module divides the sitemap files by size. However, the generation process halts after creating the 20th sitemap XML file, preventing the inclusion of all products in the sitemap. This pull request resolves this issue, enabling sitemap generation to proceed beyond the initial 20 files and ensuring that all products are added to the sitemap.
What is the root of the issue?
The original code evaluates every 20 files generated and, rather than simply redirecting back to the module for the generation of the subsequent sitemap XML file, which would work, it sends an HTTP header "Refresh: 5" instead of an HTTP redirect.
This header lacks documentation (note that this refers to an HTTP header, not the meta refresh tag, which however is deprecated too) and is absent from the HTTP standard. The Refresh HTTP header is a non-standard implementation from the Netscape era, which has only been maintained by browsers. Most, if not all, non-browser HTTP toolkits, such as curl, do not implement it. As a result, it is not guaranteed that using the Refresh HTTP header with tools other than browsers, which is almost always the case with cron jobs, will successfully complete the sitemap generation for very large shops. Rather the opposite, generation will stop with an unrecoverable error.
Sources:
curl/curl#3657
https://daniel.haxx.se/blog/2019/03/12/looking-for-the-refresh-header/
request/request#92
https://chromium-review.googlesource.com/c/chromium/src/+/1520146 (a possible intention to remove support even in chrome, because non-standard)
Moreover, there is no need or advantage to using this HTTP header for every 20th cron job redirect, as a simple redirect functions as intended, and the refresh – even if it were to work – serves no purpose.
Changes introduced by the pull request
There are no deprecations or backward compatibility breaks, and no new or otherwise modified code that could potentially disrupt existing functionality.
The only additional change involves simplifying the else statements and removing superfluous exit() statements for improved clarity and reduced complexity, as seen in this commit: dkarvounaris@a504413?diff=split&w=1
Furthermore, I have updated the documentation to include information on the possible need of adjusting the number of redirects followed by the tool used in conjunction with cron jobs. This is particularly useful for very large shops which may require more redirects than the tool's default redirect limit. Something which can go otherwise unnoticed and may be difficult to debug.