Import Open Food Facts Ingredients without Mongo #1540
Conversation
Cool idea! How does this work performance/memory-wise? (Also, my first thought would be to download the dump somewhere very obvious, so that we don't forget to delete it afterwards.)
My understanding is that, since the gzipped file is opened as a stream, it doesn't load the entire file into memory. The code just loops over the lines as they are decompressed on the fly. Also, the download itself is done in chunks, so memory-wise I don't expect big issues here. The gzipped archive is kept for now (and not re-downloaded if the command is run again), but it could be an option to automatically remove it after successfully loading the content into the database. I haven't tried it yet, but ideally one would do regular (e.g. weekly) updates and use the Open Food Facts delta files. I suspect they use the same format, and since they are much smaller they could be applied incrementally with little effort. In that more advanced setup, it would make sense to add a new database table so that the Open Food Facts import events are registered and can be used to decide which files to download.
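A minimal sketch of the streaming approach described above (function names, the archive path, and the chunk size are illustrative, not the PR's actual code):

```python
# Illustrative sketch only: download the dump in chunks, then iterate over
# the decompressed lines one at a time, so neither the download nor the
# decompression ever holds the whole file in memory.
import json
from gzip import GzipFile
from pathlib import Path

import requests

OFF_URL = 'https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz'
DUMP_PATH = Path('openfoodfacts-products.jsonl.gz')


def download_dump():
    # Skip the download if the archive is already present
    # (the PR keeps the archive around between runs).
    if DUMP_PATH.exists():
        return
    with requests.get(OFF_URL, stream=True) as response:
        response.raise_for_status()
        with DUMP_PATH.open('wb') as fp:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                fp.write(chunk)


def iter_products():
    # GzipFile decompresses lazily, so this yields one product dict per
    # JSONL line without materialising the full dump.
    with GzipFile(str(DUMP_PATH), 'rb') as gz:
        for line in gz:
            yield json.loads(line)
```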
Using the delta files would indeed be a huge improvement; at the moment the import is run... very sporadically. I would even suggest that we don't need a new table: just writing the date of the last import into a text file in the script folder, or something similarly simple, would be enough. In any case, I'll try to take a look at your changes and run them on my machine today.
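A hypothetical sketch of that simpler approach (the file name and location are made up for illustration):

```python
from datetime import date
from pathlib import Path

# Hypothetical location: a plain text file next to the import script,
# instead of a new database table.
LAST_IMPORT_FILE = Path(__file__).parent / 'last-off-import.txt'


def read_last_import():
    """Return the date of the last import, or None if none is recorded."""
    if not LAST_IMPORT_FILE.exists():
        return None
    return date.fromisoformat(LAST_IMPORT_FILE.read_text().strip())


def write_last_import(day):
    """Record a successful import run."""
    LAST_IMPORT_FILE.write_text(day.isoformat())
```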
```diff
-for product in db.products.find(
-    {
+if options['usejsonl']:
+    products = self.products_jsonl(languages=list(languages.keys()), completeness=self.completeness)
```
```python
import json
import requests
from gzip import GzipFile

off_url = 'https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz'
```
Should this be a default in the env file? That way, if the URL changes, people can update it with a hotfix to the env file instead of pulling in a new release.
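A minimal sketch of that suggestion, assuming a plain environment variable (the variable name `WGER_OFF_JSONL_URL` is made up for illustration):

```python
# Hypothetical: let deployments override the dump URL via the environment,
# falling back to the current hard-coded default.
import os

off_url = os.environ.get(
    'WGER_OFF_JSONL_URL',
    'https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz',
)
```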
I was thinking that only "upstream" wger regularly does this, and other instances sync from us, like with the exercises. Then we don't generate so much traffic for OFF (and don't need to respond as fast to such changes, if they ever occur).
Since I was working on a new USDA import, I have merged this branch into mine: #1666. Closing here.
Proposed Changes
Currently, as I understand it, downloading the ingredients from the Open Food Facts server requires an import via a running Mongo server. This pull request avoids using Mongo and unpacking the database file: it directly uses the JSONL data dump format as documented here, and loops over the JSON entries extracted from a Gzip stream.
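For illustration, each line of the dump is a self-contained JSON document (`code`, `product_name`, and `lang` are real Open Food Facts fields, but the values below are invented and real entries carry many more fields):

```python
import json

# One invented line of the JSONL dump, parsed independently of all others.
line = '{"code": "0000000000000", "product_name": "Example Noodles", "lang": "en"}'
product = json.loads(line)
print(product['product_name'])  # -> Example Noodles
```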
The pull request adds an option to the management command import-off-products.py.
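A hedged sketch of how such an option might be registered; the flag name is inferred from `options['usejsonl']` in the diff and may not match the PR exactly:

```python
# Sketch only: Django management commands declare CLI options in
# add_arguments(); options['usejsonl'] in handle() then reflects the flag.
def add_arguments(self, parser):
    parser.add_argument(
        '--usejsonl',
        action='store_true',
        dest='usejsonl',
        default=False,
        help='Import from the JSONL dump instead of a running Mongo server',
    )
```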
Furthermore, during the import I noticed that many entries for `license_authors` were too long for the varchar(600) field, so this pull request also changes that field to a TextField so it can accommodate longer strings. This means the pull request does require a database migration.

Please check that the PR fulfills these requirements
(I have not done this yet, as I wanted to solicit your thoughts first.)
Other questions
Does this PR introduce a breaking change (e.g. database migration)?
Yes, the `license_authors` field has changed from varchar(600) to TextField.
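A hedged sketch of the corresponding migration (the app label, model name, and migration dependency below are placeholders, not taken from the PR):

```python
# Sketch of the AlterField migration implied by the varchar(600) -> TextField
# change; names below are placeholders, not the PR's actual migration.
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ('nutrition', '0001_initial'),  # placeholder dependency
    ]

    operations = [
        migrations.AlterField(
            model_name='ingredient',
            name='license_authors',
            field=models.TextField(),  # previously CharField(max_length=600)
        ),
    ]
```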