407 create cloud space #415
Conversation
Discussed the plan of action for the rest of the ticket with @jwestw. We would like to integrate the newly created GCP bucket into our script via a switch in the config (either cloud or local), so that we can decide whether to use cloud or local data. We will be using the generate_signed_url() function in GCP. There are two courses of action for this:
I will be testing out both methods and tidying the functions in data_ingest.py.
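A minimal sketch of what that cloud/local switch could look like, assuming a simple `data_source` key in the config (the key name and local folder layout are illustrative, not the final data_ingest.py implementation):

```python
# Sketch only: the "data_source" key and the local data/ layout are assumptions,
# not the final data_ingest.py implementation.
def resolve_data_location(filename: str, config: dict) -> str:
    """Decide where a data file is read from, based on a cloud/local switch in config."""
    if config.get("data_source") == "cloud":
        # In the cloud case this would be swapped for a signed URL produced by
        # generate_signed_url() on the bucket; a gs:// URI stands in as a placeholder.
        return f"gs://<bucket-name>/{filename}"
    return f"data/{filename}"

# Example: resolve_data_location("stops.csv", {"data_source": "local"}) -> "data/stops.csv"
```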
A lot of the work has now been completed. I have ticked the files below that are done:
We currently have a few caveats with the issues I've mentioned above. For some reason, we can't run SDG_scotland.py because of an issue when we use `from main import stops_geo_df`; I'm not sure why this happens, or what that import actually does. We are also importing the geo_df from local at the moment while we figure out why it isn't able to get the file from the Google bucket. Aside from these two issues, all scripts will run with the data from the bucket :)
The requirements of the ticket are now complete. We have an end-to-end process where we can use either local data on a local machine or cloud data from our GCP bucket. The ticket is now ready for review.
- Changed expiration time on URL
- Get correct URL or link from dict
- Take ext and return list or abs path
@paigeh1 and I have been reviewing this and it has gone quite well, as we have made a lot of fixes that allow the system to run with no local data present, entirely relying on the cloud-hosted data.
However, we are experiencing an error. This is what my local folder looks like (not sure if these files have just been downloaded, but I think they have).
Next steps:
I am re-running every pipeline after deleting not only the files but also the folders in the data folder. This has created a lot of problems, which I am solving. Successfully run:
I have also made main.py into a runner which runs all pipelines in order, and I have added a number of improvements that create folders if they don't exist.
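A rough sketch of the kind of runner and folder-creation behaviour described here; the pipeline names and folder layout are placeholders, not the project's actual main.py:

```python
# Sketch only: pipeline names and the data directory layout are placeholders.
import os

DATA_DIRS = ["data", "data/outputs"]  # placeholder folder layout


def ensure_data_dirs() -> None:
    """Create the expected data folders if they don't already exist."""
    for d in DATA_DIRS:
        os.makedirs(d, exist_ok=True)


def run_england() -> None:  # placeholder pipeline
    print("running England pipeline")


def run_scotland() -> None:  # placeholder pipeline
    print("running Scotland pipeline")


def main() -> None:
    """Run every pipeline in order, creating folders first."""
    ensure_data_dirs()
    for pipeline in (run_england, run_scotland):
        pipeline()


if __name__ == "__main__":
    main()
```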
Once the conflict is sorted, I would approve this.
All works for me! Happy to approve
Pull Request submission
This is the first stage in getting our data onto GCP which, although it might be slower to load the data initially, will negate the need for each user to have a complete copy of the data on their machine.
To keep the GCP account as secure as possible I have created a bucket which is not open to the public. I may be able to open it up later, but probably need to speak to somebody in cyber security (or similar) first. For now, there is a "service account" which allows reading and listing of the files in that particular bucket only. To load these credentials you'll need the key, which is a JSON file, and you will need to put it in the `secrets/` folder. I will instruct the reviewer(s) how to get the JSON file separately.
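Loading those credentials in Python would look roughly like this; the key file name and bucket name are placeholders, with the real values coming from the instructions shared with reviewers:

```python
# Sketch only: the key file name and bucket name are placeholders.
from google.cloud import storage
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "secrets/service-account-key.json"
)
client = storage.Client(credentials=credentials, project=credentials.project_id)
bucket = client.bucket("<bucket-name>")
```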
I believe this PR meets the following requirements
For point 2, I would say that I do not actually "mount the bucket as [a] drive". This deliverable was written when I didn't understand how to work with the bucket properly. You can in fact mount it, but it's not advisable and requires messing with settings at the OS level. Instead I am creating a GCP storage object in Python, which is the advised way of achieving a similar thing.
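For illustration, creating the storage object and listing what's in the bucket might look roughly like this (bucket name is a placeholder; credentials would be loaded from the `secrets/` key file as shown earlier, or picked up from the environment):

```python
# Sketch only: bucket name is a placeholder; credentials come from the
# secrets/ key file (or the environment), as described above.
from google.cloud import storage

client = storage.Client()

# List the files currently in the bucket; read/list is all the service account allows.
for blob in client.list_blobs("<bucket-name>"):
    print(blob.name)
```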
The changes I have made here are:
- A class called `GCPBucket` which creates a storage object with the correct credentials and bucket name. This creates a connection to enable the functions/methods to work.
- A `download_file` method added to the `GCPBucket` class that downloads a file from the bucket.
- A `generate_signed_url` method added to the `GCPBucket` class that generates a signed URL for a file that is valid for 5 minutes.

Closes or fixes
Closes #407
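For context, a minimal sketch of what a class along these lines could look like; this is not the PR's actual implementation in data_ingest.py, and the key path and bucket name are placeholders:

```python
# Minimal sketch of a GCPBucket-style class; not the PR's actual implementation.
# Key file path and bucket name are placeholders.
from datetime import timedelta

from google.cloud import storage
from google.oauth2 import service_account


class GCPBucket:
    """Connects to the project bucket using the service-account key."""

    def __init__(self, bucket_name: str, key_path: str = "secrets/key.json"):
        credentials = service_account.Credentials.from_service_account_file(key_path)
        self.client = storage.Client(
            credentials=credentials, project=credentials.project_id
        )
        self.bucket = self.client.bucket(bucket_name)

    def download_file(self, blob_name: str, destination: str) -> None:
        """Download a single file from the bucket to a local path."""
        self.bucket.blob(blob_name).download_to_filename(destination)

    def generate_signed_url(self, blob_name: str, minutes: int = 5) -> str:
        """Generate a signed URL for a file, valid for a few minutes."""
        return self.bucket.blob(blob_name).generate_signed_url(
            expiration=timedelta(minutes=minutes), version="v4"
        )
```

Usage would then be along the lines of `GCPBucket("<bucket-name>").generate_signed_url("some_file.csv")`.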
Code
Documentation
Any new code includes all the following forms of documentation:
- `parameters` and `returns` for all major functions

Data
Testing
Peer Review Section
requirements.txt
Final approval (post-review)
The author has responded to my review and made changes to my satisfaction.
Review comments
Insert detailed comments here!
These might include, but not exclusively:
- … that it is likely to interact with?)
- … works correctly?)
Your suggestions should be tailored to the code that you are reviewing.
Be critical and clear, but not mean. Ask questions and set actions.