
407 create cloud space #415

Merged
48 commits merged into master on Dec 22, 2023

Conversation

@james-westwood commented Apr 14, 2023

Pull Request submission

This is the first stage in getting our data onto GCP. Although loading the data may be slower initially, it will remove the need for each user to keep a complete copy of the data on their machine.

To keep the GCP account as secure as possible, I have created a bucket which is not open to the public. I may be able to open it up later, but I probably need to speak to somebody in cyber security (or similar) first. For now, there is a "service account" which allows reading and listing of the files in that particular bucket only. To load these credentials you will need the key, which is a JSON file, and you will need to put it in the secrets/ folder.

I will instruct the reviewer(s) separately on how to get the JSON file.

I believe this PR meets the following requirements

  1. Create a Google bucket

  2. Change the data source to the bucket - mount the bucket as a drive

  3. Test that data loads
    

For point 2, I would say that I do not actually "mount the bucket as [a] drive". This deliverable was written when I didn't understand how to work with the bucket properly. In fact you can mount it, but it's not advisable and requires messing with settings at the OS level. Instead I am creating a GCP storage object in Python, which is the advised way of achieving the same thing.
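For illustration, creating the storage object looks roughly like this (the key file name and bucket name below are placeholders, not the real values, which are shared separately):

```python
from google.cloud import storage

# Placeholder names for illustration only; the real key file is shared
# separately and the real bucket name lives in the project config.
KEY_PATH = "secrets/service_account_key.json"
BUCKET_NAME = "example-sdg-data-bucket"

# Build a client authenticated with the read/list-only service account,
# then get a handle on the (non-public) bucket.
client = storage.Client.from_service_account_json(KEY_PATH)
bucket = client.bucket(BUCKET_NAME)
```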

The changes I have made here are:

  • Added a new class called GCPBucket which creates a storage object with the correct credentials and bucket name. This creates the connection that the other methods rely on (see the sketch after this list).
  • Added a new method called download_file to the GCPBucket class that downloads a file from the bucket.
  • Added a new method called generate_signed_url to the GCPBucket class that generates a signed URL for a file that is valid for 5 minutes.
  • Added a new method called get_file_list to the GCPBucket class that returns a list of all the files in the bucket as "blobs".
  • Added a new method called get_file_names to the GCPBucket class that prints and returns a list of all the file names in the bucket.
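
In outline, the class looks roughly like this (simplified; the defaults and exact signatures here are placeholders and may differ slightly from the committed code):

```python
from datetime import timedelta

from google.cloud import storage


class GCPBucket:
    """Thin wrapper around the private GCP bucket (simplified sketch)."""

    def __init__(self, key_path="secrets/service_account_key.json",
                 bucket_name="example-sdg-data-bucket"):
        # Authenticate with the read/list-only service account key.
        self.client = storage.Client.from_service_account_json(key_path)
        self.bucket = self.client.bucket(bucket_name)

    def download_file(self, blob_name, destination):
        """Download a single file from the bucket to a local path."""
        self.bucket.blob(blob_name).download_to_filename(destination)

    def generate_signed_url(self, blob_name, minutes=5):
        """Generate a signed URL for a file, valid for 5 minutes by default."""
        blob = self.bucket.blob(blob_name)
        return blob.generate_signed_url(
            version="v4", expiration=timedelta(minutes=minutes), method="GET"
        )

    def get_file_list(self):
        """Return all the files in the bucket as blob objects."""
        return list(self.bucket.list_blobs())

    def get_file_names(self):
        """Print and return the names of all the files in the bucket."""
        names = [blob.name for blob in self.get_file_list()]
        for name in names:
            print(name)
        return names
```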

Closes or fixes

Closes #407

Code

  • Requirements My/our code matches the requirements of the ticket
  • Functionality: New functions meet requirements in issue ticket
  • Compliant Code Code is as PEP 8 compliant as I can humanly make it
  • Code runs The code runs on my machine
  • Clean Code
    • Code has been linted (use your favourite linter)
    • Code adheres to DRY

Documentation

Any new code includes all the following forms of documentation:

  • Function Documentation Docstrings within the function definition(s) have been created
    • Includes parameters and returns for all major functions
    • Includes data types
  • Updated Documentation: Working doc has been updated and a new ticket to update the master documentation has been created.

Data

  • All data needed to run this script has been added to Sync

Testing

  • Unit tests Unit tests have been created and are passing or a new ticket to create tests has been created

Peer Review Section

  • All requirements install from (updated) requirements.txt
  • Documentation has been created and is clear - check the Google Doc
  • Docstrings (Google format) have been created and accurately describe the function's functionality
  • Unit tests pass, or if not present a new ticket to create tests has been created
  • Code runs The code runs on reviewer's machine

Final approval (post-review)

The author has responded to my review and made changes to my satisfaction.

  • I recommend merging this request.

Review comments

Insert detailed comments here!

These might include, but not exclusively:

  • bugs that need fixing (does it work as expected? and does it work with other code
    that it is likely to interact with?)
  • alternative methods (could it be written more efficiently or with more clarity?)
  • documentation improvements (does the documentation reflect how the code actually works?)
  • additional tests that should be implemented (do the tests effectively assure that it
    works correctly?)
  • code style improvements (could the code be written more clearly?)

Your suggestions should be tailored to the code that you are reviewing.
Be critical and clear, but not mean. Ask questions and set actions.

@james-westwood linked an issue Apr 14, 2023 that may be closed by this pull request
@jwestw requested review from nkshaw23 and paigeh-fsa April 14, 2023 18:51
@paigeh-fsa

Discussed the plan of action for the rest of the ticket with @jwestw. We would like to integrate the GCP bucket we created into our script via a switch in the config (either cloud or local), so that we can decide whether to use cloud or local data.

We will be using the function generate_signed_url() in GCP.

There are two courses of action for this:

  1. We could create two modules (local_ingest.py, cloud_ingest.py) which would have exactly the same function names but be tailored to either the cloud or the local machine. For example, we would use a URL to ingest data from the cloud, but a file path for a local machine.

  2. The above option may mean we have WET code with a lot of duplication and minimal changes. We could, instead, write a function such as path_or_url() which, depending on the config switch, would create the file path or URL for a given dataset. This would mean we wouldn't be repeating code (see the sketch at the end of this comment).

I will be testing out both methods and tidying the functions in data_ingest.py.
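
Roughly, the second option could look something like this (the config keys, folder names and file name are placeholders, not final):

```python
import os


def path_or_url(file_name, config, bucket=None):
    """Return either a local file path or a signed URL for a dataset,
    depending on the cloud/local switch in the config.
    """
    if config.get("data_source") == "cloud":
        # Cloud mode: short-lived signed URL from the private bucket.
        return bucket.generate_signed_url(file_name)
    # Local mode: plain file path under the local data folder.
    return os.path.join(config.get("data_dir", "data"), file_name)


# Illustrative usage:
# df = pd.read_csv(path_or_url("stops.csv", config, bucket=GCPBucket()))
```

Because pandas readers accept both file paths and URLs, the rest of data_ingest.py should not need to change regardless of which way the switch is set.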

@paigeh-fsa

Have run into an error below:

[Screenshot 2023-08-11 at 14:20:00]

There appears to be an error when we try to load the data from the cloud. Need to dig a bit deeper into why this is happening but leaving this for now.

paigeh-fsa commented Aug 11, 2023

Have another issue with the SDG_scotland.py script:
[Screenshot 2023-08-11 at 15:36:10]

Run main before SDG_scotland.py to hopefully resolve this issue.

paigeh-fsa commented Aug 11, 2023

A lot of the work has now been completed. I have ticked the files below that I've finished:

  • SDG_bus_timetable.py
  • SDG_train_timetable.py
  • SDG_ni.py
  • SDG_scotland.py
  • eng_wales_pre_process.py
  • SDG_eng_wales.py

There are a couple of caveats relating to the issues mentioned above. For some reason we can't run SDG_scotland.py because of an issue when we use from main import stops_geo_df; I'm not sure why this happens, and I'm not actually sure what that import does. We are also currently importing the geo_df from local (see the screenshot above) while we figure out why it can't get the file from the Google bucket. Aside from these two issues, all scripts will run with the data from the bucket :)

paigeh-fsa commented Aug 18, 2023

The requirements of the ticket are now complete.

We have an end-to-end process where we can use either local data on a local machine or cloud data from our GCP bucket.

Ticket is now ready for review.

@paigeh-fsa marked this pull request as ready for review August 18, 2023 13:50
jwestw commented Oct 20, 2023

@paigeh1 and I have been reviewing this and it has gone quite well, as we have made a lot of fixes that allow the system to run with no local data present, entirely relying on the cloud-hosted data.
We have successfully run:

  • SDG_bus_timetable.py
  • SDG_train_timetable.py
  • eng_wales_pre_process.py
  • SDG_eng_wales.py
  • SDG_scotland.py

However, we are experiencing an error when running SDG_northern_ireland.py:

[screenshot of the error]

And this is what my local folder looks like (I'm not sure whether these files have just been downloaded, but I think they have):

[screenshot of the local data folder]

@paigeh-fsa

Next steps:

  • Double-check that SDG_northern_ireland.py runs
  • Resolve conflicts
  • @paigeh1 @jwestw run the whole pipeline again without data to see if the cloud run works.

jwestw commented Nov 24, 2023

I am re-running every pipeline after deleting not only the files but also the folders in the data folder. This has created a lot of problems, which I am solving.

Successfully run:

  • SDG_train_timetable.py
  • SDG_bus_timetable.py
  • eng_wales_pre_process.py
  • SDG_eng_wales.py
  • SDG_scotland.py
  • SDG_northern_ireland.py

I have also made main.py into a runner which runs all pipelines in order.

I have also added a lot of improvements that create folders if they don't exist.
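
In outline, the runner and the folder-creation helper look roughly like this (simplified; the folder name is a placeholder, the pipeline order is taken from the list above, and the actual main.py may be structured differently):

```python
import subprocess
import sys
from pathlib import Path

# Pipelines in the order listed above (assumed; the real runner defines the order).
PIPELINES = [
    "SDG_train_timetable.py",
    "SDG_bus_timetable.py",
    "eng_wales_pre_process.py",
    "SDG_eng_wales.py",
    "SDG_scotland.py",
    "SDG_northern_ireland.py",
]


def ensure_dir(path):
    """Create a folder (and any missing parents) if it doesn't already exist."""
    Path(path).mkdir(parents=True, exist_ok=True)


def main():
    ensure_dir("data")  # assumed data folder; downloaded files land here
    for script in PIPELINES:
        # Run each pipeline with the current Python interpreter and stop on failure.
        subprocess.run([sys.executable, script], check=True)


if __name__ == "__main__":
    main()
```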

@jwestw self-requested a review November 24, 2023 19:50
@jwestw left a comment

Once the conflict is sorted, I would approve this.

@paigeh-fsa left a comment

All works for me! Happy to approve

@paigeh-fsa merged commit 4b159f0 into master Dec 22, 2023
2 checks passed