Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(PPS-411): Add bucket selection support for multi part upload #1112

Merged
merged 16 commits into from
Apr 5, 2024

Conversation

BinamB
Copy link
Contributor

@BinamB BinamB commented Aug 15, 2023

Jira Ticket: PXP-xxxx

  • Remove this line if you've changed the title to (PXP-xxxx): <title>

cdis-data-client PR (uc-cdis/cdis-data-client#123)

New Features

  • Added upload bucket selection support for multipart upload
  • Add support for non-aws s3 buckets for multipart upload

Breaking Changes

Bug Fixes

Improvements

Dependency updates

Deployment changes

@@ -176,17 +177,9 @@ def upload_data_file():
)

protocol = params["protocol"] if "protocol" in params else None
bucket = params.get("bucket")
bucket = params.get("bucket", None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If should already default to None, you don't need this

logger.debug(f"Bucket '{bucket}' not in ALLOWED_DATA_UPLOAD_BUCKETS config")
raise Forbidden(f"Uploading to bucket '{bucket}' is not allowed")

_is_allowed_data_upload_bucket_configured(bucket)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this function already exists, but the name is a bit confusing. The prefix "is" usually implies a boolean return. Since this is a check I would change this to something like "verify_data_upload_bucket_configuration"

@@ -257,12 +256,18 @@ def generate_multipart_upload_presigned_url():
default=default_expires_in,
)

bucket = params.get("bucket", None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.get() already returns None if it's not found, so the , None is redundant. You could just .get("bucket")

if bucket:
_is_allowed_data_upload_bucket_configured(bucket)

# TODO add bucket param here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a real TODO? like we need to do something in the future, can we do it now?

If this is just a note for the future in case we need to add things, don't use TODO, just say note: or something

@@ -285,6 +290,7 @@ def complete_multipart_upload():
raise UserError("missing required arguments: {}".format(list(missing)))

default_expires_in = flask.current_app.config.get("MAX_PRESIGNED_URL_TTL", 3600)
bucket = params.get("bucket", None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as prev comments, can just use .get(

bucket:
type: string
required: false
description: bucket to upload
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
description: bucket to upload
description: bucket to upload to

bucket:
type: string
required: false
description: bucket to upload
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
description: bucket to upload
description: bucket to upload to

tests/data/test_data.py Show resolved Hide resolved
raise InternalError(
"fence not configured with data upload bucket; can't create signed URL"
if bucket:
s3_buckets = get_value(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this logic exactly what _is_allowed_data_upload_bucket_configured(bucket) is doing? Can you reuse that here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you still left a chunk of code you don't need. If you're calling verify you don't need to get the s3_buckets here

@@ -9,6 +9,7 @@
BlankIndex,
IndexedFile,
get_signed_url_for_file,
_is_allowed_data_upload_bucket_configured,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't/shouldn't import private functions or variables. The prefixed _ implies private in Python naming conventions so if you need the function to be used here, you need to rename it to remove the _ prefix and add a docstring

@BinamB BinamB changed the title (feat): Add bucket selection support for multi part upload (PPS-411): Add bucket selection support for multi part upload Oct 18, 2023
fence/blueprints/data/indexd.py Show resolved Hide resolved
from fence.errors import InternalError

MAX_TRIES = 5

logger = get_logger(__name__)


def initilize_multipart_upload(bucket, key, credentials):
def initilize_multipart_upload(bucket_name, key, credentials):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor name for typo: initialize, not initilize

fence/blueprints/data/blueprint.py Show resolved Hide resolved
if not bucket:
try:
bucket = flask.current_app.config["DATA_UPLOAD_BUCKET"]
except KeyError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this error will never happen b/c DATA_UPLOAD_BUCKET is part of the default config, so it will always be set, it might just be ''. And this isn't catching the case where the bucket is actually empty ''.

So you can simplify/fix this whole block with:

bucket = bucket or flask.current_app.config["DATA_UPLOAD_BUCKET"]

if not bucket:
    raise InternalError("fence not configured with data upload bucket; can't create signed URL")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a test that deletes that configuration so this will fail. https://github.com/uc-cdis/fence/blob/master/tests/data/test_blank_index.py#L351
I don't think it gets another default value.
Should I just set the test to empty the field rather than delete the config since that would be more realistic?


def verify_data_upload_bucket_configuration(bucket):
s3_buckets = get_value(
flask.current_app.config,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to use get_value b/c ALLOWED_DATA_UPLOAD_BUCKETS is part of the default configuration, which guarantees its existence in the config dictionary. If you're trying to check if it's an empty list, then this isn't catching that either.

s3_buckets = flask.current_app.config["ALLOWED_DATA_UPLOAD_BUCKETS"]

if not s3_buckets:
    raise InternalError("ALLOWED_DATA_UPLOAD_BUCKETS not configured")

raise InternalError(
"fence not configured with data upload bucket; can't create signed URL"
if bucket:
s3_buckets = get_value(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you still left a chunk of code you don't need. If you're calling verify you don't need to get the s3_buckets here

verify_data_upload_bucket_configuration(bucket)
else:
try:
bucket = flask.current_app.config["DATA_UPLOAD_BUCKET"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment about default configs. this will never KeyError

else:
try:
bucket = flask.current_app.config["DATA_UPLOAD_BUCKET"]
except KeyError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment

@@ -95,7 +125,7 @@ def complete_multipart_upload(bucket, key, credentials, uploadId, parts):


def generate_presigned_url_for_uploading_part(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are all positive tests. I know they were here before, but can you add some for the error cases like when DATA_UPLOAD and/or ALLOWED_DATA_UPLOAD_BUCKETS buckets aren't specified / are empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like my previous comment might resolve this test case? 🤔

@lbeckman314
Copy link
Contributor

lbeckman314 commented Jan 22, 2024

Thanks @BinamB for your patience, working on the unit tests issues blocking #1127 which targets this branch (let us know if we should specify a different branch to merge into!).

I will update here and in Slack when the tests are running successfully and it's ready for review!

MaribelleHGomez
MaribelleHGomez previously approved these changes Apr 2, 2024
@BinamB BinamB force-pushed the feat/multi_part_bucket_param branch from bbaa9bb to 517566d Compare April 4, 2024 15:32
@BinamB BinamB merged commit 558ac4e into master Apr 5, 2024
9 checks passed
@BinamB BinamB deleted the feat/multi_part_bucket_param branch April 5, 2024 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants