Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(PPS-411): Add bucket selection support for multi part upload #123

Merged
merged 17 commits into from
Jan 22, 2024

Conversation

BinamB
Copy link
Contributor

@BinamB BinamB commented Aug 15, 2023

Jira Ticket: PXP-xxxx

  • Remove this line if you've changed the title to (PXP-xxxx): <title>

Fence PR (uc-cdis/fence#1112)

New Features

  • Add bucket selection support for multi part upload
  • Add bucket selection support for single part upload
  • Add bucket selection support for upload-multiple
  • Add multi-part support for upload-multiple

Breaking Changes

Bug Fixes

Improvements

Dependency updates

Deployment changes

@BinamB BinamB changed the title Add bucket param for multipart upload (feat): Add bucket param for multipart upload Aug 15, 2023
@coveralls
Copy link

coveralls commented Aug 15, 2023

Pull Request Test Coverage Report for Build 7589691499

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.1%) to 17.83%

Totals Coverage Status
Change from base Build 5868373361: -0.1%
Covered Lines: 498
Relevant Lines: 2793

💛 - Coveralls

@BinamB BinamB changed the title (feat): Add bucket param for multipart upload (feat): Add bucket selection support for multi part upload Aug 15, 2023
@BinamB BinamB marked this pull request as ready for review August 16, 2023 19:58
@@ -105,6 +105,7 @@ type RetryObject struct {
GUID string
RetryCount int
Multipart bool
Bucket *string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go conventions prefer the type to be all lined up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why not just use a string instead of a string pointer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using just string was giving me some weird errors when I tried to build it. Im not too familiar with it but I can change it to a string and give it another Go. 🥁

@@ -79,6 +81,10 @@ func retryUpload(failedLogMap map[string]commonUtils.RetryObject) {
log.Printf("Sleep for %.0f seconds\n", GetWaitTime(ro.RetryCount).Seconds())
time.Sleep(GetWaitTime(ro.RetryCount)) // exponential wait for retry

if ro.Bucket !=nil || *ro.Bucket != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be a space != nil

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this is a little confusing. I think if Bucket was just a string like GUID this would be a little easier.

@@ -38,6 +38,7 @@ type ManifestObject struct {
// InitRequestObject represents the payload that sends to FENCE for getting a singlepart upload presignedURL or init a multipart upload for new object file
type InitRequestObject struct {
Filename string `json:"file_name"`
Bucket *string `json:"bucket,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a string pointer and not just a string?

@@ -53,6 +53,8 @@ func retryUpload(failedLogMap map[string]commonUtils.RetryObject) {
var guid string
var presignedURL string
var err error
var bucketPtr string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is suffixed with Ptr but it doesn't look like it's a pointer? It's just a string

@@ -57,13 +58,15 @@ type MultipartUploadRequestObject struct {
Key string `json:"key"`
UploadID string `json:"uploadId"`
PartNumber int `json:"partNumber"`
Bucket *string `json:"bucket,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here. The pattern for other variables here seem to be to use the raw type (string, int) rather than a pointer

}

// MultipartCompleteRequestObject represents the payload that sends to FENCE for completeing a multipart upload
type MultipartCompleteRequestObject struct {
Key string `json:"key"`
UploadID string `json:"uploadId"`
Parts []MultipartPartObject `json:"parts"`
Bucket *string `json:"bucket,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment re: pointers

request := new(jwt.Request)
configure := new(jwt.Configure)
function := new(jwt.Functions)

function.Config = configure
function.Request = request

multipartCompleteObject := MultipartCompleteRequestObject{Key: key, UploadID: uploadID, Parts: parts}
var bucketPtr *string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you just passed through the bucketName as a string you wouldn't have to do all this every time

@BinamB BinamB changed the title (feat): Add bucket selection support for multi part upload (PPS-411): Add bucket selection support for multi part upload Oct 18, 2023
if batch {
if forceMultipart {
uploadPath := []string{furObjects[i].FilePath}
_, multipartFilePaths := validateFilePath(uploadPath, forceMultipart)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validateFilePath returns two arrays (singlepartFilePaths, multipartFilePaths). It looks as if the singlepartFilePaths array is ignored. @BinamB

if len(batchFURObjects) < workers {
batchFURObjects = append(batchFURObjects, furObject)
} else {
batchUpload(gen3Interface, batchFURObjects, workers, respCh, errCh)
batchUpload(gen3Interface, batchFURObjects, workers, respCh, errCh, bucketName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I don't believe multipartUpload gets called at all if part of a batch? @BinamB

@lbeckman314
Copy link

lbeckman314 commented Nov 2, 2023

Hey @BinamB thank you for your work adding support for small files with the multipart upload! The files are being uploaded successfully to the bucket which is fantastic, unfortunately I'm encountering a couple of potential issues during this process (that might be due to my testing setup) that could interfere with potential end users:

My Testing Setup

Testing directory and files:

.
├── DATA
│   ├── file-1MB.1.txt
│   ├── file-1MB.2.txt
│   ├── file-1MB.3.txt
│   ├── ...
│   └── file-1MB.99.txt
└── file-manifest.json

2 directories, 101 files

Command:

gen3-client upload-multiple --profile=development \
    --manifest=./file-manifest.json \
    --bucket aced-ohsu-development \
    --upload-path . \
    --force-multipart \
    --batch \
    --numparallel=10

File manifest sample (for one file):

[
  {
    "object_id": "bda45761-f6d5-543c-82ab-e459b94b96c1",
    "file_name": "DATA/file-1MB.1.txt",
    "size": 1048576,
    "modified": "2023-11-02T11:11:10.707559",
    "md5": "b6d81b360a5672d80c27430f39153e2c",
  }
 ]

Issue: Slow file validation

The upload process tries to validate files that aren't in the file manifest which can result in a somewhat long time (a few minutes) to start the actual upload. Could this be due to the following logic in upload-multiple.go?

The filePaths variable includes files that are not in the file manifest (but are in the upload path), resulting in attempts to validate/upload extra files:

upload-multiple.go#L87-L95

for i, furObject := range furObjects {
    if forceMultipart {
        uploadPath, _ := commonUtils.GetAbsolutePath(uploadPath)
        filePaths, err := commonUtils.ParseFilePaths(uploadPath, false)  <-- Includes non-manifest files?
        if err != nil {
            log.Fatalf("Error when parsing file paths: " + err.Error())
        }

        singlepartFilePaths, multipartFilePaths := validateFilePath(filePaths, forceMultipart)

Issue: Extra files uploaded

The upload process started much quicker after removing all files in the testing directory that are not included in the file manifest.

However it seems like the file manifest itself was uploaded judging by the resulting output?

2023/11/02 12:49:15 File "<TESTING DIRECTORY>/file-manifest.json" has been found in local submission history and has been skipped for preventing duplicated submissions.

Confirmed by adding a "Secret" file to the upload path (but not to the manifest):

echo "Secret password" > DATA/secret.txt

Rerunning the gen3-client command uploaded this "extra" file:

~/.gen3/logs/development_succeeded_log.json:

...
"DATA/secret.txt": "PREFIX/bf46fc01-0d27-43ca-b014-0af89334d94d"
...

Issue: Erroneous "Skipped" log after successful upload

The test files themselves were uploaded successfully, but the resulting output included the following error for each file (despite removing all upload logs prior to the test: rm ~/.gen3/logs/development_succeeded_log.json):

2023/11/02 12:49:15 File "<TESTING DIRECTORY>/file-1MB.99.txt" has been found in local submission history and has been skipped for preventing duplicated submissions.

Potential Tester Error?

Let me know if I'm missing something big or small in my testing setup! I've been known to miss the forest for the trees so it could be something obvious, can also provide any additional info on the testing process. Thank you!

}
if forceMultipart {
uploadPath, _ := commonUtils.GetAbsolutePath(uploadPath)
// filePaths, err := commonUtils.ParseFilePaths(uploadPath, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not leave dead code, if we don't need this, remove it

Copy link
Contributor

@paulineribeyre paulineribeyre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i forgot to post comments from when i started reviewing this

gen3-client/commonUtils/commonUtils.go Show resolved Hide resolved
gen3-client/g3cmd/upload-multiple.go Show resolved Hide resolved
gen3-client/g3cmd/upload-multiple.go Outdated Show resolved Hide resolved
BinamB and others added 6 commits January 8, 2024 09:43
* Use file_name provided in manifest

* Use file_name provided in manifest

* Ensure file_name used in url

* Ensure bucket passed to GeneratePresignedURL
@paulineribeyre paulineribeyre self-requested a review January 8, 2024 18:12
gen3-client/g3cmd/upload-multiple.go Outdated Show resolved Hide resolved
gen3-client/g3cmd/upload-single.go Outdated Show resolved Hide resolved
gen3-client/g3cmd/upload.go Outdated Show resolved Hide resolved
gen3-client/g3cmd/upload-multiple.go Outdated Show resolved Hide resolved
gen3-client/g3cmd/upload-multiple.go Outdated Show resolved Hide resolved
gen3-client/g3cmd/upload-multiple.go Outdated Show resolved Hide resolved
gen3-client/g3cmd/upload-multiple.go Show resolved Hide resolved
endPointPostfix := commonUtils.FenceDataUploadEndpoint + "/" + furObject.GUID
msg, err := g3.DoRequestWithSignedHeader(&profileConfig, endPointPostfix, "", nil)
if furObject.PresignedURL == "" {
endPointPostfix := commonUtils.FenceDataUploadEndpoint + "/" + furObject.GUID + "?file_name=" + url.QueryEscape(furObject.Filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did this ever work before? the file_name param is required in the fence data upload endpoint 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not sure. I followed the code and there are cases for a nil body but it ends here and still sends this request somehow
https://github.com/uc-cdis/cdis-data-client/blob/master/gen3-client/jwt/functions.go#L66

gen3-client/g3cmd/utils.go Outdated Show resolved Hide resolved
gen3-client/g3cmd/utils.go Outdated Show resolved Hide resolved
@paulineribeyre paulineribeyre self-requested a review January 22, 2024 16:42
@Avantol13 Avantol13 dismissed their stale review January 22, 2024 17:58

already reviewed and approved by someone else

@BinamB BinamB merged commit b9b3605 into master Jan 22, 2024
8 checks passed
@BinamB BinamB deleted the feat/bucket_param_multipart branch January 22, 2024 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants