Request for read access to embargoed assets on DANDI Open Data bucket #7

Open · kabilar opened this issue Jan 30, 2025 · 18 comments
kabilar commented Jan 30, 2025

Hi @satra, as we are working to download all the data from S3 to MIT Engaging, I need read access to all embargoed data. I am currently encountering the following error with s3invsync:

...
2025-01-30T15:18:48.215145334-05:00 DEBUG process_item{url=s3://dandiarchive/blobs/000/09b/00009ba5-2a5c-48f9-80bb-5225e3a4ad53?versionId=XSGTg8XtMu.ickBuN4QpgPpmij9U44an}:download_item:cleanup_download_path{path=./blobs/000/09b/00009ba5-2a5c-48f9-80bb-5225e3a4ad53}: s3invsync::syncer: Finished cleaning up unfinished download file
2025-01-30T15:18:48.215177823-05:00  INFO process_item{url=s3://dandiarchive/blobs/000/09b/00009ba5-2a5c-48f9-80bb-5225e3a4ad53?versionId=XSGTg8XtMu.ickBuN4QpgPpmij9U44an}: s3invsync::syncer: Finished processing object
Error: failed to get object at s3://dandiarchive/blobs/000/162/0001628a-30b9-4bd2-b7c8-0629c91926ae?versionId=93syFLQ5I1pfmGowVrRPw.QDSxxsVm8f
Caused by:
    0: service error
    1: unhandled error (AccessDenied)
    2: Error { code: "AccessDenied", message: "Access Denied", s3_extended_request_id: "yOQ9Cp+tCMmDZWXQrCBXGI/LY/LyrkrsjZGT/9PMrD6RAIXsbuTw6dzaF1KC+oR7VZxLJGz2CpM=", aws_request_id: "MN935K4Q2C5AW16M" }

Would you be able to provide me with read permissions to all data on the s3://dandiarchive/ bucket? Thanks.
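For reference, the failing request can be reproduced outside s3invsync with a few lines of boto3 (a sketch; the bucket, key, and version ID are copied from the error log above):

# Sketch: reproduce the failing versioned GET outside s3invsync.
# Bucket, Key, and VersionId are taken from the error log above.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    s3.get_object(
        Bucket="dandiarchive",
        Key="blobs/000/162/0001628a-30b9-4bd2-b7c8-0629c91926ae",
        VersionId="93syFLQ5I1pfmGowVrRPw.QDSxxsVm8f",
    )
except ClientError as e:
    print(e.response["Error"]["Code"])  # prints "AccessDenied" without read access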

kabilar commented Feb 3, 2025

Hi @satra, following up here. @yarikoptic @waxlamp Do either of you happen to have admin permissions for this AWS account? Thanks.

waxlamp commented Feb 3, 2025

I believe I do, after a fashion. Let's meet to come up with the right solution here.

yarikoptic commented

Sorry, I don't think I have creds to manage the Open Data bucket. I do have IAM creds, though, which allow for data access. I guess what we need here is a dedicated "read-only" backup IAM user.

satra commented Feb 3, 2025

Provided separately. However, let's also make sure s3invsync doesn't error out for access denied; it should simply report that some files were not synced because of permissions and log them somewhere.
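The requested behavior might look like the following pattern (sketched in Python for brevity; s3invsync itself is written in Rust, and the names here are illustrative, not its actual internals):

# Sketch of the requested behavior: skip AccessDenied objects, log them,
# and keep syncing instead of aborting. Function names are illustrative.
import logging
import boto3
from botocore.exceptions import ClientError

log = logging.getLogger("sync")
s3 = boto3.client("s3")

def sync_object(bucket: str, key: str, version_id: str) -> bool:
    """Return True if synced, False if skipped due to permissions."""
    try:
        s3.get_object(Bucket=bucket, Key=key, VersionId=version_id)
        # ... write the body to disk ...
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "AccessDenied":
            log.warning("skipped (AccessDenied): s3://%s/%s?versionId=%s",
                        bucket, key, version_id)
            return False
        raise  # any other error still aborts the sync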

kabilar commented Feb 5, 2025

> Provided separately.

Thanks. I have started the download to /orcd/data/dandi/001/s3dandiarchive.

> However, let's also make sure s3invsync doesn't error out for access denied; it should simply report that some files were not synced because of permissions and log them somewhere.

In our use case, I do like that it errors out by default if we don't provide credentials since we need all open and embargoed assets. I will file an issue to add an option to not error out.

kabilar commented Feb 5, 2025

Hi @satra, it looks like the credentials don't allow accessing specific S3 object versions. See the error message below:

Error: failed to get object at s3://dandiarchive/blobs/000/162/0001628a-30b9-4bd2-b7c8-0629c91926ae?versionId=93syFLQ5I1pfmGowVrRPw.QDSxxsVm8f

Caused by:
    0: service error
    1: unhandled error (AccessDenied)
    2: Error { code: "AccessDenied", message: "User: arn:aws:iam::769362853226:user/backup is not authorized to perform: s3:GetObjectVersion on resource: \"arn:aws:s3:::dandiarchive/blobs/000/162/0001628a-30b9-4bd2-b7c8-0629c91926ae\" with an explicit deny in a resource-based policy", s3_extended_request_id: "dXm5l/gub97tNfHrYWkwRcy+M99V9wI9t1bZ6hkVGOnWMn3NXSE1S/RmVExNOgc0l7S8CII73/bcMH05ROwIk7JCuy8V4qFUWVL8Phi3UkM=", aws_request_id: "EADEXRJC6E14QJ1H" }
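The distinction matters: a GET without a versionId needs only s3:GetObject, while passing a versionId requires s3:GetObjectVersion. A quick check (a sketch, reusing the key and version ID from the error above) shows which of the two is denied:

# Sketch: the same key fetched without and with a VersionId exercises two
# different IAM actions (s3:GetObject vs s3:GetObjectVersion).
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "dandiarchive"
key = "blobs/000/162/0001628a-30b9-4bd2-b7c8-0629c91926ae"

for version_id in (None, "93syFLQ5I1pfmGowVrRPw.QDSxxsVm8f"):
    kwargs = {"Bucket": bucket, "Key": key}
    if version_id is not None:
        kwargs["VersionId"] = version_id
    try:
        s3.get_object(**kwargs)
        print("allowed:", version_id or "latest")
    except ClientError as e:
        print(e.response["Error"]["Code"], "->", version_id or "latest")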

kabilar commented Feb 5, 2025

Looks like the AmazonS3ReadOnlyAccess policy may not work. We will need to create a policy that includes s3:ListBucketVersions. (StackOverflow post)
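A sketch of a custom read-only policy covering the versioned actions (hedged: the policy name and exact action list are assumptions, not the deployed policy; the backup user name comes from the error message above):

# Sketch: attach an inline read-only policy including the versioned
# list/get actions. Policy name and action list are illustrative.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "s3:ListBucket",
            "s3:ListBucketVersions",
            "s3:GetObject",
            "s3:GetObjectVersion",
        ],
        "Resource": [
            "arn:aws:s3:::dandiarchive",
            "arn:aws:s3:::dandiarchive/*",
        ],
    }],
}

iam = boto3.client("iam")
iam.put_user_policy(
    UserName="backup",  # user from the AccessDenied message above
    PolicyName="dandiarchive-versioned-read",  # illustrative name
    PolicyDocument=json.dumps(policy),
)

Note, though, that the error above reports an explicit deny in a resource-based policy; an Allow in an identity policy cannot override an explicit deny in the bucket policy.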

satra commented Feb 5, 2025

We can add that to the policy, but check with @jwodder, as I think the unit tests in s3invsync use this same policy. We may want to know why it works there and not here.

jwodder commented Feb 5, 2025

@satra The s3invsync tests do not do any I/O. All actual testing of the program in action has been done manually, and since the dandiarchive bucket is so huge, it's always been tested with a path filter that excludes embargoed assets and then some.

kabilar commented Feb 5, 2025

cc @aaronkanzer

kabilar commented Feb 6, 2025

> Provided separately. However, let's also make sure s3invsync doesn't error out for access denied; it should simply report that some files were not synced because of permissions and log them somewhere.

Filed dandi/s3invsync#158

kabilar commented Feb 6, 2025

Sync of the DANDI Open Data bucket to the MIT Engaging Cluster (/orcd/data/dandi/001/s3dandiarchive) is underway for the public assets. It has been running for ~1 hour and has downloaded ~10 GB.

yarikoptic commented

That's underwhelming. Any idea whether CPU or network or ... is the bottleneck?
What was the exact invocation? I would like to see how much I would get through on drogon within 1 hour.

kabilar commented Feb 7, 2025

> That's underwhelming. Any idea whether CPU or network or ... is the bottleneck?

Agreed. I am looking into it. I will start it on a node with more cores.

> What was the exact invocation? I would like to see how much I would get through on drogon within 1 hour.

cd /home/kabi/s3invsync
cargo run --release -- --ok-errors access-denied s3://dandiarchive/dandiarchive/dandiarchive/ /orcd/data/dandi/001/s3dandiarchive/

kabilar commented Feb 10, 2025

> It has been running for ~1 hour and has downloaded ~10 GB.

@yarikoptic @aaronkanzer Currently using a node with 32 cores and 256 GB of RAM. Download speed is now ~1 GB/minute (~60 GB/hour, roughly a 6x improvement over the initial ~10 GB/hour).

kabilar commented Feb 11, 2025

> Looks like the AmazonS3ReadOnlyAccess policy may not work. We will need to create a policy that includes s3:ListBucketVersions. (StackOverflow post)

Hi @satra, following up here. Could you please add s3:ListBucketVersions to the policy? Thanks.

satra commented Feb 11, 2025

try now.

kabilar commented Feb 11, 2025

Unfortunately that didn't seem to work. I will test on the LINC private bucket to determine which policies we need, and then get back to you.

failed to get object at s3://dandiarchive/blobs/000/38a/00038a91-c065-483e-a10e-1720732dbb2b?versionId=wke1N1PMq9V4q3HVjOGOqvvMCXKalWHs

Caused by:
    0: service error
    1: unhandled error (AccessDenied)
    2: Error { code: "AccessDenied", message: "User: arn:aws:iam::769362853226:user/backup is not authorized to perform: s3:GetObjectVersion on resource: \"arn:aws:s3:::dandiarchive/blobs/000/38a/00038a91-c065-483e-a10e-1720732dbb2b\" with an explicit deny in a resource-based policy", aws_request_id: "DYASS7EN76VZSZR6", s3_extended_request_id: "Lq9YfQTjYO1PC1S3P/qgJiCiyGf52jw6s/cuLIGu3mnbZFZO6OcWgJIta64qlZCv+4fHQaICEG7paPyS2mwHMg==" }
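One way to determine the needed actions empirically (a sketch; the bucket and key names below are hypothetical placeholders, not the real LINC bucket layout):

# Sketch: probe which S3 actions the current credentials allow against a
# versioned bucket. Bucket and key are hypothetical placeholders.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "linc-private-bucket"  # hypothetical name
key = "path/to/some/object"     # hypothetical key

checks = {
    "s3:ListBucket": lambda: s3.list_objects_v2(Bucket=bucket, MaxKeys=1),
    "s3:ListBucketVersions": lambda: s3.list_object_versions(Bucket=bucket, MaxKeys=1),
    "s3:GetObject": lambda: s3.get_object(Bucket=bucket, Key=key),
}

for action, call in checks.items():
    try:
        call()
        print(f"{action}: allowed")
    except ClientError as e:
        print(f"{action}: {e.response['Error']['Code']}")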
