
Incorrect checksum check in get_object when using Range="0-" for an object with composite checksum #3346

Open · 1 task done
eric-higgins-ai opened this issue Jan 22, 2025 · 2 comments
Labels: bug (confirmed bug), p1 (high priority), potential-regression, s3

eric-higgins-ai commented Jan 22, 2025

Describe the bug

Making a get_object request to the S3 API with and without Range="0-" for an object that was uploaded in multiple parts gives different checksums:

  • Without Range="0-" gives the checksum: UymZPQ==-7
  • With Range="0-" gives the checksum: UymZPQ==

When Range="0-" is specified, we don't hit the case here so we consider the checksum a "full object" checksum and try to check the integrity of the downloaded content against it. Clearly it isn't a "full object" checksum, so this case fails incorrectly.

The code linked above is copied here for convenience:

        # If a - is in the checksum this is not valid Base64. S3 returns
        # checksums that include a -# suffix to indicate a checksum derived
        # from the hash of all part checksums. We cannot wrap this response
        if "-" in headers[header_name]:
            continue
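
For context, the composite suffix that comment refers to can be observed directly on the uploaded object. A minimal sketch (the bucket and key are placeholders; ChecksumMode="ENABLED" asks S3 to return checksum headers):

    import boto3

    client = boto3.client("s3")

    # Ask S3 to include checksum information in the response.
    head = client.head_object(
        Bucket="my-bucket",  # placeholder bucket
        Key="key",           # placeholder key
        ChecksumMode="ENABLED",
    )

    # For a multipart-uploaded object this prints something like
    # "UymZPQ==-7": the base64 CRC32 plus a "-<part count>" suffix.
    print(head.get("ChecksumCRC32"))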

This is a regression in v1.36.0 due to boto/boto3#4392.

Regression Issue

  • [x] Select this option if this issue appears to be a regression.

Expected Behavior

The checksum check should always pass for a file that was downloaded with no corruption.

Current Behavior

The checksum check fails when calling get_object with Range="0-", with the following stack trace:

Traceback (most recent call last):
  File "/applied2/script.py", line 18, in <module>
    _response = client.get_object(Bucket=BUCKET_NAME, Key=S3_KEY, Range="bytes=0-")["Body"].read()
  File "/home/eric/.local/lib/python3.9/site-packages/botocore/httpchecksum.py", line 239, in read
    self._validate_checksum()
  File "/home/eric/.local/lib/python3.9/site-packages/botocore/httpchecksum.py", line 248, in _validate_checksum
    raise FlexibleChecksumError(error_msg=error_msg)
botocore.exceptions.FlexibleChecksumError: Expected checksum UymZPQ== did not match calculated checksum: DgOq1g==

Reproduction Steps

import boto3

MEGABYTE = 1024 * 1024
BUCKET_NAME = "lilypad-working-dir-zips"
S3_KEY = "key"

# Write a 50 MB file (large enough that upload_file will use a multipart upload).
with open("file", "w") as f:
    f.write("a" * 50 * MEGABYTE)

client = boto3.client("s3")
client.upload_file("file", BUCKET_NAME, S3_KEY)

# This call succeeds.
_response = client.get_object(Bucket=BUCKET_NAME, Key=S3_KEY)["Body"].read()

# This call fails.
_response = client.get_object(Bucket=BUCKET_NAME, Key=S3_KEY, Range="bytes=0-")["Body"].read()

Possible Solution

It may be possible to use the x-amz-checksum-type header to determine whether the checksum is COMPOSITE and ignore it if so. However, I'm not sure this is a good idea, because it would remove checksum verification for get_object requests whose ranges match upload parts and therefore do have a valid checksum.
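
A minimal sketch of that idea, mirroring the quoted botocore snippet (the helper name and header access are hypothetical, not actual botocore code):

    def should_skip_validation(headers: dict) -> bool:
        """Hypothetical check combining both signals discussed above."""
        # Proposed: treat any COMPOSITE checksum as non-validatable. The
        # drawback is that part-aligned ranged GETs also report COMPOSITE
        # even though their part-level checksums are valid.
        if headers.get("x-amz-checksum-type") == "COMPOSITE":
            return True
        # Existing heuristic: a "-" marks a "<base64>-<part count>" value.
        return "-" in headers.get("x-amz-checksum-crc32", "")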

Additional Information/Context

We're calling this API via smart-open, so we don't have direct control over the get_object parameters. I can file an issue there if this is a misuse of boto3, but it seems to me that they're making a valid API call and that this should be fixed here.

SDK version used

1.36.3

Environment details (OS name and version, etc.)

Ubuntu 22.04

eric-higgins-ai added the bug and needs-triage labels on Jan 22, 2025
github-actions added the potential-regression label on Jan 22, 2025
jonathan343 (Contributor) commented Jan 22, 2025

Hey @eric-higgins-ai, thanks for bringing this issue to our attention!

I spent some time investigating this and was able to reproduce the issue you're describing.

The default behavior for botocore is to validate checksums sent by S3 when making GetObject requests, if S3 sends back a non-composite checksum value in an x-amz-checksum-<supported_algorithm> header. As you described, botocore determines whether a checksum is a composite type by checking for the existence of a - (read further down for why we don't use x-amz-checksum-type).

In the case you're describing, S3 seems to be dropping the -# from the checksum, causing botocore to incorrectly perform checksum validation. I've made the S3 team aware of this problem and will provide any updates on resolution in this GitHub issue.

All cases other than Range="0-" seem to be working as expected. Until this issue is resolved, I'd suggest making get_object requests with no Range parameter instead of Range="0-", which behaves correctly. You may also want to open an issue in smart-open for more specific assistance with your use case.

Note: It is possible to disable default checksum validation by setting the response_checksum_validation config to when_required.
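
For example, a minimal sketch (response_checksum_validation is a standard botocore Config option):

    import boto3
    from botocore.config import Config

    # Only validate response checksums when the operation requires it,
    # which disables the default GetObject validation that hits this bug.
    client = boto3.client(
        "s3",
        config=Config(response_checksum_validation="when_required"),
    )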

Why not use x-amz-checksum-type?
When uploading an object using high-level operations in boto3 like upload_file, our s3 transfer manager will perform a multipart upload using a specific part size (default 8 MB) if the file is above a certain threshold (also 8 MB by default). S3 seems to store the full checksum of each part in addition to the composite checksum of all the parts (such as UymZPQ==-7 in your case).

When a user performs ranged get_object requests using the same size as the multipart upload parts, S3 will send back a checksum for the SDK to validate against (illustrated in the sketch below). In this case x-amz-checksum-type is still COMPOSITE, so if we used that header, the SDK would skip validation, which isn't desired.
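
A hedged illustration of such a part-aligned request (bucket and key are placeholders; 8388607 assumes the default 8 MB part size, i.e. 8 * 1024 * 1024 - 1):

    import boto3

    client = boto3.client("s3")

    # With the default 8 MB part size, bytes 0-8388607 line up exactly
    # with part 1, so S3 returns that part's valid, non-composite
    # checksum and validation passes, even though the object's
    # x-amz-checksum-type is still COMPOSITE.
    part_one = client.get_object(
        Bucket="my-bucket",  # placeholder bucket
        Key="key",           # placeholder key
        Range="bytes=0-8388607",
    )["Body"].read()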

Thanks again, and I'll keep you posted!

khushail added the p1 and s3 labels and removed the needs-triage label on Jan 23, 2025
bikehara-ias commented

FWIW, our team did run into an error where Range was not used but the -# suffix was missing. I can DM the AWS support case if needed.
