Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add StorageID to ObjectPointer #8522

Merged
merged 34 commits into from
Jan 22, 2025

Conversation

itaigilo
Copy link
Contributor

Closes #8496.

Change Description

Adding StorageID to ObjectPointer.
Adding StorageID to the init of ObjectPointer, that cover the basic flows of the server.

My approach in this PR

Search for all initializations of ObjectPointer, and:

  • Used repo.StorageID where possible.
  • Added a TODO if repo isn't present.
  • Added "" if it's a test.

What's Next

I'll open new Issues for the following, to be handled later on:

  • Add StorageID to objPointer of TierFS.
  • Add StorageID to GarbageCollectionManager.
  • Add StorageID to ResolveNamespace:
    • This is used by GC, so first we need to understand how to handle the GC.
    • Is DefaultResolveNamespace relevant to StorageID?
  • Re-search for value-reads of ObjectPointer.StorageNamespace, and see if they need ObjectID references.
  • Handle ensureStorageNamespace (it probably need to be moved to the Block interface, and add storageID to its signature).
  • Handle actions: OutputWrite should have StorageID, and so does HookOutputWriter.

@itaigilo itaigilo added the exclude-changelog PR description should not be included in next release changelog label Jan 20, 2025
@itaigilo itaigilo added the msb label Jan 20, 2025
@itaigilo itaigilo requested a review from N-o-Z January 20, 2025 21:45
Copy link

github-actions bot commented Jan 20, 2025

🎊 PR Preview b361b34 has been successfully built and deployed to https://treeverse-lakeFS-preview-pr-8522.surge.sh

🕐 Build time: 0.01s

🤖 By surge-preview

Copy link

E2E Test Results - DynamoDB Local - Local Block Adapter

13 passed

Copy link

E2E Test Results - Quickstart

11 passed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No clue why it wasn't added before, but I think it should be in master.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not in it's own PR then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following a discussion with @guy-har -
I'll remove from here it once #8527 will be merged to master (to avoid annoying conflicts).

@@ -2091,6 +2096,7 @@ func (c *Controller) ensureStorageNamespace(ctx context.Context, storageNamespac
// this serves two purposes, first, we maintain safety check for older lakeFS version.
// second, in scenarios where lakeFS shouldn't have access to the root namespace (i.e pre-sign URL only).
if c.Config.Graveler.EnsureReadableRootNamespace {
// TODO (gilo): ObjectPointer init - add StorageID here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this change delayed? It's going in a different PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce the scope of this one, and making it more reviewable (yet complete enough to be able to merge it in the end).

var blob *upload.Blob
if mediaType != "multipart/form-data" {
// handle non-multipart, direct content upload
address := c.PathProvider.NewPath()
blob, err = upload.WriteBlob(ctx, c.BlockAdapter, repo.StorageNamespace, address, r.Body, r.ContentLength,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this code style change indicate a change in our best practices?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I don't think we have a code style guide for go,
And the formatter didn't seem to care.
These lines became too long.

@@ -2550,7 +2550,7 @@ func TestController_ObjectsHeadObjectHandler(t *testing.T) {
buf := new(bytes.Buffer)
buf.WriteString("this is file content made up of bytes")
address := upload.DefaultPathProvider.NewPath()
blob, err := upload.WriteBlob(context.Background(), deps.blocks, onBlock(deps, "ns1"), address, buf, 37, block.PutOpts{StorageClass: &expensiveString})
blob, err := upload.WriteBlob(context.Background(), deps.blocks, "", onBlock(deps, "ns1"), address, buf, 37, block.PutOpts{StorageClass: &expensiveString})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it valid to provide an empty string for the storageid? Shouldn't this trigger some sort of "invalid parameter" exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, empty StorageID is the only valid option.

@@ -63,6 +63,7 @@ const DefaultPreSignExpiryDuration = 15 * time.Minute
// ObjectPointer is a unique identifier of an object in the object
// store: the store is a 1:1 mapping between pointers and objects.
type ObjectPointer struct {
StorageID string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation: This is such an important core class class, but there is no explanation what sort of values these fields are ment to take. Is "S3" a valid storage ID? Or maybe a URL is expected like "https://bucketname.s3.amazonaws.com"? I have no idea when reading this code?

Same goes for the other two fields

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is out of scope for this PR - we should however do this for the config part and document that well.
@itaigilo This is another thing we need to bring up with product

bucketName = "test"
ObjectBlockSize = 1024 * 3
storageNamespace = "test"
ObjectBlockSize = 1024 * 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this size measured in bits? bytes? gigabytes? Would be nice if I didn't have to guess(const name or docs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No clue -
only changed whitespaces here...

@@ -110,6 +110,7 @@ func (controller *GetObject) Handle(w http.ResponseWriter, req *http.Request, o
contentLength := entry.Size
contentRange := ""
objectPointer := block.ObjectPointer{
StorageID: o.Repository.StorageID,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume these will be "" fore existing repos? Is that valid? Don't we need a migrator to initialize the values for existing repos? Is that in this PR? Another issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, "" is valid, for existing repos.

Copy link
Member

@N-o-Z N-o-Z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - most of it is pretty straightforward.
I am however asking for changes on some of the open issues + some cleanups.

@@ -17,11 +17,12 @@ type Blob struct {
CreationDate time.Time
}

func WriteBlob(ctx context.Context, adapter block.Adapter, bucketName, address string, body io.Reader, contentLength int64, opts block.PutOpts) (*Blob, error) {
func WriteBlob(ctx context.Context, adapter block.Adapter, storageID, storageNamespace, address string, body io.Reader, contentLength int64, opts block.PutOpts) (*Blob, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point I think we should just consider passing the ObjectPointer instead of all these params

@@ -419,6 +419,7 @@ func (tfs *TierFS) newLocalFileRef(namespace, nsPath, filename string) localFile
}

func (tfs *TierFS) objPointer(namespace, filename string) block.ObjectPointer {
// TODO (gilo): ObjectPointer init - add StorageID here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we planning on doing that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically we need to propagate this in the hierarchy -
I'll soon do this in a separate PR, to not overload this PR.

objReader, err := c.BlockAdapter.Get(ctx, block.ObjectPointer{
Identifier: obj,
IdentifierType: block.IdentifierTypeFull,
StorageID: "",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary - can be removed

Comment on lines 3217 to 3219
storageNamespace := repo.StorageNamespace
storageID := repo.StorageID

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant - please remove

@@ -63,6 +63,7 @@ const DefaultPreSignExpiryDuration = 15 * time.Minute
// ObjectPointer is a unique identifier of an object in the object
// store: the store is a 1:1 mapping between pointers and objects.
type ObjectPointer struct {
StorageID string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is out of scope for this PR - we should however do this for the config part and document that well.
@itaigilo This is another thing we need to bring up with product

@@ -139,6 +139,7 @@ func ResolveBlobURLInfoFromURL(pathURL *url.URL) (BlobURLInfo, error) {

func resolveBlobURLInfo(obj block.ObjectPointer) (BlobURLInfo, error) {
key := obj.Identifier
// TODO (gilo): verify there's no need for StorageID here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point you are already in the context of a specific adapter instance so storage ID is not needed here

@@ -411,6 +411,7 @@ func (a *Adapter) GetPreSignedURL(ctx context.Context, obj block.ObjectPointer,

log := a.log(ctx).WithFields(logging.Fields{
"operation": "GetPreSignedURL",
"sid": obj.StorageID, // TODO (gilo): is "sid" a good enough name?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not good enough :)
Let's be explicit (we don't have to be stingy in our logs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I haven't done this is that StorageNamespace is logged as namespace and not as storage_namespace 😖

@@ -466,6 +467,7 @@ func (a *Adapter) GetPresignUploadPartURL(ctx context.Context, obj block.ObjectP

log := a.log(ctx).WithFields(logging.Fields{
"operation": "GetPresignUploadPartURL",
"sid": obj.StorageID, // TODO (gilo): is "sid" a good enough name?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@@ -2630,7 +2630,7 @@ type UncommittedParquetObject struct {
CreationDate int64 `parquet:"name=creation_date, type=INT64, convertedtype=INT_64"`
}

func (c *Catalog) uploadFile(ctx context.Context, ns graveler.StorageNamespace, location string, fd *os.File, size int64) (string, error) {
func (c *Catalog) uploadFile(ctx context.Context, sid graveler.StorageID, ns graveler.StorageNamespace, location string, fd *os.File, size int64) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't use sid, it's overloaded and confusing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the ns "convention".

@@ -2630,7 +2630,7 @@ type UncommittedParquetObject struct {
CreationDate int64 `parquet:"name=creation_date, type=INT64, convertedtype=INT_64"`
}

func (c *Catalog) uploadFile(ctx context.Context, ns graveler.StorageNamespace, location string, fd *os.File, size int64) (string, error) {
func (c *Catalog) uploadFile(ctx context.Context, sid graveler.StorageID, ns graveler.StorageNamespace, location string, fd *os.File, size int64) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point consider just passing the repository object, will be cleaner than just adding more params

Copy link
Contributor Author

@itaigilo itaigilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@N-o-Z fixed, PTAL again.

@@ -411,6 +411,7 @@ func (a *Adapter) GetPreSignedURL(ctx context.Context, obj block.ObjectPointer,

log := a.log(ctx).WithFields(logging.Fields{
"operation": "GetPreSignedURL",
"sid": obj.StorageID, // TODO (gilo): is "sid" a good enough name?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I haven't done this is that StorageNamespace is logged as namespace and not as storage_namespace 😖

@@ -2630,7 +2630,7 @@ type UncommittedParquetObject struct {
CreationDate int64 `parquet:"name=creation_date, type=INT64, convertedtype=INT_64"`
}

func (c *Catalog) uploadFile(ctx context.Context, ns graveler.StorageNamespace, location string, fd *os.File, size int64) (string, error) {
func (c *Catalog) uploadFile(ctx context.Context, sid graveler.StorageID, ns graveler.StorageNamespace, location string, fd *os.File, size int64) (string, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the ns "convention".

@@ -419,6 +419,7 @@ func (tfs *TierFS) newLocalFileRef(namespace, nsPath, filename string) localFile
}

func (tfs *TierFS) objPointer(namespace, filename string) block.ObjectPointer {
// TODO (gilo): ObjectPointer init - add StorageID here
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically we need to propagate this in the hierarchy -
I'll soon do this in a separate PR, to not overload this PR.

@itaigilo itaigilo requested a review from N-o-Z January 22, 2025 09:37
Copy link
Member

@N-o-Z N-o-Z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful, thanks!!

-1,
block.PutOpts{StorageClass: params.StorageClass},
)
objectPointer := block.ObjectPointer{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@@ -2630,7 +2630,7 @@ type UncommittedParquetObject struct {
CreationDate int64 `parquet:"name=creation_date, type=INT64, convertedtype=INT_64"`
}

func (c *Catalog) uploadFile(ctx context.Context, sid graveler.StorageID, ns graveler.StorageNamespace, location string, fd *os.File, size int64) (string, error) {
func (c *Catalog) uploadFile(ctx context.Context, repo *graveler.RepositoryRecord, location string, fd *os.File, size int64) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@@ -411,7 +411,7 @@ func (a *Adapter) GetPreSignedURL(ctx context.Context, obj block.ObjectPointer,

log := a.log(ctx).WithFields(logging.Fields{
"operation": "GetPreSignedURL",
"sid": obj.StorageID, // TODO (gilo): is "sid" a good enough name?
"storage_id": obj.StorageID,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@itaigilo itaigilo enabled auto-merge (squash) January 22, 2025 21:06
@itaigilo itaigilo merged commit 49becea into master Jan 22, 2025
38 checks passed
@itaigilo itaigilo deleted the feature/add-storage-id-to-obj-ptr-part-1 branch January 22, 2025 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
exclude-changelog PR description should not be included in next release changelog msb
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make ObjectPointer store aware
3 participants