Revise HashStore storeObject process #73

Closed
doulikecookiedough opened this issue Nov 2, 2023 · 3 comments
Labels: enhancement, question

Comments

@doulikecookiedough
Contributor

Currently, when a Metacat client uploads a file, the form, the metadata, and the object stream can arrive in any order. As a result, if the stream arrives first (before the form, which contains the pid), we cannot call storeObject or any of its overloads, since no pid is available yet.

Investigate how we could potentially revise our process, discuss the solution(s) with Jing and double check with the backend team. Once a solution has been accepted, implement the changes here and in HashStore-java.

@doulikecookiedough doulikecookiedough added enhancement New feature or request question Further information is requested labels Nov 2, 2023
@doulikecookiedough doulikecookiedough self-assigned this Nov 2, 2023
@doulikecookiedough doulikecookiedough changed the title Revise HashStore storeObject procedure Revise HashStore storeObject process Nov 2, 2023
@doulikecookiedough
Contributor Author

doulikecookiedough commented Nov 6, 2023

To Review with @taojing2002 :

In this revised approach, I propose that we:

  1. Store objects with their content identifier (cid) as the permanent address
    • The sha256 hash of a given pid is no longer used as the address; this reverts to the initial design choice
  2. Revise & add new Public API methods:
    • storeObject(InputStream objectStream)
    • validateObject(ObjectInfo objectInfo, String checksum, String checksumAlgorithm, long objSize)
    • tagObject(String cid, String pid)
  3. Utilize a reference file to keep track of whether an object has multiple references
    • The reference files are stored with the same permanent address as the cid in /refs, following the HashStore config depth and width

      Example folder layout for a single file stored along with its metadata and reference file
      # Notes:
      # - The reference file for a pid contains the cid
      # - The reference file for a cid contains the pids that reference the cid
      
      /objects
          └─ /d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
      /metadata
          └─ /15/8d/7e/55c36a810d7c14479c9...b20d7df66768b04
      /refs
          └─ pids/0d/55/5e/d77052d7e166017f779...7230bcf7abcef65e
          └─ cids/d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
      hashstore.yaml
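
The layout above could be derived roughly as follows. This is a minimal sketch, not the HashStore implementation: `shard` and `ref_paths` are hypothetical helper names, and depth 3 and width 2 are assumed from the example paths rather than read from hashstore.yaml.

```python
import hashlib


def shard(digest: str, depth: int = 3, width: int = 2) -> str:
    """Split a hex digest into `depth` directories of `width` characters,
    with the remainder of the digest used as the file name."""
    if len(digest) <= depth * width:
        raise ValueError("digest too short for the requested depth and width")
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    parts.append(digest[depth * width:])
    return "/".join(parts)


def ref_paths(cid: str, pid: str) -> dict:
    """Relative paths for an object and its two reference files (hypothetical helper)."""
    # The pid reference is addressed by the hash of the pid (sha256 assumed as the default).
    pid_hash = hashlib.sha256(pid.encode("utf-8")).hexdigest()
    return {
        "object": "objects/" + shard(cid),
        "cid_ref": "refs/cids/" + shard(cid),
        "pid_ref": "refs/pids/" + shard(pid_hash),
    }
```

The object and its cid reference share the same sharded address, so either can be located from the cid alone.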
      

Details:

  1. What is the new storeObject process?

    • Step 1: storeObject(InputStream objectStream)
      • This writes the object to a tmp file inside HashStore and returns the tmp file name along with the hex digest dictionary
    • Step 2: validateObject(ObjectInfo objectInfo, String checksum, String checksumAlgorithm, long objSize)
      • This is called by the Metacat client to ensure that the right object was stored
    • Step 3: tag/commitObject(String cid, String pid) or deleteObject(String cid, String pid)
      • Synchronized based on the given cid
      • If the ref file doesn't exist yet, we acquire a system-wide file lock, create a tmp file, write the expected content (the pid) into it, move it to the permanent address for its cid, then release the lock
        • If the file already exists, we acquire a system-wide file lock, read the file into memory, apply the changes, write the data to a tmp file, rename it into place, then release the lock
      • If validateObject throws an exception, HashStore calls deleteObject(String cid)
        • Otherwise, the object has been stored successfully and tagObject(String cid, String pid) is called
      • More on deleteObject
        • Calling deleteObject without a pid is possible, but it proceeds only if no reference file exists
      • If the object is validated, it is moved to its permanent address (its content identifier, based on the store's default algorithm)
  2. How does the refs keep track of pids and prevent accidental deletions?

    • Keeping track of references:
      • First, we calculate the sha256 (or default algorithm) hash of the given pid to find the reference file in /refs/pids, which contains a single cid
      • Then we look for the cid reference file in /refs/cids
      • If the file exists, we append the new pid on a new line
      • If it doesn't, we create it and write the pid
       /refs
           └─ pids/0d/55/5e/d77052d7e166017f779...7230bcf7abcef65e
           └─ cids/d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
       
       Content of refs/cids/d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
       dou.test.1
       j.tao.1700.1
       j.tao.1700.1.2
      
    • Preventing Accidental Deletions:
      • When deleteObject is called, like tagObject, we synchronize based on the given cid
        • We acquire a system-wide file lock on the reference file for the given cid, read it, and check whether it contains only a single reference and whether that reference is the one given (sha256(pid)).
          • If there is more than one reference, we remove the given reference, write the updated data to a tmp file, rename it and then release the lock
        • If there is only one reference, we delete the cid from /objects
        • Then we delete the reference file from /refs, and release the system-wide file lock
        • As the last step, we release the lock on the cid
  3. What happens if we are trying to write a reference file and delete it at the same time?

    • The cid object lock will be shared between tagObject and deleteObject, so they must execute sequentially
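
The reference bookkeeping described in points 2 and 3 can be sketched as below. This is an illustration under stated assumptions, not the HashStore code: it uses a per-cid `threading.Lock` only (the system-wide file lock mentioned above is omitted), stores files under flat names instead of sharded paths, and leaves out the /refs/pids side. The function names `tag_object` and `delete_object` mirror the proposal; the helpers are hypothetical.

```python
import os
import tempfile
import threading
from collections import defaultdict
from pathlib import Path

# One in-process lock per cid, shared by tag_object and delete_object,
# so tagging and deleting the same cid always run one after the other.
_cid_locks = defaultdict(threading.Lock)
_registry_lock = threading.Lock()


def _lock_for(cid: str) -> threading.Lock:
    with _registry_lock:
        return _cid_locks[cid]


def _write_atomically(path: Path, lines) -> None:
    """Write to a tmp file in the same directory, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as tmp_file:
        tmp_file.write("".join(line + "\n" for line in lines))
    os.replace(tmp_name, path)


def tag_object(refs_dir: Path, cid: str, pid: str) -> None:
    """Record that `pid` references `cid` (one pid per line)."""
    ref = refs_dir / "cids" / cid
    with _lock_for(cid):
        pids = ref.read_text().splitlines() if ref.exists() else []
        if pid not in pids:
            pids.append(pid)
        _write_atomically(ref, pids)


def delete_object(refs_dir: Path, objects_dir: Path, cid: str, pid: str) -> bool:
    """Drop the reference from `pid`; return True if the object itself was removed."""
    ref = refs_dir / "cids" / cid
    with _lock_for(cid):
        pids = ref.read_text().splitlines() if ref.exists() else []
        remaining = [p for p in pids if p != pid]
        if remaining:
            # Other pids still reference this cid: keep the object, update the refs.
            _write_atomically(ref, remaining)
            return False
        (objects_dir / cid).unlink(missing_ok=True)
        ref.unlink(missing_ok=True)
        return True
```

Because both functions take the same per-cid lock, a delete cannot observe a half-written reference file, and the object is removed only once no pid references it.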

@doulikecookiedough
Contributor Author

doulikecookiedough commented Nov 9, 2023

The comment above has been updated after discussing with @taojing2002. Note that we must clarify how the Public API will change regarding storeObject(), and whether the new process becomes the norm with the existing methods becoming the exception (one could use them, but the Metacat client won't). Alternatively, we can remove the existing storeObject() methods and have only one process for storing an object. Lastly, we must carefully implement and test the locking process.

I will first implement these changes in Python before moving onto Java.
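
As a rough starting point, steps 1 and 2 of the proposed process (store to a tmp file, validate, then move to the cid address) might look like the sketch below. It is an assumption-laden illustration rather than the final API: the snake_case names, the `commit_object` helper, the tuple standing in for `ObjectInfo`, the `tmp`/`objects` directory names, the flat (unsharded) object path, and the choice of extra algorithms are all hypothetical.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def store_object(stream, store_dir: Path, algorithms=("sha256", "sha512")):
    """Step 1: write the stream to a tmp file while computing its digests."""
    tmp_dir = store_dir / "tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    hashers = {name: hashlib.new(name) for name in algorithms}
    fd, tmp_name = tempfile.mkstemp(dir=tmp_dir)
    with os.fdopen(fd, "wb") as tmp_file:
        for chunk in iter(lambda: stream.read(8192), b""):
            tmp_file.write(chunk)
            for hasher in hashers.values():
                hasher.update(chunk)
    hex_digests = {name: hasher.hexdigest() for name, hasher in hashers.items()}
    return Path(tmp_name), hex_digests


def validate_object(tmp_path: Path, hex_digests: dict,
                    checksum: str, checksum_algorithm: str, obj_size: int) -> None:
    """Step 2: raise ValueError if the size or checksum is not what the client expects."""
    if tmp_path.stat().st_size != obj_size:
        raise ValueError("object size does not match")
    if hex_digests.get(checksum_algorithm) != checksum:
        raise ValueError("checksum does not match")


def commit_object(tmp_path: Path, hex_digests: dict, store_dir: Path,
                  default_algorithm: str = "sha256") -> str:
    """Move a validated tmp file to its cid address and return the cid."""
    cid = hex_digests[default_algorithm]
    destination = store_dir / "objects" / cid
    destination.parent.mkdir(parents=True, exist_ok=True)
    os.replace(tmp_path, destination)
    return cid
```

Since `store_object` needs only the stream, it can run before the form with the pid arrives; the pid is required only later, when the object is tagged.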

@doulikecookiedough
Contributor Author

This has been completed via Feature-73: store_object Refactor (with References)

Additional testing to be done during cross-language testing with HashStore-java's refactor/implementation.
