diff --git a/README.md b/README.md index 0ddaac61..68a93a0a 100644 --- a/README.md +++ b/README.md @@ -16,20 +16,33 @@ DataONE in general, and HashStore in particular, are open source, community proj Documentation is a work in progress, and can be found on the [Metacat repository](https://github.com/NCEAS/metacat/blob/feature-1436-storage-and-indexing/docs/user/metacat/source/storage-subsystem.rst#physical-file-layout) as part of the storage redesign planning. Future updates will include documentation here as the package matures. -## Development build +## HashStore Overview -HashStore is a python package, and built using the [Python Poetry](https://python-poetry.org) build tool. +HashStore is a content-addressable file management system that utilizes the content identifier of an object to address files. The system stores both objects, references (refs) and metadata in its respective directories and provides an API for interacting with the store. HashStore storage classes (like `FileHashStore`) must implement the HashStore interface to ensure the expected usage of HashStore. -To install `hashstore` locally, create a virtual environment for python 3.9+, -install poetry, and then install or build the package with `poetry install` or `poetry build`, respectively. +###### Public API Methods +- store_object +- verify_object +- tag_object +- find_object +- store_metadata +- retrieve_object +- retrieve_metadata +- delete_object +- delete_metadata +- get_hex_digest + +For details, please see the HashStore interface (HashStore.java) -To run tests, navigate to the root directory and run `pytest -s`. The test suite contains tests that -take a longer time to run (relating to the storage of large files) - to execute all tests, run -`pytest --run-slow`. To see detailed -## Usage Example +###### How do I create a HashStore? -To view more details about the Public API - see 'hashstore.py` interface documentation +To create or interact with a HashStore, instantiate a HashStore object with the following set of properties: +- store_path +- store_depth +- store_width +- store_algorithm +- store_metadata_namespace ```py from hashstore import HashStoreFactory @@ -54,49 +67,148 @@ my_store = factory.get_hashstore(module_name, class_name, properties) # Store objects (.../[hashstore_path]/objects/) pid = "j.tao.1700.1" object = "/path/to/your/object.data" -hash_address = my_store.store_object(pid, object) -object_cid = hash_address.id +object_metadata = my_store.store_object(pid, object) +object_cid = object_metadata.id # Store metadata (.../[hashstore_path]/metadata/) # By default, storing metadata will use the given properties namespace `format_id` pid = "j.tao.1700.1" sysmeta = "/path/to/your/sysmeta/document.xml" metadata_cid = my_store.store_metadata(pid, sysmeta) -``` -If you want to store other types of metadata, add an additional `format_id`. -```py +# If you want to store other types of metadata, add an additional `format_id`. pid = "j.tao.1700.1" metadata = "/path/to/your/metadata/document.json" format_id = "http://custom.metadata.com/json/type/v1.0" metadata_cid = my_store.store_metadata(pid, metadata, format_id) + +# ... +``` + +###### Working with objects (store, retrieve, delete) + +In HashStore, objects are first saved as temporary files while their content identifiers are calculated. Once the default hash algorithm list and their hashes are generated, objects are stored in their permanent location using the store's algorithm's corresponding hash value, the store depth and the store width. Lastly, reference files are created for the object so that they can be found and retrieved given an identifier (ex. persistent identifier (pid)). Note: Objects are also stored once and only once. + +By calling the various interface methods for `store_object`, the calling app/client can validate, store and tag an object simultaneously if the relevant data is available. In the absence of an identfiier (ex. persistent identifier (pid)), `store_object` can be called to solely store an object. The client is then expected to call `verify_object` when the relevant metadata is available to confirm that the object has been stored as expected. And to finalize the process (to make the object discoverable), the client calls `tag_object``. In summary, there are two expected paths to store an object: + +```py +# All-in-one process which stores, validates and tags an object +objectMetadata objInfo = store_object(InputStream, pid, additionalAlgorithm, checksum, checksumAlgorithm, objSize) + +# Manual Process +# Store object +obj_metadata = store_object(InputStream) +# Validate object, throws exceptions if there is a mismatch and deletes the associated file +verify_object(objInfo, checksum, checksumAlgorithn, objSize) +# Tag object, makes the object discoverable (find, retrieve, delete) +tag_object(pid, cid) +``` + +**How do I retrieve an object if I have the pid?** +- To retrieve an object, call the Public API method `retrieve_object` which opens a stream to the object if it exists. + +**How do I find an object or check that it exists if I have the pid?** +- To find the location of the object, call the Public API method `find_object` which will return the content identifier (cid) of the object. +- This cid can then be used to locate the object on disk by following HashStore's store configuration. + +**How do I delete an object if I have the pid?** +- To delete an object, call the Public API method `delete_object` which will delete the object and its associated references and reference files where relevant. +- Note, `delete_object` and `tag_object` calls are synchronized on their content identifier values so that the shared reference files are not unintentionally modified concurrently. An object that is in the process of being deleted should not be tagged, and vice versa. These calls have been implemented to occur sequentially to improve clarity in the event of an unexpected conflict or issue. + + +###### Working with metadata (store, retrieve, delete) + +HashStore's '/metadata' directory holds all metadata for objects stored in HashStore. To differentiate between metadata documents for a given object, HashStore includes the 'format_id' (format or namespace of the metadata) when generating the address of the metadata document to store (the hash of the 'pid' + 'format_id'). By default, calling `store_metadata` will use HashStore's default metadata namespace as the 'format_id' when storing metadata. Should the calling app wish to store multiple metadata files about an object, the client app is expected to provide a 'format_id' that represents an object format for the metadata type (ex. `store_metadata(stream, pid, format_id)`). + +**How do I retrieve a metadata file?** +- To find a metadata object, call the Public API method `retrieve_metadata` which returns a stream to the metadata file that's been stored with the default metadata namespace if it exists. +- If there are multiple metadata objects, a 'format_id' must be specified when calling `retrieve_metadata` (ex. `retrieve_metadata(pid, format_id)`) + +**How do I delete a metadata file?** +- Like `retrieve_metadata`, call the Public API method `delete_metadata` which will delete the metadata object associated with the given pid. +- If there are multiple metadata objects, a 'format_id' must be specified when calling `delete_metadata` to ensure the expected metadata object is deleted. + + +###### What are HashStore reference files? + +HashStore assumes that every object to store has a respective identifier. This identifier is then used when storing, retrieving and deleting an object. In order to facilitate this process, we create two types of reference files: +- pid (persistent identifier) reference files +- cid (content identifier) reference files + +These reference files are implemented in HashStore underneath the hood with no expectation for modification from the calling app/client. The one and only exception to this process when the calling client/app does not have an identifier, and solely stores an objects raw bytes in HashStore (calling `store_object(stream)`). + +**'pid' Reference Files** +- Pid (persistent identifier) reference files are created when storing an object with an identifier. +- Pid reference files are located in HashStores '/refs/pid' directory +- If an identifier is not available at the time of storing an object, the calling app/client must create this association between a pid and the object it represents by calling `tag_object` separately. +- Each pid reference file contains a string that represents the content identifier of the object it references +- Like how objects are stored once and only once, there is also only one pid reference file for each object. + +**'cid' Reference Files** +- Cid (content identifier) reference files are created at the same time as pid reference files when storing an object with an identifier. +- Cid reference files are located in HashStore's '/refs/cid' directory +- A cid reference file is a list of all the pids that reference a cid, delimited by a new line ("\n") character + + +###### What does HashStore look like? + +``` +# Example layout in HashStore with a single file stored along with its metadata and reference files. +# This uses a store depth of 3, with a width of 2 and "SHA-256" as its default store algorithm +## Notes: +## - Objects are stored using their content identifier as the file address +## - The reference file for each pid contains a single cid +## - The reference file for each cid contains multiple pids each on its own line + +.../metacat/hashstore/ +└─ objects + └─ /d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2 +└─ metadata + └─ /15/8d/7e/55c36a810d7c14479c9...b20d7df66768b04 +└─ refs + └─ pid/0d/55/5e/d77052d7e166017f779...7230bcf7abcef65e + └─ cid/d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2 +hashstore.yaml ``` +## Development build + +HashStore is a python package, and built using the [Python Poetry](https://python-poetry.org) build tool. + +To install `hashstore` locally, create a virtual environment for python 3.9+, +install poetry, and then install or build the package with `poetry install` or `poetry build`, respectively. + +To run tests, navigate to the root directory and run `pytest -s`. The test suite contains tests that +take a longer time to run (relating to the storage of large files) - to execute all tests, run +`pytest --run-slow`. To see detailed + +## HashStore Client + How to use HashStore client (command line app) ```sh # Step 1: Create a HashStore -$ python './src/hashstore/client.py' /path/to/store/ -chs -dp=3 -wp=2 -ap=SHA-256 -nsp="http://www.ns.test/v1" +$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -chs -dp=3 -wp=2 -ap=SHA-256 -nsp="http://www.ns.test/v1" # Get the checksum of a data object -$ python './src/hashstore/client.py' /path/to/store/ -getchecksum -pid=content_identifier -algo=SHA-256 +$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -getchecksum -pid=content_identifier -algo=SHA-256 # Store a data object -$ python './src/hashstore/client.py' /path/to/store/ -storeobject -pid=content_identifier -path=/path/to/object +$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -storeobject -pid=content_identifier -path=/path/to/object # Store a metadata object -$ python './src/hashstore/client.py' /path/to/store/ -storemetadata -pid=content_identifier -path=/path/to/metadata/object -formatid=http://ns.dataone.org/service/types/v2.0 +$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -storemetadata -pid=content_identifier -path=/path/to/metadata/object -formatid=http://ns.dataone.org/service/types/v2.0 # Retrieve a data object -$ python './src/hashstore/client.py' /path/to/store/ -retrieveobject -pid=content_identifier +$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -retrieveobject -pid=content_identifier # Retrieve a metadata object -$ python './src/hashstore/client.py' /path/to/store/ -retrievemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0 +$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -retrievemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0 # Delete a data object -$ python './src/hashstore/client.py' /path/to/store/ -deleteobject -pid=content_identifier +$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -deleteobject -pid=content_identifier # Delete a metadata file -$ python './src/hashstore/client.py' /path/to/store/ -deletemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0 +$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -deletemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0 ``` ## License diff --git a/src/hashstore/filehashstore.py b/src/hashstore/filehashstore.py index cec12da5..108a406e 100644 --- a/src/hashstore/filehashstore.py +++ b/src/hashstore/filehashstore.py @@ -1857,10 +1857,12 @@ def delete(self, entity, file): try: os.remove(realpath) - except OSError: - pass - else: - self._remove_empty(os.path.dirname(realpath)) + except OSError as err: + exception_string = ( + f"FileHashStore - delete(): Unexpected {err=}, {type(err)=}" + ) + logging.error(exception_string) + raise err def _remove_empty(self, subpath): """Successively remove all empty folders starting with `subpath` and diff --git a/src/hashstore/client.py b/src/hashstore/hashstoreclient.py similarity index 100% rename from src/hashstore/client.py rename to src/hashstore/hashstoreclient.py diff --git a/tests/test_hashstore_client.py b/tests/test_hashstore_client.py index 96c9ad45..8f176452 100644 --- a/tests/test_hashstore_client.py +++ b/tests/test_hashstore_client.py @@ -2,7 +2,7 @@ import sys import os from pathlib import Path -from hashstore import client +from hashstore import hashstoreclient def test_create_hashstore(tmp_path): @@ -29,7 +29,7 @@ def test_create_hashstore(tmp_path): sys.path.append(client_directory) # Manually change sys args to simulate command line arguments sys.argv = chs_args - client.main() + hashstoreclient.main() hashstore_yaml = Path(client_test_store + "/hashstore.yaml") hashstore_object_path = Path(client_test_store + "/objects") @@ -66,7 +66,7 @@ def test_get_checksum(capsys, store, pids): sys.path.append(client_directory) # Manually change sys args to simulate command line arguments sys.argv = chs_args - client.main() + hashstoreclient.main() capsystext = capsys.readouterr().out expected_output = ( @@ -99,7 +99,7 @@ def test_store_object(store, pids): sys.path.append(client_directory) # Manually change sys args to simulate command line arguments sys.argv = chs_args - client.main() + hashstoreclient.main() assert store.exists("objects", pids[pid][store.algorithm]) @@ -131,7 +131,7 @@ def test_store_metadata(store, pids): sys.path.append(client_directory) # Manually change sys args to simulate command line arguments sys.argv = chs_args - client.main() + hashstoreclient.main() assert store.exists("metadata", pids[pid]["metadata_cid"]) @@ -159,7 +159,7 @@ def test_retrieve_objects(capsys, pids, store): sys.path.append(client_directory) # Manually change sys args to simulate command line arguments sys.argv = chs_args - client.main() + hashstoreclient.main() object_stream = store.retrieve_object(pid) object_content = ( @@ -201,7 +201,7 @@ def test_retrieve_metadata(capsys, pids, store): sys.path.append(client_directory) # Manually change sys args to simulate command line arguments sys.argv = chs_args - client.main() + hashstoreclient.main() metadata_stream = store.retrieve_metadata(pid, namespace) metadata_content = ( @@ -239,7 +239,7 @@ def test_delete_objects(pids, store): sys.path.append(client_directory) # Manually change sys args to simulate command line arguments sys.argv = chs_args - client.main() + hashstoreclient.main() assert not store.exists("objects", pids[pid][store.algorithm]) @@ -271,6 +271,6 @@ def test_delete_metadata(pids, store): sys.path.append(client_directory) # Manually change sys args to simulate command line arguments sys.argv = chs_args - client.main() + hashstoreclient.main() assert not store.exists("metadata", pids[pid]["metadata_cid"])