Use diskcache for caching ProtocolDAGResults in the Alchemiscale client #271

Open
wants to merge 15 commits into
base: main

ianmkenney
Member

Addresses, in part, #58. Using the diskcache library, we can cache ProtocolDAGResults client-side and reduce the number of API calls for calculating free energy differences.

* The default Disk used by diskcache pickles Python objects when
  storing them. Instead, we now store byte arrays. Depending on its
  size, a byte array is stored either in the SQLite3 DB or, if it is
  too large (>32 KiB by default), as a separate file.
* A test has been added that checks the hits and misses when pulling
  PDRs using the get_transformation_results method. The in-memory
  LRU cache is cleared manually for accurate stats.
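
Why clearing the in-memory LRU layer matters for the hit/miss counts can be seen in a small two-layer sketch. Everything here is a hypothetical stand-in (`CountingStore`, `download`, `get_result`); the real client uses diskcache and its own result-fetching methods, which are not reproduced.

```python
from functools import lru_cache


class CountingStore:
    """Hypothetical stand-in for an on-disk cache that tracks hits and misses."""

    def __init__(self):
        self._data = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return None

    def add(self, key, value):
        # Only store the value if the key is not already present.
        self._data.setdefault(key, value)


store = CountingStore()


def download(key):
    # Placeholder for an API call that returns serialized bytes.
    return f"result-for-{key}".encode("utf-8")


@lru_cache(maxsize=128)
def get_result(key):
    value = store.get(key)
    if value is None:
        value = download(key)
        store.add(key, value)
    return value


get_result("pdr-1")       # miss in the store, then downloaded and stored
get_result("pdr-1")       # answered by the LRU layer; the store is not consulted
get_result.cache_clear()  # drop the in-memory layer
get_result("pdr-1")       # now reaches the store and counts as a hit

print(store.hits, store.misses)  # 1 1
```

Without the `cache_clear()` call the third lookup would never reach the store, so its hit counter would stay at zero.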
New objects supported:
* Transformations
* AlchemicalNetworks
* ChemicalSystems
* Generally anything that can be a KeyedChain
* With known cached results, corrupt the values and make sure the user
  is warned that there was a problem with deserialization and that a
  new result will be downloaded.
* Lowered the cache size limit for tests to avoid running out of space
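
A minimal sketch of the behaviour the corruption test exercises, assuming a dict-like cache of UTF-8 JSON bytes. The function name `get_or_refetch` and the `fetch` callable are hypothetical, not the client's actual API.

```python
import json
import warnings


def get_or_refetch(cache, key, fetch):
    """Return the decoded object for `key`, refetching if the cached bytes are unusable."""
    raw = cache.get(key)
    if raw is not None:
        try:
            return json.loads(raw.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            # Warn the user, then fall through to a fresh download.
            warnings.warn(
                f"Could not deserialize cached entry {key!r}; downloading a new copy."
            )
    obj = fetch(key)
    cache[key] = json.dumps(obj).encode("utf-8")  # overwrite the bad entry
    return obj


# Usage: a deliberately corrupted entry triggers one warning and a refetch.
cache = {"pdr-1": b"\xff not json"}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = get_or_refetch(cache, "pdr-1", lambda k: {"key": k, "estimate": 1.5})
```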
@ianmkenney ianmkenney linked an issue Apr 30, 2024 that may be closed by this pull request
* Removed unused imports
The AlchemiscaleBaseClient now determines the cache directory
when one is not specified directly (i.e. a None is provided to the
AlchemiscaleBaseClient constructor). When a path to this directory is
provided, it must be a string or pathlib.Path object. The logic for
this operation lies in the `AlchemiscaleBaseClient._determine_cache_dir`
method, which can raise a TypeError on invalid input.
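
The resolution logic described above might look roughly like the following. This is a sketch, not the method from the PR: the `~/.cache` fallback and the `alchemiscale` subdirectory name are assumptions, and the function only computes a path without creating it.

```python
import os
from pathlib import Path


def determine_cache_dir(cache_directory=None):
    """Resolve a cache directory without creating it (hypothetical stand-in)."""
    if cache_directory is None:
        # Assumption: use $XDG_CACHE_HOME when set, otherwise ~/.cache,
        # and append an application-specific subdirectory.
        base = os.environ.get("XDG_CACHE_HOME") or Path.home() / ".cache"
        return Path(base) / "alchemiscale"
    if isinstance(cache_directory, (str, Path)):
        return Path(cache_directory)
    raise TypeError(
        "cache_directory must be a str, a pathlib.Path, or None; "
        f"got {type(cache_directory).__name__}"
    )
```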

The `cache_size_limit` is now verified within the constructor to be
>= 0. If it is not, then a ValueError is raised.
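
As a minimal sketch of that check (the class name `ClientSketch` and the default value are hypothetical; only the `>= 0` rule and the ValueError come from the description above):

```python
class ClientSketch:
    """Hypothetical stand-in for a client constructor that validates its cache size."""

    def __init__(self, cache_size_limit=1073741824):
        # Reject negative limits up front, with a message naming the argument.
        if cache_size_limit < 0:
            raise ValueError(
                f"`cache_size_limit` must be >= 0, got {cache_size_limit}"
            )
        self.cache_size_limit = cache_size_limit
```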

New tests have been added for the above changes:

* Negative cache_size_limit: checks for constructor-raised ValueError
  with a meaningful message.

* cache_directory is None: checks output of the underlying
  _determine_cache_dir method with and without the XDG_CACHE_HOME
  environment variable. If we test it with the client constructor, the
  directory is made automatically, which we don't want in the tests as
  it may touch real data.

* cache_directory is not None, str, or Path: Check that the constructor
  raises a TypeError with a meaningful message.
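
One way to test the environment-dependent default without touching real directories is to patch `os.environ` and call a path-only resolver directly. The sketch below uses a hypothetical `resolve_default_cache_dir` in place of the PR's method; the `~/.cache` fallback and the `alchemiscale` subdirectory are assumptions.

```python
import os
from pathlib import Path
from unittest import mock


def resolve_default_cache_dir():
    # Hypothetical stand-in for the method under test; it only computes a path.
    base = os.environ.get("XDG_CACHE_HOME") or Path.home() / ".cache"
    return Path(base) / "alchemiscale"


def test_with_xdg_cache_home():
    with mock.patch.dict(os.environ, {"XDG_CACHE_HOME": "/tmp/fake-xdg"}):
        assert resolve_default_cache_dir() == Path("/tmp/fake-xdg") / "alchemiscale"


def test_without_xdg_cache_home():
    # Rebuild the environment without XDG_CACHE_HOME and with a fake HOME.
    env = {k: v for k, v in os.environ.items() if k != "XDG_CACHE_HOME"}
    env["HOME"] = "/tmp/fake-home"
    with mock.patch.dict(os.environ, env, clear=True):
        expected = Path.home() / ".cache" / "alchemiscale"
        assert resolve_default_cache_dir() == expected
```

Because nothing is created on disk, the tests can run on a developer machine without touching an existing cache.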
@ianmkenney
Member Author

@dotsdl new error I haven't seen yet, but doesn't seem to touch anything I changed. Will keep an eye out for it again, but going to assume it's a race condition of some sort for now.

This should be ready for review now!

@ianmkenney ianmkenney marked this pull request as ready for review April 30, 2024 21:15
@ianmkenney ianmkenney changed the title [WIP] Use diskcache for caching ProtocolDAGResults in the Alchemiscale client Use diskcache for caching ProtocolDAGResults in the Alchemiscale client Apr 30, 2024
@ianmkenney ianmkenney requested a review from dotsdl April 30, 2024 21:15
Member

@dotsdl dotsdl left a comment

Great work @ianmkenney! Some suggestions, but also one major request: let's perform zstandard compression/decompression of AlchemicalNetworks, Transformations, and ChemicalSystems added/retrieved from the disk cache.

alchemiscale/base/client.py (outdated, resolved)
alchemiscale/base/client.py (resolved)
alchemiscale/interface/client.py (outdated, resolved)

if content is None:
    content = get_content_function()
    keyedchain_json = json.dumps(content, cls=JSON_HANDLER.encoder)
Member

Perhaps we should add an optional kwarg to AlchemiscaleBaseClient._get_resource, such as return_json, that by default is False but we can set in the calls below to True so that we don't need to reserialize here.
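
One possible reading of this suggestion, sketched with a hypothetical `get_resource` function rather than the real `AlchemiscaleBaseClient._get_resource` (whose signature is not shown in this thread): with `return_json=True` the raw JSON text is handed back, so the caller can cache it as-is and decode it once.

```python
import json


def get_resource(fetch_text, return_json=False):
    """Hypothetical stand-in: `fetch_text` returns the raw JSON body of a response."""
    text = fetch_text()
    # With return_json=True the caller gets the text untouched; otherwise it is decoded.
    return text if return_json else json.loads(text)


# Usage: cache the raw text directly instead of decoding and re-encoding it.
cache = {}
raw = get_resource(lambda: '{"name": "benzene"}', return_json=True)
cache["scoped-key"] = raw.encode("utf-8")
content = json.loads(raw)
```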

content = None

try:
    cached_keyed_chain = self._cache.get(str(scopedkey), None).decode("utf-8")
Member

Perhaps we should store zstandard compressed bytes? We could use alchemiscale's own compress_gufe_zstd and decompress_gufe_zstd now.

Member Author

Should be simple to add this in.

if content is None:
    content = get_content_function()
    keyedchain_json = json.dumps(content, cls=JSON_HANDLER.encoder)
    self._cache.add(str(scopedkey), keyedchain_json.encode("utf-8"))
Member

See comment above on use of zstandard. Perhaps we rewrite this method to use the alchemiscale.compression functions to store and retrieve compressed objects?
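
The shape of such a rewrite could be sketched as a store/load pair that keeps compressed bytes in the cache. Note the substitution: the standard library's `zlib` stands in for zstandard here, and alchemiscale's `compress_gufe_zstd` / `decompress_gufe_zstd` are not called because their signatures are not shown in this thread. The helper names and the dict-backed cache are hypothetical.

```python
import json
import zlib


def store_compressed(cache, key, obj):
    """Serialize `obj` to JSON, compress it, and store the bytes under `key`."""
    payload = json.dumps(obj).encode("utf-8")
    cache[key] = zlib.compress(payload)  # zlib as a stand-in for zstandard


def load_compressed(cache, key):
    """Return the decoded object for `key`, or None if it is not cached."""
    blob = cache.get(key)
    if blob is None:
        return None
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Usage: round-trip a small dictionary through the compressed cache.
cache = {}
store_compressed(cache, "scoped-key", {"name": "benzene", "charge": 0})
restored = load_compressed(cache, "scoped-key")
```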

Comment on lines +2056 to +2058
user_client.create_tasks(transformation_sk, count=3)

all_tasks = user_client.get_transformation_tasks(transformation_sk)
Member

Suggested change
- user_client.create_tasks(transformation_sk, count=3)
- all_tasks = user_client.get_transformation_tasks(transformation_sk)
+ all_tasks = user_client.create_tasks(transformation_sk, count=3)


codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 83.92857% with 9 lines in your changes missing coverage. Please review.

Project coverage is 80.64%. Comparing base (5b5cbd6) to head (e058886).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| alchemiscale/interface/client.py | 85.71% | 5 Missing ⚠️ |
| alchemiscale/base/client.py | 80.95% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #271      +/-   ##
==========================================
+ Coverage   80.52%   80.64%   +0.11%     
==========================================
  Files          27       27              
  Lines        3743     3781      +38     
==========================================
+ Hits         3014     3049      +35     
- Misses        729      732       +3     


Development

Successfully merging this pull request may close these issues.

Add ProtocolDAGResult caching to user-facing client
2 participants