
Refactor into more lightweight service to be deployed not only on drogon #64

Open
yarikoptic opened this issue Feb 4, 2025 · 9 comments

yarikoptic commented Feb 4, 2025

This would require reviewing and redesigning some operations so that they no longer need a local copy of all dandisets and zarrs, retaining only a very lean "status" DB (a JSON file) with a timestamp per dandiset and zarr repository. Whenever an update is detected in the archive, the necessary components would be cloned, updated, pushed, and then removed locally.
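
A minimal sketch of what such a lean "status" DB could look like (file name, layout, and helper names here are hypothetical, not an agreed design):

import json
from pathlib import Path

# Hypothetical single JSON file mapping each dandiset/zarr repository to the
# last-modified timestamp seen during the previous backup run.
STATUS_FILE = Path("backup-status.json")

def load_status() -> dict:
    if STATUS_FILE.exists():
        return json.loads(STATUS_FILE.read_text())
    return {"dandisets": {}, "zarrs": {}}

def save_status(status: dict) -> None:
    STATUS_FILE.write_text(json.dumps(status, indent=2, sort_keys=True))

# Illustrative example content:
# {"dandisets": {"000003": "2024-05-18T17:13:27.131814Z"},
#  "zarrs": {"<zarr-id>": "2023-06-20T00:56:15Z"}}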

Possibly related issue:

  • last-modified is still not updated in the archive for some changes (TODO: refs here)
jwodder commented Feb 4, 2025

@yarikoptic Here's a possible new flow of operations for performing a backup; would this be acceptable?

  • Clone the superdataset, but do not populate any of its subdatasets
  • For each Dandiset in the Archive:
    • If one or more of the following are true, clone the subdataset and update it:
      • There is no subdataset for the Dandiset in the superdataset
      • The Dandiset's draft version's modified timestamp is later than the date stored in the .dandi/assets-state.json file in the HEAD commit of the GitHub repository for the backup (fetched via a GitHub blob URL)
        • Alternatively, the timestamp of the backup could be retrieved by querying the GitHub API for the author date of the HEAD commit
      • One or more published versions of the Dandiset do not have corresponding tags in the GitHub repository (as checked via the GitHub API)
    • When updating a Dandiset backup, for each Zarr in the Dandiset, if the Zarr backup needs an update as determined by the --zarr-mode option, clone the Zarr and update it
  • Make a commit to the subdataset that updates the states of all the updated Dandiset backups
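
A rough sketch of that decision logic in Python-like pseudocode (superds, archive, gh, and update_dandiset are illustrative stand-ins, not the actual backups2datalad API):

# For each Dandiset, decide whether its backup needs work, clone only then,
# and finally record the updated states in one commit.
async def backup_all(superds, archive, gh) -> None:
    for dandiset in archive.get_dandisets():
        needs_update = (
            dandiset.identifier not in superds.subdataset_ids()            # no subdataset yet
            or dandiset.draft_modified > gh.backup_assets_state(dandiset)  # draft newer than backup
            or not gh.has_tags_for(dandiset.published_versions())          # missing version tags
        )
        if needs_update:
            subds = superds.clone_subdataset(dandiset.identifier)
            await update_dandiset(subds, dandiset)  # incl. per-Zarr updates per --zarr-mode
    await superds.save("Update states of all updated Dandiset backups")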

Questions & Problems:

  • Since this workflow is based around cloning things from GitHub and accessing data over the GitHub API, how would it be tested?
  • Should the "force" and "verify" modes be kept? They would require cloning all of the Dandiset & Zarr repositories.

jwodder commented Feb 10, 2025

@yarikoptic Ping.

yarikoptic (Member Author) commented:

I have a concern / desire for enhancement regarding

The Dandiset's draft version's modified timestamp is later than the date stored in the .dandi/assets-state.json file in the HEAD commit of the GitHub repository for the backup (fetched via a GitHub blob URL)

Similarly to how we already "cache" the access status (github-access-status) within the superdataset's .gitmodules, let's cache that timestamp from .dandi/assets-state.json there as well (e.g. as a dandiset-last-modified field).
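
A minimal sketch of how such a cached field could be read and written in the superdataset's .gitmodules (assuming the submodule name equals the dandiset identifier; the field name dandiset-last-modified is the proposal above):

import subprocess

def get_cached_modified(superds: str, dandiset_id: str) -> str | None:
    # Return the cached timestamp, or None if the field is not set yet.
    r = subprocess.run(
        ["git", "config", "-f", ".gitmodules",
         f"submodule.{dandiset_id}.dandiset-last-modified"],
        cwd=superds, capture_output=True, text=True,
    )
    return r.stdout.strip() or None

def set_cached_modified(superds: str, dandiset_id: str, timestamp: str) -> None:
    subprocess.run(
        ["git", "config", "-f", ".gitmodules",
         f"submodule.{dandiset_id}.dandiset-last-modified", timestamp],
        cwd=superds, check=True,
    )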

Pros:

  • would avoid doing hundreds of queries to the GitHub API per update cycle

Cons:

  • any commit that updates the state of the git submodule for a dandiset would also require a commit to .gitmodules. I think the price is minimal.
    • a possible alternative here is to use https://git-scm.com/docs/git-notes to add, within the superdataset, a note with the updated subdataset timestamp on that subdataset update commit. This would not require committing .gitmodules, but on the other hand it would add management (fetching/pushing) of those notes, and I am not really familiar with them, so it is likely not worth the hassle here.

One or more published versions of the Dandiset do not have corresponding tags in the GitHub repository (as checked via the GitHub API)

Similarly, let's establish a dandiset-git-tagged-versions field within .gitmodules. As I do not expect any "," in release version identifiers, it could be a single field containing the sorted, comma-joined versions. This would avoid hundreds of other API calls.

  • In some --mode verify run, we could potentially check that the list corresponds to the actual tags.
  • Generally we would not need to fetch all the tags; we could just "extend" that list with new tags found locally (e.g. if a given update invocation created new tags).
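
A small sketch of how that field could be maintained (purely illustrative; only the sorted, comma-joined representation is the proposal above):

def extend_tagged_versions(current: str | None, new_tags: list[str]) -> str:
    # Merge previously recorded versions with tags created by this invocation
    # and return the sorted, comma-joined value to store in .gitmodules.
    versions = set(current.split(",")) if current else set()
    versions.update(new_tags)
    return ",".join(sorted(versions))

# e.g. extend_tagged_versions("0.210831.2033", ["0.230101.1200"])
#      -> "0.210831.2033,0.230101.1200"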

Make a commit to the subdataset that updates the states of all the updated Dandiset backups

You meant "superdataset", right? Overall, this step is already performed, so all "good" here.

Since this workflow is based around cloning things from GitHub and accessing data over the GitHub API, how would it be tested?

With the above refactoring, it becomes pretty much disconnected from GitHub and could work with any other hosting within the scope of these particular changes.

Should the "force" and "verify" modes be kept? They would require cloning all of the Dandiset & Zarr repositories.

I think we should keep them. The utopian me would have loved to keep our hierarchy of all dandisets and dandizarrs on drogon, be able to update them to stay in sync with GitHub, and, if so desired, run that force or verify manually on drogon. So maybe cloning should be done only if the target path for a dandiset does not already contain a clone, and 'uninstall' (see candidate 1 below) should be attempted only if the subdataset was in fact installed by us and not already present on the system.
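
A sketch of that conditional cloning (install_subdataset is a hypothetical helper wrapping the actual clone/install):

from pathlib import Path

def ensure_subdataset(superds: Path, dandiset_id: str) -> bool:
    # Clone only if the target path does not already contain a clone;
    # return True if we installed it, so only then is it uninstalled later.
    target = superds / dandiset_id
    if (target / ".git").exists():
        return False
    install_subdataset(target)  # hypothetical helper
    return True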

Additional "concern(s)" from my side would be:

  • it might be prohibitive to install all of the subdatasets at the same time. E.g. imagine all 700 dandisets got their metadata updated

    • candidate 1: stage the subdataset and uninstall it as soon as it is staged; see "experiment 1" below -- should be possible
    • candidate 2: we might need to "batch" updates: as soon as the next 10 dandisets are found to be updated, record their updated states within .gitmodules, commit those 10, and then datalad uninstall those subdatasets. Might be cumbersome, etc.
  • if we implement it as a CI job on GitHub -- we could add con/tinuous monitoring/backup of logs (great!)

  • to ease troubleshooting -- it would be great to create a tarball of a failed state and upload it as an artifact -- then con/tinuous could fetch/archive it. ATM the superdataset is just 24M of .git/objects.

    • I do not think it would be feasible to attach the entire hierarchy of dandisets, since it might be too large in a given invocation.
experiment 1 -- staging changes to a subdataset and dropping it: we can do it, but we cannot then commit pointing to that subdataset path, so it would need to "stage everything" and then "commit all staged changes"
❯ echo "bogus: 123" >> 000003/dandiset.yaml
❯ git status
On branch draft
Your branch is up to date with 'origin/draft'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
	modified:   000003 (modified content)

no changes added to commit (use "git add" and/or "git commit -a")
❯ datalad save -m "bogus update" 000003/dandiset.yaml
add(ok): dandiset.yaml (file)                                                                                 
save(ok): 000003 (dataset)                                                                                    
action summary:                                                                                               
  add (ok: 1)
  save (ok: 1)
❯ git status
On branch draft
Your branch is up to date with 'origin/draft'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   000003 (new commits)

no changes added to commit (use "git add" and/or "git commit -a")
❯ git add 000003
❯ git status
On branch draft
Your branch is up to date with 'origin/draft'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	modified:   000003

❯ datalad uninstall 000003
uninstall(error): 000003 (dataset) [to-be-dropped dataset has revisions that are not available at any known sibling. Use `datalad push --to ...` to push these before dropping the local dataset, or ignore via `--nocheck`. Unique revisions: ['draft']]
❯ datalad uninstall --nocheck 000003
uninstall(ok): 000003 (dataset)                                                                               
❯ git status
On branch draft
Your branch is up to date with 'origin/draft'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	modified:   000003

❯ git commit -m 'Committing already staged changes for 000003' 000003
error: '000003' does not have a commit checked out
fatal: updating files failed
❯ git commit -m 'Committing already staged changes for 000003'
[draft 1368fe3] Committing already staged changes for 000003
 1 file changed, 1 insertion(+), 1 deletion(-)
❯ git show
commit 1368fe36d9a4155fb45ea1a7280ba3b29d2f46e7 (HEAD -> draft)
Author: Yaroslav Halchenko <debian@onerussian.com>
Date:   Mon Feb 10 11:30:52 2025 -0500

    Committing already staged changes for 000003

diff --git a/000003 b/000003
index 15772db..e613fce 160000
--- a/000003
+++ b/000003
@@ -1 +1 @@
-Subproject commit 15772db708c68cc37332b1e71f5d7e637716b95b
+Subproject commit e613fce640b060c2b456c8680397529cf35932bf
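
The same "stage, drop, commit without pathspec" sequence as a Python sketch (a minimal illustration of candidate 1, not existing backups2datalad code):

import subprocess

def stage_drop_commit(superds: str, dandiset_id: str) -> None:
    def run(*cmd: str) -> None:
        subprocess.run(cmd, cwd=superds, check=True)
    run("git", "add", dandiset_id)                         # stage the updated submodule pointer
    run("datalad", "uninstall", "--nocheck", dandiset_id)  # drop the local clone
    # Commit without a pathspec: 'git commit <path>' fails once the
    # subdataset no longer has a commit checked out (see the error above).
    run("git", "commit", "-m", f"Committing already staged changes for {dandiset_id}")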

jwodder commented Feb 10, 2025

@yarikoptic

You meant "superdataset", right?

Yes.

With the above refactoring, it becomes pretty much disconnected from GitHub,

The program would still be installing subdatasets from GitHub. Even if the tests mimicked this by installing/cloning from a local path, that would involve creating a full backup so that parts of it can be cloned by backups2datalad, which seems awkward.

yarikoptic (Member Author) commented:

It would install subdatasets from wherever those subdatasets are listed as available within .gitmodules.

Given that the entire archive/API is easily deployable, as we do in dandi-cli, in principle we could script an entire e2e interaction scenario. We would, however, need to abstract the few interfaces through which we currently deal with GitHub directly into some adapter class, and then have a "local" version of it, or later potentially even https://codeberg.org/forgejo-aneksajo/forgejo-aneksajo, to get away from GitHub to a system with native git-annex support:

❯ git grep 'def.*github'
src/backups2datalad/__main__.py:async def update_github_metadata(
src/backups2datalad/adataset.py:    async def has_github_remote(self) -> bool:
src/backups2datalad/adataset.py:    async def create_github_sibling(
src/backups2datalad/datasetter.py:    async def ensure_github_remote(self, ds: AsyncDataset, dandiset_id: str) -> None:
src/backups2datalad/datasetter.py:    async def update_github_metadata(
src/backups2datalad/manager.py:    async def edit_github_repo(self, repo: GHRepo, **kwargs: Any) -> None:
src/backups2datalad/manager.py:    async def _set_github_description(
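
A minimal sketch of what such an adapter boundary could look like (class and method names are hypothetical; the actual methods would mirror the GitHub-specific ones listed above):

from typing import Any, Protocol

class ForgeClient(Protocol):
    # Forge-agnostic operations currently performed directly against GitHub.
    async def ensure_sibling(self, dandiset_id: str) -> None: ...
    async def edit_repo(self, repo: str, **kwargs: Any) -> None: ...
    async def set_description(self, repo: str, description: str) -> None: ...
    async def get_tags(self, repo: str) -> list[str]: ...

class LocalForgeClient:
    """Test double backed by bare repositories on the local filesystem."""
    async def ensure_sibling(self, dandiset_id: str) -> None:
        ...  # create a local bare repo instead of a GitHub sibling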

jwodder commented Feb 10, 2025

@yarikoptic

It would install subdatasets from wherever those subdatasets are listed as available within .gitmodules.

But for testing purposes, the tests would need to create a backup of all Dandisets — separate from the one being operated on directly by the tested code — so it'd have something to clone, and the backups would have to be out of date so there'd be something to update.

yarikoptic (Member Author) commented:

In principle, since our update operation is also taking care of creating new repos for new dandisets and GitHub repos for them (create_github_sibling), a test could first establish some version of a dandiset which would then be cloned. There could even be a fixture to create such an initial state.
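
For example, such a fixture might look roughly like this (all names here are hypothetical; the real fixture would populate whatever layout the update code expects):

import subprocess
import pytest

@pytest.fixture
def initial_backup(tmp_path):
    # A bare repository standing in for the remote of the superdataset,
    # plus a working area handed to the code under test.
    origin = tmp_path / "origin" / "dandisets.git"
    workdir = tmp_path / "work"
    origin.mkdir(parents=True)
    workdir.mkdir()
    subprocess.run(["git", "init", "--bare", str(origin)], check=True)
    # ... populate a superdataset with one dandiset pinned to an older state
    #     and push it to `origin`, so the tested update has something stale to act on
    return {"origin": origin, "workdir": workdir}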

As a quick & dirty alternative, if we would like to test against what we already have available in the archive, we could create some dandisets-testing superdataset with only e.g. the 000027 dandiset, pointing to the state corresponding to some prior version, and invoke the update for 000027 alone. Running the test should then update it to its current state within that superdataset, while mocking away the actual pushing to GitHub.

jwodder commented Feb 12, 2025

@yarikoptic If we're storing Dandisets' last modified dates in the superdataset's .gitmodules, is there any point in keeping the .dandi/assets-state.json files in the Dandiset backups?

yarikoptic (Member Author) commented:

Good question... Duplication is indeed evil, unless warranted ;) I feel like storing that information within the dandiset makes sense due to containment, but I fail to come up with a use-case where it would be needed, since our backups2datalad update utility ATM relies on having the superdataset anyways.

Moreover, those dates largely correspond to the latest commit date (unless there was some manual fixup etc. on top), so a user could assess recency from those:
(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ for d in 0000*; do echo $d; grep -h timestamp $d/.dandi/assets-state.json; git -C $d show --format=fuller | grep AuthorDate; done | head -n 20
000003
    "timestamp": "2024-05-18T17:13:27.131814Z"
AuthorDate: 2024 May 18 13:13:27 -0400
000004
    "timestamp": "2024-05-18T17:53:27.096283Z"
AuthorDate: 2024 May 18 13:53:27 -0400
000005
    "timestamp": "2023-06-20T00:56:15.296753Z"
AuthorDate: 2023 Jun 19 20:56:15 -0400
000006
    "timestamp": "2024-05-18T17:47:27.045628Z"
AuthorDate: 2024 May 18 13:47:27 -0400
000007
    "timestamp": "2024-05-18T17:46:27.027987Z"
AuthorDate: 2024 May 18 13:46:27 -0400
000008
    "timestamp": "2024-05-18T18:01:28.459112Z"
AuthorDate: 2024 May 18 14:01:28 -0400
000009
    "timestamp": "2024-05-18T17:45:27.168815Z"

Overall, I would not mind if you move storing that stamp from .dandi/assets-state.json solely into the superdataset's .gitmodules to avoid "triplication" of this metadata.
