Refactor into more lightweight service to be deployed not only on drogon #64
@yarikoptic Here's a possible new flow of operations for performing a backup; would this be acceptable?
Questions & Problems:
@yarikoptic Ping.
I have a concern/desire for enhancement for …, similarly to how we already "cache" access status (…).
Pros:
Cons:
Similarly, let's establish …
You meant "superdataset", right? Overall, this step is already performed, so all "good" here.
With the above refactoring, it becomes pretty much disconnected from GitHub and could work with any other hosting, in the scope of these particular changes.
I think we should keep them. Utopian me would have loved to be able to keep our hierarchy on …
Additional "concern(s)" from my side would be:
experiment 1 -- staging changes to a subdataset and dropping it: we can do it, but we cannot then commit pointing to that subdataset path, so it would need to "stage everything" and "commit all staged changes"
❯ echo "bogus: 123" >> 000003/dandiset.yaml
❯ git status
On branch draft
Your branch is up to date with 'origin/draft'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
(commit or discard the untracked or modified content in submodules)
modified: 000003 (modified content)
no changes added to commit (use "git add" and/or "git commit -a")
❯ datalad save -m "bogus update" 000003/dandiset.yaml
add(ok): dandiset.yaml (file)
save(ok): 000003 (dataset)
action summary:
add (ok: 1)
save (ok: 1)
❯ git status
On branch draft
Your branch is up to date with 'origin/draft'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: 000003 (new commits)
no changes added to commit (use "git add" and/or "git commit -a")
❯ git add 000003
❯ git status
On branch draft
Your branch is up to date with 'origin/draft'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: 000003
❯ datalad uninstall 000003
uninstall(error): 000003 (dataset) [to-be-dropped dataset has revisions that are not available at any known sibling. Use `datalad push --to ...` to push these before dropping the local dataset, or ignore via `--nocheck`. Unique revisions: ['draft']]
❯ datalad uninstall --nocheck 000003
uninstall(ok): 000003 (dataset)
❯ git status
On branch draft
Your branch is up to date with 'origin/draft'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: 000003
❯ git commit -m 'Committing already staged changes for 000003' 000003
error: '000003' does not have a commit checked out
fatal: updating files failed
❯ git commit -m 'Committing already staged changes for 000003'
[draft 1368fe3] Committing already staged changes for 000003
1 file changed, 1 insertion(+), 1 deletion(-)
❯ git show
commit 1368fe36d9a4155fb45ea1a7280ba3b29d2f46e7 (HEAD -> draft)
Author: Yaroslav Halchenko <debian@onerussian.com>
Date: Mon Feb 10 11:30:52 2025 -0500
Committing already staged changes for 000003
diff --git a/000003 b/000003
index 15772db..e613fce 160000
--- a/000003
+++ b/000003
@@ -1 +1 @@
-Subproject commit 15772db708c68cc37332b1e71f5d7e637716b95b
+Subproject commit e613fce640b060c2b456c8680397529cf35932bf
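For reference, a minimal Python sketch of that "stage everything, drop, commit all staged changes" flow, driving the same CLI commands as in the transcript above; the helper names and commit message are illustrative, not existing code:

import subprocess
from pathlib import Path

def run(*cmd: str, cwd: Path) -> None:
    # Thin wrapper so each step mirrors the shell transcript above
    subprocess.run(cmd, cwd=cwd, check=True)

def save_and_drop_subdataset(superds: Path, subds: str, message: str) -> None:
    # 1. Save the changes inside the subdataset (advances its HEAD)
    run("datalad", "save", "-m", message, cwd=superds / subds)
    # 2. Stage the new submodule commit in the superdataset
    run("git", "add", subds, cwd=superds)
    # 3. Drop the local subdataset; --nocheck is needed because the new
    #    commit has not been pushed to any sibling yet (see the error above)
    run("datalad", "uninstall", "--nocheck", subds, cwd=superds)
    # 4. Commit *all* staged changes; "git commit -m ... <subds>" would fail
    #    with "does not have a commit checked out" once the subdataset is gone
    run("git", "commit", "-m", f"Committing already staged changes for {subds}", cwd=superds)

In a real run the new subdataset commit would of course have to be pushed to a sibling before (or instead of) resorting to --nocheck, as the uninstall error message suggests.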
Yes.
The program would still be installing subdatasets from GitHub. Even if the tests mimicked this by installing/cloning from a local path, that would involve creating a full backup so that parts of it can be cloned by …
it would install subdatasets from wherever those subdatasets are listed as available from within …
Given that the entire archive/API is easily deployable, as we do in dandi-cli, in principle we could script an entire e2e interaction scenario, but we would need to abstract those few interfaces where we deal with GitHub directly ATM into some adapter class, and then have a "local" version or later potentially even https://codeberg.org/forgejo-aneksajo/forgejo-aneksajo to get away from GitHub into a system with native git-annex support.
❯ git grep 'def.*github'
src/backups2datalad/__main__.py:async def update_github_metadata(
src/backups2datalad/adataset.py: async def has_github_remote(self) -> bool:
src/backups2datalad/adataset.py: async def create_github_sibling(
src/backups2datalad/datasetter.py: async def ensure_github_remote(self, ds: AsyncDataset, dandiset_id: str) -> None:
src/backups2datalad/datasetter.py: async def update_github_metadata(
src/backups2datalad/manager.py: async def edit_github_repo(self, repo: GHRepo, **kwargs: Any) -> None:
src/backups2datalad/manager.py: async def _set_github_description(
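As a rough illustration only (all class and method names below are hypothetical, not a proposed final API), an adapter boundary around those GitHub-touching functions could look roughly like this:

from abc import ABC, abstractmethod
from typing import Any

class HostingAdapter(ABC):
    """Abstraction over the hosting service (GitHub today), so a "local" or
    forgejo-aneksajo backend could be swapped in for tests or later use."""

    @abstractmethod
    async def ensure_sibling(self, dataset_path: str, repo_name: str) -> None:
        """Create the remote repository and configure it as a sibling
        (roughly what create_github_sibling / ensure_github_remote do now)."""

    @abstractmethod
    async def edit_repo(self, repo_name: str, **kwargs: Any) -> None:
        """Edit repository settings such as the description
        (roughly edit_github_repo / _set_github_description)."""

    @abstractmethod
    async def update_metadata(self, repo_name: str, metadata: dict[str, Any]) -> None:
        """Propagate updated dandiset metadata to the hosting side
        (roughly update_github_metadata)."""

class GitHubAdapter(HostingAdapter):
    ...  # would wrap the existing GitHub-specific code listed above

class LocalAdapter(HostingAdapter):
    ...  # e.g. bare repositories on the local filesystem, for e2e tests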
But for testing purposes, the tests would need to create a backup of all Dandisets — separate from the one being operated on directly by the tested code — so it'd have something to clone, and the backups would have to be out of date so there'd be something to update.
In principle, since our …
A quick & dirty alternative, if we would like to test really against what we already have available in the archive: we could create some …
@yarikoptic If we're storing Dandisets' last modified dates in the superdataset's …
Good question... Duplication is indeed evil, unless warranted ;) I feel like storing that information within the dandiset makes sense due to containment, but I fail to come up with a use-case where it would be needed, since our update backups2datalad utility ATM relies on having the superdataset anyway. Moreover, those dates largely correspond to the latest commit date (unless there was some manual fixup etc. on top), so for a user it is possible to assess recency from those:
(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ for d in 0000*; do echo $d; grep -h timestamp $d/.dandi/assets-state.json; git -C $d show --format=fuller | grep AuthorDate; done | head -n 20
000003
"timestamp": "2024-05-18T17:13:27.131814Z"
AuthorDate: 2024 May 18 13:13:27 -0400
000004
"timestamp": "2024-05-18T17:53:27.096283Z"
AuthorDate: 2024 May 18 13:53:27 -0400
000005
"timestamp": "2023-06-20T00:56:15.296753Z"
AuthorDate: 2023 Jun 19 20:56:15 -0400
000006
"timestamp": "2024-05-18T17:47:27.045628Z"
AuthorDate: 2024 May 18 13:47:27 -0400
000007
"timestamp": "2024-05-18T17:46:27.027987Z"
AuthorDate: 2024 May 18 13:46:27 -0400
000008
"timestamp": "2024-05-18T18:01:28.459112Z"
AuthorDate: 2024 May 18 14:01:28 -0400
000009
"timestamp": "2024-05-18T17:45:27.168815Z"
Overall, I would not mind if you move storing that stamp within …
Would require review and redesign of some operations so that they would not require maintaining a local copy of all dandisets and zarrs, but would retain only a very lean "status" DB (a JSON file) with timestamps for each dandiset and zarr repository, so that whenever an update is detected in the archive, the necessary components would be cloned, updated, pushed, and removed locally.
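As a rough sketch of what such a lean "status" DB could look like (the file layout, field names, and helpers below are assumptions for illustration, not an agreed design):

import json
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class BackupStatus:
    """Per-repository "last backed up" timestamps, kept instead of full local clones."""
    dandisets: dict[str, str] = field(default_factory=dict)  # dandiset id -> ISO timestamp
    zarrs: dict[str, str] = field(default_factory=dict)      # zarr id -> ISO timestamp

def load_status(path: Path) -> BackupStatus:
    if path.exists():
        return BackupStatus(**json.loads(path.read_text()))
    return BackupStatus()

def needs_update(status: BackupStatus, dandiset_id: str, archive_modified: str) -> bool:
    # Clone/update/push/remove the dandiset locally only when the archive's
    # modification timestamp is newer than the one recorded here; ISO-8601
    # timestamps in a consistent format compare correctly as strings.
    return status.dandisets.get(dandiset_id, "") < archive_modified

def save_status(status: BackupStatus, path: Path) -> None:
    path.write_text(json.dumps(
        {"dandisets": status.dandisets, "zarrs": status.zarrs}, indent=2
    ))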
Possible related issue