Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add benchmarking of various mounting strategies #67

Merged
merged 34 commits into from
Jun 11, 2024
Merged

Add benchmarking of various mounting strategies #67

merged 34 commits into from
Jun 11, 2024

Conversation

jwodder
Copy link
Member

@jwodder jwodder commented Jan 19, 2024

Closes #66.

To do:

  • Implement mounting & unmounting with webdavfs
  • Implement mounting & unmounting with davfs2
  • Implement timing of the following tests:
    • pynwb_open_load_ns
    • matnwb_nwbRead
    • DANDI_CACHE=ignore dandi ls (to load metadata) on a single local asset
  • Add a subcommand for doing all the benchmarking at once (mounting each mount and running & timing of tests)
    • Emit a summary of timing results
    • Add an option for only using specified mount types
    • Add an option for whether to update the local clone of the dandisets repo for fusefs
  • Add a subcommand that just runs & times the tests against a path given on the command line
  • When mounting with fusefs, clone the dandisets repository first if it isn't present at the dataset path.
  • PROBLEM: webdavfs doesn't support redirects: Support redirects for non-collection resources miquels/webdavfs#30
    • Comment out the webdavfs support for now
  • Document how to set up davfs2 for use by dandisets-healthstatus
  • Actually run benchmarks
    • Run on smaug
    • The assets to test should be one (or more?) sample assets of some "typical" size (a few GBs). sub-mouse1-fni16/sub-mouse1-fni16_ses-161228151100.nwb in 000016 is suggested as a possible candidate.

@jwodder jwodder added the enhancement New feature or request therefor label Jan 19, 2024
@jwodder jwodder self-assigned this Jan 19, 2024
Copy link

codecov bot commented Jan 19, 2024

Codecov Report

Attention: Patch coverage is 64.25703% with 89 lines in your changes missing coverage. Please review.

Project coverage is 60.71%. Comparing base (d71a25d) to head (e331c4c).
Report is 78 commits behind head on main.

Files Patch % Lines
code/src/healthstatus/mounts.py 57.27% 46 Missing and 1 partial ⚠️
code/src/healthstatus/tests.py 69.35% 19 Missing ⚠️
code/src/healthstatus/__main__.py 61.53% 10 Missing ⚠️
code/src/healthstatus/checker.py 64.00% 8 Missing and 1 partial ⚠️
code/src/healthstatus/core.py 87.50% 3 Missing ⚠️
code/src/healthstatus/util.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #67      +/-   ##
==========================================
+ Coverage   60.43%   60.71%   +0.27%     
==========================================
  Files           9       10       +1     
  Lines         685      840     +155     
  Branches      169      193      +24     
==========================================
+ Hits          414      510      +96     
- Misses        251      310      +59     
  Partials       20       20              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@satra
Copy link
Member

satra commented Jan 19, 2024

since s3 can have different latencies at different times of the day, let's also make sure we have some estimate of s3 latency during each benchmark if these benchmarks take some time to run. if they run in mins, i would be less worried about latencies, and in such a scenario we should just get multiple estimates to create some error bars.

@jwodder
Copy link
Member Author

jwodder commented Jan 19, 2024

@satra

make sure we have some estimate of s3 latency

How?

@satra
Copy link
Member

satra commented Jan 19, 2024

this is old, but something like this: https://github.com/dvassallo/s3-benchmark

@jwodder
Copy link
Member Author

jwodder commented Jan 22, 2024

@satra This seems like something that should be done separately from dandisets-healthstatus. Trying to integrate it into this PR doesn't seem sensible.

@jwodder jwodder force-pushed the gh-66 branch 2 times, most recently from c237d13 to 2c7bebe Compare January 22, 2024 21:00
@jwodder
Copy link
Member Author

jwodder commented Jan 23, 2024

@yarikoptic Problem: dandisets-healthstatus requires Pydantic 2.0, yet this PR adds an extra dependency on dandi (and dandidav, which requires dandi), which still requires Pydantic 1.x.

@yarikoptic
Copy link
Member

since it all in motion, I think it would be ok to point to that branch you have for dandi-cli with pydantic 2.0 compat

tools/run.sh Outdated
@@ -3,7 +3,7 @@ set -ex

PYTHON="$HOME"/miniconda3/bin/python
DANDISETS_PATH=/mnt/backup/dandi/dandisets-healthstatus/dandisets
MOUNT_PATH=/mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse
MOUNT_PATH=/tmp/dandisets-fuse
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's do it under some safer user-specific location, e.g. /var/run/user/$UID/.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse is the path we've been using on drogon for FUSE-mounting the Dandisets. Also, tools/run.sh is just for generating the healthstatus reports; the shell script for generating the benchmarks is unfinished and hasn't been committed yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a general security concern/practice to avoid: As /tmp is allowed to be written by anyone, it is not recommended to rely on pre-defined paths under there, since attacker potentially cause a problem by establishing a symlink to the target file/location "on your behalf". /mnt/backup/dandi/dandisets-healthstatus/ is allowed to be written only by dedicated user(s).

Copy link
Member Author

@jwodder jwodder Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to update the allowed sudo mount commands to operate on /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse instead of /tmp/dandisets-fuse, and also update /etc/davfs2/davfs2.conf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, on drogon I made following changes under /etc

diff --git a/davfs2/davfs2.conf b/davfs2/davfs2.conf
index 2f0053e..e6034e3 100644
--- a/davfs2/davfs2.conf
+++ b/davfs2/davfs2.conf
@@ -46,7 +46,7 @@
 # ignore_dav_header 0
 # use_compression 0
 # min_propset     0
-# follow_redirect 0
+follow_redirect 1
 # sharepoint_href_bug 0
 # server_charset
 # connect_timeout 10                # seconds
diff --git a/sudoers.d/fuse b/sudoers.d/fuse
new file mode 100644
index 0000000..fa21fd9
--- /dev/null
+++ b/sudoers.d/fuse
@@ -0,0 +1,4 @@
+jwodder ALL=(ALL:ALL) NOPASSWD: /usr/bin/umount /tmp/dandisets-fuse
+jwodder ALL=(ALL:ALL) NOPASSWD: /usr/bin/mount -t davfs https\://webdav.dandiarchive.org /tmp/dandisets-fuse
+jwodder ALL=(ALL:ALL) NOPASSWD: /usr/bin/umount /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse
+jwodder ALL=(ALL:ALL) NOPASSWD: /usr/bin/mount -t davfs https\://webdav.dandiarchive.org /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse

@yarikoptic
Copy link
Member

Please provide results of running such benchmarking across possible solutions e.g. on typhon. (should be less busy ATM)

@yarikoptic
Copy link
Member

scrape that about typhon, I forgot that we rely on having dandisets around. Please do it on drogon.

@jwodder
Copy link
Member Author

jwodder commented Mar 27, 2024

@yarikoptic You initially said here to run the benchmarks on smaug.

@yarikoptic
Copy link
Member

if benchmarks rely on full clone of dandisets/ hierarchy, probably best to just run on drogon. If you want to replicate the hierarchy then indeed can do on smaug or typhon. Choose the host you deem most appropriate for this.

@jwodder
Copy link
Member Author

jwodder commented Mar 27, 2024

@yarikoptic I need permission to sudo-run the following commands on smaug:

/usr/bin/mount -t webdavfs -o allow_other https://webdav.dandiarchive.org /tmp/dandisets-fuse
/usr/bin/mount -t davfs https://webdav.dandiarchive.org /tmp/dandisets-fuse

Note that the colons in the URLs need to be escaped when adding them to the sudoers file.

Also, follow_redirect in /etc/davfs2/davfs2.conf needs to be set to 1.

@yarikoptic
Copy link
Member

done

@jwodder
Copy link
Member Author

jwodder commented Mar 29, 2024

@yarikoptic 42 seconds

@yarikoptic
Copy link
Member

hm. Any ideas on why fuse solution takes that long? how long it takes with datalad-fuse?

@jwodder
Copy link
Member Author

jwodder commented Mar 29, 2024

@yarikoptic I don't know why it's so slow with FUSE, and I don't know how long it would take with FUSE, as the benchmark code kills the process at the 2-hour timeout.

@yarikoptic
Copy link
Member

please make time out 5 hours and run against both fuse solutions -- datalad-fuse and dandidav + davfs2

@yarikoptic
Copy link
Member

ideally: profile datalad-fuse while running the test to see where it spends time.

@jwodder
Copy link
Member Author

jwodder commented Apr 1, 2024

@yarikoptic The matnwb test on datalad-fuse exceeded the five-hour time limit as well.

How exactly should I profile it? Just use py-spy?

@yarikoptic
Copy link
Member

First - py-spy would not hurt indeed.

Then I would have probably added log lines at DEBUG level within datalad-fuse to see what is actually taking time there if py-spy was not conclusive.

@jwodder
Copy link
Member Author

jwodder commented May 30, 2024

@yarikoptic Is there a way to get datalad's logs to include timestamps?

@yarikoptic
Copy link
Member

yes, there is also a number of other possibly helpful options (available through env vars or even git config since defined in common_cfg) for augmenting logging behavior:

❯ pwd
/home/yoh/proj/datalad/datalad-maint
❯ grep DATALAD_LOG CONTRIBUTING.md
- *DATALAD_LOG_LEVEL*:
- *DATALAD_LOG_NAME*:
- *DATALAD_LOG_OUTPUTS*:
- *DATALAD_LOG_PID*
- *DATALAD_LOG_TARGET*
- *DATALAD_LOG_TIMESTAMP*:
- *DATALAD_LOG_TRACEBACK*:
- *DATALAD_LOG_VMEM*:

@jwodder
Copy link
Member Author

jwodder commented May 31, 2024

Disregard

@yarikoptic When I try running datalad -l debug fusefs ... on smaug with DATALAD_LOG_TIMESTAMP=1 set, it crashes with:

Traceback (most recent call last):
  File "/bin/datalad", line 33, in <module>
    sys.exit(load_entry_point('datalad==0.19.5', 'console_scripts', 'datalad')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/bin/datalad", line 25, in importlib_load_entry_point
    return next(matches).load()
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/metadata/__init__.py", line 202, in load
    module = import_module(match.group('module'))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib/python3/dist-packages/datalad/__init__.py", line 112, in <module>
    cfg = ConfigManager()
          ^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 399, in __init__
    self.reload(force=True)
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 460, in reload
    self._stores[store_id] = self._reload(runargs)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 488, in _reload
    stdout, stderr = self._run(
                     ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/config.py", line 869, in _run
    out = self._runner.run(self._config_cmd + args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/datalad/runner/runner.py", line 242, in run
    raise CommandError(
datalad.runner.exception.CommandError: CommandError: 'git --git-dir=/dev/null config -z -l --show-origin' failed with exitcode 1 [err: '/usr/lib/git-annex.linux/git: 2: readlink: not found
/usr/lib/git-annex.linux/git: 6: dirname: not found
** cannot find base directory (I seem to be /usr/lib/git-annex.linux/git)']

I don't know why the datalad executable at /bin/datalad would be executed; when datalad fusefs is run, the first item in PATH is the bin directory in a virtualenv that contains datalad, so I would expect that to be run instead.

EDIT: I realized that I wasn't setting the environment for the datalad fusefs command correctly, so PATH et alii were being wiped out.

@jwodder
Copy link
Member Author

jwodder commented May 31, 2024

@yarikoptic I believe sub-mouse1-fni16/sub-mouse1-fni16_ses-161228151100.nwb in 000016 was a poor choice of asset to test, as it has seemingly always timed out in the normal fusefs tests, and it continues to time out when testing out benchmarking. (Thus, as the timing-out is not specific to the benchmarking, if you want me to investigate it, you should file a separate issue.) Please choose another asset to test the benchmarking on, one that isn't currently marked as timing out.

@yarikoptic
Copy link
Member

Let's try on sub-mouse1-fni16/sub-mouse1-fni16_ses-170808184141.nwb in the same dandiset.. if I read yaml correctly it is ok for pynwb and errors out on matnwb (but does not timeout). In general -- feel welcome to choose any asset you deem appropriate and not too "easy" (fast)

@jwodder
Copy link
Member Author

jwodder commented Jun 3, 2024

@yarikoptic I finally got a run that didn't time out by using a 44 MB asset from Dandiset 000005. Here's the output, converted to a table:

Mount Type Dandiset Asset Test Time (s)
fusefs 000005 sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb pynwb_open_load_ns 11.972222546115518
fusefs 000005 sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb matnwb_nwbRead 447.08360775373876
fusefs 000005 sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb dandi_ls 21.88728654384613
davfs2 000005 sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb pynwb_open_load_ns 4.709459913894534
davfs2 000005 sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb matnwb_nwbRead 18.62446365132928
davfs2 000005 sub-anm236462/sub-anm236462_ses-20140210_behavior+icephys.nwb dandi_ls 3.7525609582662582

@jwodder jwodder marked this pull request as ready for review June 3, 2024 20:07
@yarikoptic
Copy link
Member

So davfs2 is much more promising. What are timings for datalad-fuse for the same file? (since that is what we use ATM)

@jwodder
Copy link
Member Author

jwodder commented Jun 3, 2024

@yarikoptic Those are the "fusefs" entries.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall -- great, thank you!

Let's just avoid using pre-defined paths under /tmp/

@yarikoptic yarikoptic merged commit c51d691 into main Jun 11, 2024
16 checks passed
@yarikoptic yarikoptic deleted the gh-66 branch June 11, 2024 17:09
@jwodder jwodder added the asset access How the program gets the assets and lists thereof label Jul 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
asset access How the program gets the assets and lists thereof enhancement New feature or request therefor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compare performance of webdav+davfs2, webdav+webdavfs, and datalad-fuse
3 participants