SPR1-3159: Estimate size of MS produced by mvftoms #382
Let’s work through an example.
Our basic parameters are:

- n_dumps = 22
- n_chans = 4096
- n_corrprods = 544
- n_pols = 4

This results in 22 dumps * 544 corrprods / 4 pols = 22 * 136 = 2992 rows in the MS ✅

The basic block size / tile on disk seems to be 2 ** 21 = 2097152 bytes. All sizes of the main table.fX_TSM0 files are multiples of such block sizes.
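The row arithmetic above can be checked directly (a trivial sketch of the bookkeeping):

```python
# Rows in the main MS table: one row per (dump, baseline) pair;
# the 4 polarisation products become cells within a row.
n_dumps, n_corrprods, n_pols = 22, 544, 4
n_baselines = n_corrprods // n_pols  # 544 / 4 = 136
n_rows = n_dumps * n_baselines
print(n_rows)  # 2992
```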
This suggests the following formula for the main payload in the MS:

import math

# The main table.fX_TSM0 files have sizes that are multiples of these
# block sizes ("bucket" sizes in CASA Tiled Storage Manager speak?).
# XXX This is empirically determined so far, maybe cross-check with the code.
# See the `casacore.tables.table.getdminfo` output and look for `BucketSize`.
BIG_BLOCK = 2 ** 21
SMALL_BLOCK = 2 ** 18


def col_size(n_cells, bits, block_size=BIG_BLOCK):
    """Round `n_cells` cells of `bits` bits each up to the next block size."""
    return math.ceil(n_cells * bits / 8 / block_size) * block_size


def estimate_ms_size(n_dumps, n_chans, n_corrprods, n_pols=4):
    """Estimate MS size from basic MVF4 parameters."""
    n_baselines = n_corrprods // 4
    n_rows = n_dumps * n_baselines
    n_cells = n_rows * n_pols
    n_cells_per_spectrum = n_cells * n_chans
    # Start with table.f0_TSM0 (not sure what that is, an index?)
    size = 12 * SMALL_BLOCK
    # DATA: complex64
    size += col_size(n_cells_per_spectrum, bits=64)
    # WEIGHT_SPECTRUM: float32
    size += col_size(n_cells_per_spectrum, bits=32)
    # SIGMA_SPECTRUM: float32
    size += col_size(n_cells_per_spectrum, bits=32)
    # FLAG: bit
    size += col_size(n_cells_per_spectrum, bits=1, block_size=SMALL_BLOCK)
    # FLAG_CATEGORY: bit
    size += col_size(n_cells_per_spectrum, bits=1, block_size=SMALL_BLOCK)
    # WEIGHT: float32
    size += col_size(n_cells, bits=32)
    # SIGMA: float32
    size += col_size(n_cells, bits=32)
    return size

Try it out on the small dataset:

In [10]: estimate_ms_size(22, 4096, 544)
Out[10]: 806354944

In [11]: !du -sb ms/test/
806914372	ms/test/

In [17]: 806354944 / 806914372
Out[17]: 0.9993067071062157

This now underestimates the size by 0.07%. Much better!
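As a sanity check on the rounding, the DATA column of the small dataset happens to land exactly on a whole number of big blocks. A self-contained restatement of `col_size` from the snippet above:

```python
import math

BIG_BLOCK = 2 ** 21  # 2097152-byte bucket

def col_size(n_cells, bits, block_size=BIG_BLOCK):
    """Round `n_cells` cells of `bits` bits each up to a whole number of blocks."""
    return math.ceil(n_cells * bits / 8 / block_size) * block_size

# DATA column of the small dataset: 2992 rows * 4 pols * 4096 chans complex64 cells
n_cells = 2992 * 4 * 4096
print(col_size(n_cells, bits=64))  # 392167424 bytes = exactly 187 * BIG_BLOCK
```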
Test this idea on the dataset in the original query: 1703007682. Looks promising!

Out of interest, the corresponding MVF4 size is 3319 * 4096 * 8064 * 10 / 1e12 = 1.096 TB. MS is not that efficient, probably because it stores both SIGMA_SPECTRUM and the redundant WEIGHT_SPECTRUM; without the latter, the MS size could have been 1.343 TB. The basic number of bytes per element differs like this:
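The per-element comparison can be reverse-engineered from the totals quoted above. A sketch, assuming MVF4 packs each element as an 8-byte complex visibility plus a 1-byte weight and a 1-byte flag, while the MS columns use the dtypes from `estimate_ms_size` (block rounding ignored):

```python
# Approximate bytes per visibility element (assumptions in the lead-in).
mvf4 = 8 + 1 + 1                     # complex64 vis + uint8 weight + uint8 flag = 10
# MS: DATA (complex64) + WEIGHT_SPECTRUM + SIGMA_SPECTRUM (float32 each)
# + FLAG + FLAG_CATEGORY (1 bit each)
ms = 8 + 4 + 4 + 1 / 8 + 1 / 8       # = 16.25
ms_no_ws = ms - 4                    # = 12.25 without the redundant WEIGHT_SPECTRUM

n_elements = 3319 * 4096 * 8064      # dumps * channels * corrprods
print(n_elements * mvf4 / 1e12)      # ~1.096 TB, matching the MVF4 size above
print(n_elements * ms_no_ws / 1e12)  # ~1.343 TB, matching the quoted no-WS MS size
```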
The advantage of building this estimate into … My suggestion is an …
Thanks @ludwigschwardt. I'm not entirely following the logic above, but I hazily understand that the size on disk is less than the archive website's estimate. I downloaded the .rdb file locally (instead of working from the remote at …). The command above took several minutes to run on my local machine, which is too long to do on the fly depending on a user's flag choice. Would it be possible to speed this up a lot (to a few seconds)? E.g. if the data is coming from somewhere else, mount it locally, avoid any file writes, and apply any other optimizations. Or otherwise, is it possible to improve our estimates in a meaningful way?
The RDB file is only the metadata (3 MB). The bulk of the data (755 MB) lives in Ceph as chunks / objects. The additional data is the data. 😊 You cannot run … Yes, the estimations could be improved by incorporating the …
How can I tell the number of dumps unless I create the MS? (That's the reason I thought you had to run mvftoms before estimating the size.) I see your point about how pointless that would be...
This is a dump of SPR1-3159 to facilitate external collaboration...
The calculation used in the archive web interface seems to overestimate the MS size by about 25%, which makes users worry that some of their data is missing:
This originates from katsdparchive in @ctgschollar's domain.
The last person who worked on this was Kgomotso (2 years ago), but he is not around anymore. It’s been in production for at least the past year.