
SPR1-3159: Estimate size of MS produced by mvftoms #382

Open
ludwigschwardt opened this issue Nov 7, 2024 · 7 comments

@ludwigschwardt
Contributor

ludwigschwardt commented Nov 7, 2024

This is a dump of SPR1-3159 to facilitate external collaboration...

The calculation used in the archive web interface seems to overestimate the MS size by about 25%, which makes users worry that some of their data is missing:

[Screenshot 2024-10-23 at 16:20:07: archive web interface showing the estimated MS size]

This originates from katsdparchive in @ctgschollar's domain.

The last person who worked on this was Kgomotso (2 years ago) but he is not around anymore. It’s been in production for the past year at least.

@ludwigschwardt
Contributor Author

ludwigschwardt commented Nov 7, 2024

Let’s work through an example.

  • Pick a small dataset (1730279709, MVF4 size after mvf_download = 822 MB).
  • Run
mvftoms.py mvf4/1730279709/1730279709_sdp_l0.full.rdb -f --flags=cam,data_lost,ingest_rfi -o ms/test
  • Look at the directory created:
ls -la ms/test/

total 787732
drwxrwxr-x 15 kat kat       700 Oct 30 10:20 ./
drwxrwxr-x  3 kat kat        60 Oct 30 10:20 ../
drwxrwxr-x  2 kat kat       120 Oct 30 10:20 ANTENNA/
drwxrwxr-x  2 kat kat       120 Oct 30 10:20 DATA_DESCRIPTION/
drwxrwxr-x  2 kat kat       140 Oct 30 10:20 FEED/
drwxrwxr-x  2 kat kat       140 Oct 30 10:20 FIELD/
drwxrwxr-x  2 kat kat       120 Oct 30 10:20 FLAG_CMD/
drwxrwxr-x  2 kat kat       120 Oct 30 10:20 HISTORY/
drwxrwxr-x  2 kat kat       120 Oct 30 10:20 OBSERVATION/
drwxrwxr-x  2 kat kat       160 Oct 30 10:20 POINTING/
drwxrwxr-x  2 kat kat       140 Oct 30 10:20 POLARIZATION/
drwxrwxr-x  2 kat kat       120 Oct 30 10:20 PROCESSOR/
drwxrwxr-x  2 kat kat       140 Oct 30 10:20 SOURCE/
drwxrwxr-x  2 kat kat       140 Oct 30 10:20 SPECTRAL_WINDOW/
drwxrwxr-x  2 kat kat       120 Oct 30 10:20 STATE/
-rw-rw-r--  1 kat kat      8125 Oct 30 10:20 table.dat
-rw-rw-r--  1 kat kat       271 Oct 30 12:23 table.f0
-rw-rw-r--  1 kat kat   3145728 Oct 30 12:23 table.f0_TSM0
-rw-rw-r--  1 kat kat       284 Oct 30 12:23 table.f1
-rw-rw-r--  1 kat kat   6291456 Oct 30 12:23 table.f1_TSM0
-rw-rw-r--  1 kat kat       304 Oct 30 12:23 table.f2
-rw-rw-r--  1 kat kat   6291456 Oct 30 12:23 table.f2_TSM0
-rw-rw-r--  1 kat kat       274 Oct 30 12:23 table.f3
-rw-rw-r--  1 kat kat   2097152 Oct 30 12:23 table.f3_TSM0
-rw-rw-r--  1 kat kat       273 Oct 30 12:23 table.f4
-rw-rw-r--  1 kat kat   2097152 Oct 30 12:23 table.f4_TSM0
-rw-rw-r--  1 kat kat    231932 Oct 30 12:23 table.f5
-rw-rw-r--  1 kat kat       284 Oct 30 12:23 table.f6
-rw-rw-r--  1 kat kat 392167424 Oct 30 12:23 table.f6_TSM0
-rw-rw-r--  1 kat kat       294 Oct 30 12:23 table.f7
-rw-rw-r--  1 kat kat 197132288 Oct 30 12:23 table.f7_TSM0
-rw-rw-r--  1 kat kat       293 Oct 30 12:23 table.f8
-rw-rw-r--  1 kat kat 197132288 Oct 30 12:23 table.f8_TSM0
-rw-rw-r--  1 kat kat       104 Oct 30 10:20 table.info
-rw-rw-r--  1 kat kat       357 Oct 30 12:23 table.lock

du -sb ms/test/
806914372	ms/test/

Our basic parameters are:

  • 22 dumps selected (scan 1, 15 dumps + scan 4, 7 dumps)
  • 4096 channels
  • 544 correlation products (136 baselines after dividing by the 4 polarisation terms)

This results in 22 dumps * 544 corrprods / 4 pols = 22 * 136 = 2992 rows in the MS ✅
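The row count can be cross-checked with a quick calculation (a sketch using the numbers from this dataset):

```python
# Numbers from the small dataset 1730279709 (see the bullets above)
n_dumps = 22       # scan 1 (15 dumps) + scan 4 (7 dumps)
n_corrprods = 544  # correlation products
n_pols = 4         # polarisation terms per baseline

n_baselines = n_corrprods // n_pols  # 136
n_rows = n_dumps * n_baselines       # rows in the main MS table
print(n_rows)  # → 2992
```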

The basic block size / tile on disk seems to be 2 ** 21 = 2097152 bytes. All table.fX_TSM0 file sizes are multiples of this (except the index table.f0_TSM0). In particular, it explains why WEIGHT_SPECTRUM is slightly larger than its raw payload even though that payload is an exact multiple of 2 ** 20 bytes.

The table.fX_TSM0 files have the following associations (run strings on the corresponding table.fX file to see the column name):

  • table.f0: 3145728 = 1.5 tiles → some index
  • table.f1: 6291456 = 3 tiles → FLAG
  • table.f2: 6291456 = 3 tiles → FLAG_CATEGORY
  • table.f3: 2097152 = 1 tile → WEIGHT
  • table.f4: 2097152 = 1 tile → SIGMA
  • table.f6: 392167424 = 187 tiles → DATA
  • table.f7: 197132288 = 94 tiles → WEIGHT_SPECTRUM
  • table.f8: 197132288 = 94 tiles → SIGMA_SPECTRUM

@ludwigschwardt
Contributor Author

ludwigschwardt commented Nov 7, 2024

This suggests the following formula for the main payload in the MS:

import math

# The main table.fX_TSM0 files have sizes that are multiples of these
# block sizes ("bucket" sizes in CASA Tiled Storage Manager speak?).
# XXX This is empirically determined so far; maybe cross-check with code.
# See the output of `casacore.tables.table.getdminfo` and look for `BucketSize`.
BIG_BLOCK = 2 ** 21
SMALL_BLOCK = 2 ** 18


def col_size(n_cells, bits, block_size=BIG_BLOCK):
    """Round `n_cells` of `bits` bits up to next block size."""
    return math.ceil(n_cells * bits / 8 / block_size) * block_size


def estimate_ms_size(n_dumps, n_chans, n_corrprods, n_pols=4):
    """Estimate MS size from basic MVF4 parameters."""
    n_baselines = n_corrprods // n_pols
    n_rows = n_dumps * n_baselines
    n_cells = n_rows * n_pols
    n_cells_per_spectrum = n_cells * n_chans
    # Start with table.f0_TSM0 (not sure what that is, an index?)
    size = 12 * SMALL_BLOCK
    # DATA: complex64
    size += col_size(n_cells_per_spectrum, bits=64)
    # WEIGHT_SPECTRUM: float32
    size += col_size(n_cells_per_spectrum, bits=32)
    # SIGMA_SPECTRUM: float32
    size += col_size(n_cells_per_spectrum, bits=32)
    # FLAG: bit
    size += col_size(n_cells_per_spectrum, bits=1, block_size=SMALL_BLOCK)
    # FLAG_CATEGORY: bit
    size += col_size(n_cells_per_spectrum, bits=1, block_size=SMALL_BLOCK)
    # WEIGHT: float32
    size += col_size(n_cells, bits=32)
    # SIGMA: float32
    size += col_size(n_cells, bits=32)
    return size

Try it out on the small dataset:

In [10]: estimate_ms_size(22, 4096, 544)
Out[10]: 806354944

In [11]: !du -sb ms/test/
806914372	ms/test/

In [17]: 806354944 / 806914372
Out[17]: 0.9993067071062157

This now underestimates the size by 0.07%. Much better!

@ludwigschwardt
Contributor Author

ludwigschwardt commented Nov 7, 2024

Test this idea on the dataset in the original query: 1703007682.

  • d.shape = (3545, 32768, 8064) = 7.494 TB
  • Select scans="track" → 109 scans, 3319 dumps
  • Average channels by factor of 8 → 4096 channels afterwards
  • Run estimate_ms_size(3319, 4096, 8064) → 1.782 TB

Looks promising!

Out of interest, the corresponding MVF4 size is 3319 * 4096 * 8064 * 10 / 1e12 = 1.096 TB. MS is not that efficient, probably because it stores both SIGMA_SPECTRUM and the redundant WEIGHT_SPECTRUM. Without the latter the MS size could have been 1.343 TB.

The basic number of bytes per element differs like this:

  • MKAT: complex64 vis + byte weights + byte flags = 8 + 1 + 1 = 10 bytes
  • MS: complex64 vis + float32 weights + float32 sigma + 2/8 flags = 8 + 4 + 4 + 0.25 = 16.25 bytes
  • more efficient MS: complex64 vis + float32 sigma + 2/8 flags = 8 + 4 + 0.25 = 12.25 bytes
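The per-element byte counts translate into the quoted totals like this (a quick cross-check; the raw 16.25-byte figure gives 1.781 TB, while the estimate_ms_size result of 1.782 TB also includes tile padding):

```python
# dumps * channels * corrprods for dataset 1703007682 after selection
n_elements = 3319 * 4096 * 8064

bytes_per_element = {
    "MVF4": 8 + 1 + 1,             # complex64 vis + byte weights + byte flags
    "MS": 8 + 4 + 4 + 0.25,        # vis + weights + sigma + 2 bit-flag columns
    "efficient MS": 8 + 4 + 0.25,  # drop the redundant WEIGHT_SPECTRUM
}
for fmt, nbytes in bytes_per_element.items():
    print(f"{fmt}: {n_elements * nbytes / 1e12:.3f} TB")
```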

@ludwigschwardt
Contributor Author

ludwigschwardt commented Nov 7, 2024

The advantage of building this estimate into mvftoms.py as opposed to katsdparchive is that mvftoms.py knows exactly how many dumps, channels and corrprods it will produce based on the selections, unlike the archive software that will need to estimate it.

My suggestion is an estimate_ms_size utility function called from mvftoms.py so that it incorporates the effects of all the script options. We can add an option like --estimate-size to print out the estimated number of bytes. During normal use the script can also print out the estimate, and then determine the size afterwards and report the discrepancy. This will help us a lot with fine-tuning, effectively making every dataset a test case.
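The post-run comparison could look something like this (a sketch only; dir_size_bytes is a hypothetical helper approximating du -sb file totals, and the hook into mvftoms.py is merely the suggestion above, not existing code):

```python
import os


def dir_size_bytes(path):
    """Sum apparent file sizes under `path`, roughly like `du -sb`
    (du also counts directory entries, so it reports slightly more)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Sketch of the reporting step at the end of mvftoms.py:
# estimate = estimate_ms_size(n_dumps, n_chans, n_corrprods)
# actual = dir_size_bytes(ms_path)
# print(f"Estimated {estimate} bytes, wrote {actual} bytes "
#       f"({100 * (estimate - actual) / actual:+.2f}% discrepancy)")
```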

@ZachSARAO

ZachSARAO commented Nov 8, 2024

Thanks @ludwigschwardt. I'm not following the logic above entirely. But I hazily understand that the size on disk is less than estimated by the archive website.

I downloaded the .rdb file locally (instead of working from the remote at archive-gw-1.kat.ac.za) and I see that the 3.1 MB input was expanded to a 755 MB directory. What additional data is added / where does it come from?

The command above took several minutes to run on my local machine, which is too long to do on the fly depending on a user's flag choices.

Would it be possible to speed this up a lot (to a few seconds)? I.e. if the data is coming from somewhere else, mount it locally, avoid any file writes, plus any other optimizations.

Or otherwise... is it possible to improve our estimations in a meaningful way?

@ludwigschwardt
Contributor Author

The RDB file is only the metadata (3 MB). The bulk of the data (755 MB) lives in Ceph as chunks / objects. The additional data is the data. 😊

You cannot run mvftoms.py to estimate its size... That's kinda pointless. I ran it to get the correct size in my example. If you could speed it up, we'd be golden 😁

Yes, the estimations could be improved by incorporating the estimate_ms_size function inside mvftoms.py and running only that when passed an --estimate-size option.

@ZachSARAO

How can I tell the number of dumps unless I create the MS? (That's the reason I thought you had to run mvftoms before estimating the size.) I see your point about how pointless that would be...
