Memory flooding #106
Yeah you killed it... I rebooted, no worries. Put "ulimit -m ..." in your .bashrc so this doesn't happen again. This puts a memory limit on your jobs so they get killed before they cramp up the system.
@bngcebetsha thanks for reporting. This is a known issue with the dask collections interface and the whole reason why I created the …
I should probably set up the dask cluster with a memory limit but I've found that once it starts spilling to disk you might as well kill the run (at least it used to be the case). I probably need to look at this again. In the meantime, I can set the following at startup … which should raise a memory error if the total memory is exceeded. I'm just not sure if that may have unintended consequences. What do you reckon @o-smirnov?
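For reference, a minimal sketch of one way to enforce such a cap at process startup (an assumption for illustration, not necessarily what pfb-imaging actually sets):

```python
import resource

# Assumption: cap the address space at ~400 GiB on a 500 GiB node so that
# allocations beyond the budget raise MemoryError instead of freezing the host.
limit_bytes = 400 * 1024**3
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```

The caveat above still applies: a hard cap like this aborts the run outright rather than letting dask spill to disk.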
This has been done in the band_actors branch you are using @bngcebetsha. Please let me know if you run into issues with this.
Bizarre! I don't see an actual error but the amount of spam produced by dask makes it hard to be sure. Point me at the data and I will take a look. I see it only has 16 chunks to work with (are there only 2 scans in this MS?). I've never tried with fewer chunks than workers; it may be that they are timing out or something. Maybe try with …
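The exact suggestion was lost in this transcript; the general idea (an assumption) would be to keep the worker count at or below the number of chunks, for example when standing up a local dask cluster:

```python
from dask.distributed import Client, LocalCluster

# Assumption: with only 16 chunks, 16 or fewer workers avoids idle workers
# that have nothing to do and may appear to hang or time out.
cluster = LocalCluster(n_workers=8, threads_per_worker=2, memory_limit="24GB")
client = Client(cluster)
```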
Yes there are only 2 scans - I have split out the target field. The data is on …
@bngcebetsha I can reproduce your error. It looks like it is indeed because there are more workers than tasks (see the workaround suggested in #108) so this is a separate issue. Let's keep this one open until I have tested and merged my fix for the original memory issue you ran into. @Athanaseus note that the memory issues you were seeing should be fixed in the version of the …
This step is not fast when reading from a measurement set but I'm surprised at the memory footprint given that this is 1k data. I don't have access to young so I can't look at your data directly. You can reduce the memory footprint using the …
There are two scans spread over 875533 integrations (not evenly) - I tried to see how the two scans are distributed:
Oh, wow! Are you sure those are integrations and not rows? With an 8s integration time that implies scan 1 is about 1895 hours on source.
You are right, that's the number of rows (I assumed that equals the number of integrations).
I guess the correct number of integrations is total_time/8s -> 4338.16/8 -> 542.27. We are looking at 271 per scan.
Close enough. I'm surprised you are getting close to 500GB memory usage in that case. It could have something to do with the measurement set columns (e.g. if one is tiled funny or something) but I would need access to take a closer look. The workaround is still to set …
Thanks @landmanbester - I'll keep tweaking around, things seem to be improving a bit. The data looks like this:

2024-06-23 13:56:56 INFO listobs::ms::summary+ Observation: MeerKAT
2024-06-23 13:56:56 INFO listobs::MSMetaData::_computeScanAndSubScanProperties Computing scan and subscan properties...
2024-06-23 13:56:56 INFO listobs::ms::summary Data records: 875533 Total elapsed time = 4338.16 seconds
2024-06-23 13:56:56 INFO listobs::ms::summary+ Observed from 28-Apr-2019/16:03:41.9 to 28-Apr-2019/17:16:00.0 (UTC)
2024-06-23 13:56:56 INFO listobs::ms::summary
2024-06-23 13:56:56 INFO listobs::ms::summary+ ObservationID = 0 ArrayID = 0
2024-06-23 13:56:56 INFO listobs::ms::summary+ Date Timerange (UTC) Scan FldId FieldName nRows SpwIds Average Interval(s) ScanIntent
2024-06-23 13:56:56 INFO listobs::ms::summary+ 28-Apr-2019/16:03:41.9 - 17:03:42.4 1 0 J1429-6240T 852841 [0] [7.98] [TARGET]
2024-06-23 13:56:56 INFO listobs::ms::summary+ 17:14:26.1 - 17:16:00.0 21 0 J1429-6240T 22692 [0] [7.68] [TARGET]
2024-06-23 13:56:56 INFO listobs::ms::summary (nRows = Total number of rows per scan)
2024-06-23 13:56:56 INFO listobs::ms::summary Fields: 1

I am currently getting this exception thrown:

kwargs: {'dc1': 'CORRECTED_DATA', 'dc2': 'MODEL_DATA', 'operator': '-', 'ds': <xarray.Dataset> Size: 2GB
Dimensions: (row: 189100, uvw: 3, chan: 128, corr: 4)
Coordinates:
ROWID (row) int32 756kB dask.array<chunksize=(189100,), meta=np.ndarray>
Dimensions without coordinates: row, uvw, chan, corr
Data variables:
ANTENNA1 (row) int32 756kB dask.array<chunksize=(189100,), meta=np.ndarray>
INTERVAL (row) float64 2MB dask.array<chunksize=(189100,), meta=np.ndarray>
UVW (row, uvw) float64 5MB dask.array<chunksize=(189100, 3), meta=np.ndarray>
TIME (row) float64 2MB dask.array<chunksize=(189100,), meta=np.ndarray>
FLAG (row, chan, corr) bool 97MB dask.array<chunksize=(189100, 128, 4), meta=np.ndarray>
FLAG_ROW (row) bool 189kB dask.array<chunksize=(189100,), meta=np.ndarray>
ANTENNA2 (row) int32 756kB dask.array<chunksize=(189100,), meta=np.ndarray>
WEIGHT_SPECTRUM (row, chan
Exception: 'AttributeError("\'UnicodeType\' object has no attribute \'literal_value\'")'

This might give more insight:

ovf_result = self._overload_func(*args, **kws)
File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/weighting.py", line 271, in nb_weight_data_impl
vis_func, wgt_func = stokes_funcs(data, jones, product, pol, nc)
File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/stokes.py", line 41, in stokes_funcs
if pol.literal_value == 'linear':
AttributeError: 'UnicodeType' object has no attribute 'literal_value'

I tried making my ms as small as possible to at least get something working - it finally threw the above and bailed.
That exception must be coming from numexpr but I have not seen it before. Let me try to reproduce once I have access to the data.
I would be curious to find out if the change I made kills the program automatically when it hits max memory. I'm surprised by this but you can reduce the memory footprint further by using fewer workers or smaller time chunks.
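The datasets in the traceback above look like dask-ms output; if that is the case, a sketch of what smaller row chunks would look like when opening the MS (the column list and chunk size here are assumptions for illustration):

```python
from daskms import xds_from_ms

# Assumption: smaller row chunks keep fewer visibilities in memory per task,
# at the cost of more (smaller) dask tasks.
datasets = xds_from_ms(
    "1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms",
    columns=["CORRECTED_DATA", "WEIGHT_SPECTRUM", "FLAG"],
    chunks={"row": 50_000},
)
```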
ovf_result = self._overload_func(*args, **kws)
File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/weighting.py", line 271, in nb_weight_data_impl
vis_func, wgt_func = stokes_funcs(data, jones, product, pol, nc)
File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/stokes.py", line 41, in stokes_funcs
if pol.literal_value == 'linear':
AttributeError: 'UnicodeType' object has no attribute 'literal_value'

That looks like a numba error, I imagine it's not getting called with …
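For context, a minimal illustration (not the pfb code itself) of the distinction numba is tripping over: only a string that reaches the typing code as a compile-time literal carries .literal_value.

```python
from numba import types

pol_literal = types.literal("linear")   # StringLiteral: has .literal_value
pol_runtime = types.unicode_type        # plain UnicodeType, as in the traceback

print(pol_literal.literal_value)              # -> 'linear'
print(hasattr(pol_runtime, "literal_value"))  # -> False, hence the AttributeError
```

So the fix presumably lies in making sure the pol/product argument arrives at stokes_funcs as a literal rather than a runtime string.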
Is this in the error.md file? Not sure how I missed that.
Bottom of this comment: #106 (comment)
Alright. I think I may have an answer. Looking at the tiling we see that the frequency axis of the CORRECTED_DATA column essentially isn't tiled whereas QC_CORRECTED_DATA is tiled. If I run your command with QC_CORRECTED_DATA, i.e.

pfb init --ms /net/young/home/bngcebetsha/cal_quartical/1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms -ipi -1 -cpi 128 -o output --data-column 'QC_CORRECTED_DATA' --weight-column WEIGHT_SPECTRUM -nw 16 --bda-decorr 0.98 -ldir pfb-logs --overwrite

it runs through in just under 2 mins with the expected memory usage.
So I think the problem lies in the tiling of the MS columns. Note that QC_CORRECTED_DATA was created using …
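A quick way to compare how the two columns are tiled (a sketch using python-casacore; the path is taken from the command above):

```python
from casacore.tables import table

ms = "/net/young/home/bngcebetsha/cal_quartical/1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms"
t = table(ms, ack=False)
for col in ("CORRECTED_DATA", "QC_CORRECTED_DATA"):
    # getdminfo() reports the data manager for the column, including the
    # tile shape for tiled storage managers.
    print(col, t.getdminfo(col))
t.close()
```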
Hold on. This is unrelated and probably has something to do with your installation. Please open a separate issue if you still see this after creating a fresh virtualenv and updating pip, setuptools and wheel before installing pfb.
Interesting - the … The ms I used there became corrupted when I tried to continue work on the Rhodes node. So I split the target field again and started at the …
Ok, so there are a number of options available to you. The easiest is just to do this step in serial (i.e. …) … before installing pfb.
… worked, and as you suggested I opened a new issue for that: #109. I might need to close THIS issue soon.
Leave it open please. I will close it once I've confirmed a few things.
fixed in #111 |
I was running these steps:

…

The first step ran indefinitely, so I ran htop to see the memory footprint. On the 500G machine (young.ru), the processes I started were approaching 100% consumption. I issued a Ctrl-C and waited, but the memory remained at 100% and the terminal was unresponsive. I can no longer log back in - I do hope this is something that can be fixed with a simple reboot by Sys Admin.