Throughout the workflow, it would be helpful to integrate logged checks for files that error at each handoff: from input through staging, from staging through merging (if using the Ray workflow), and from merging through rasterization. Kastan noted that with the Ray workflow, approximately 2,000 of approximately 8 million files failed to process correctly.
To start, we should implement a minimum viable product (MVP) that compares the filepaths at the conclusion of the staging step with the filepaths at the conclusion of the rasterization step. The initial list of filepaths can be pulled from the `staged` directory or from the filepaths in `staging_summary.csv`.
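A minimal sketch of the MVP comparison could look like the following. It assumes that each staged file and its rasterized counterpart share the same relative path apart from the file extension, which is not confirmed here. The directory layout, the `.gpkg` and `.tif` suffixes, and the helper names `relative_stems` and `find_missing` are all hypothetical.

```python
from pathlib import Path


def relative_stems(root, suffix):
    """Return relative paths (extension removed) of files under root with the given suffix."""
    root = Path(root)
    return {
        p.relative_to(root).with_suffix("").as_posix()
        for p in root.rglob(f"*{suffix}")
        if p.is_file()
    }


def find_missing(staged_dir, raster_dir, staged_suffix=".gpkg", raster_suffix=".tif"):
    """List staged files (as relative stems) that have no rasterized counterpart."""
    # Assumption: staged and rasterized outputs mirror each other's directory layout.
    staged = relative_stems(staged_dir, staged_suffix)
    rasterized = relative_stems(raster_dir, raster_suffix)
    return sorted(staged - rasterized)
```

The staged side of the comparison could instead be built from the filepaths in `staging_summary.csv`; the column to read is left out here because its name is not given above.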
Ideally, we will eventually implement more rigorous checks for all polygon vectors that are present in input files that error during staging and are therefore not fed into the merging or rasterization steps. However, this is more complex than the MVP, considering the following:
- deduplication of polygons can occur during staging, rasterization, web tiling, etc. based on the user's config
- polygons that cross tile boundaries are documented in all tiles in which a portion of the polygon is present
- the input files may contain no polygons whatsoever
- during staging, the data are in the form of polygons, but after rasterization, they are in the form of grid cells, which makes it impossible to track the files via file sizes
- the CRS may need to be converted during staging, which changes polygon attributes such as area
Robyn noted that an expansion beyond the MVP might be best executed by using the footprints to determine if we expect a polygon within each file's bounds, and comparing the footprints to their respective processed files using an overlay method from geopandas.
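One way the footprint comparison might be sketched, assuming each footprint row carries an identifier for its input file and that the footprints and processed polygons are already in the same CRS. The `file_id` column, the `geometry` column name, and the `footprints_without_output` helper are hypothetical; `geopandas.overlay` with `how="intersection"` is the only library call relied on.

```python
import geopandas as gpd
from shapely.geometry import box


def footprints_without_output(footprints, processed, id_col="file_id"):
    """Return ids of footprints that no processed polygon intersects."""
    # Assumption: both GeoDataFrames share a CRS and use a "geometry" column.
    hits = gpd.overlay(
        footprints[[id_col, "geometry"]],
        processed[["geometry"]],
        how="intersection",
    )
    covered = set(hits[id_col])
    return [fid for fid in footprints[id_col] if fid not in covered]
```

A footprint flagged here is only a candidate for review, since an input file may legitimately contain no polygons.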
robyngit changed the title from "Integrate check for file processing errors from staging through rasterization" to "Implement system to track successes & errors at each stage of workflow, at the file & tile level" on Dec 13, 2024
Instead of relying on logging, we might want to implement a system that records the status of each file and all its tiles through workflow stages (input, staging, merging, rasterization). It should automatically summarize successes and failures.
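As a rough illustration of what such a status record might look like, here is a small in-memory sketch. The `StatusTracker` class, its method names, and the tile identifier format are hypothetical; a real implementation would likely persist records to disk or a database so that parallel workers can write to it.

```python
from collections import Counter, defaultdict

# Stage names taken from the workflow description above.
STAGES = ("input", "staging", "merging", "rasterization")


class StatusTracker:
    """Record success or failure per (file, tile, stage) and summarize the results."""

    def __init__(self):
        # {file_path: {(tile, stage): (ok, message)}}
        self.records = defaultdict(dict)

    def record(self, file_path, stage, ok, tile=None, message=""):
        """Store the outcome of one stage for a file, optionally for one tile."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.records[file_path][(tile, stage)] = (ok, message)

    def summary(self):
        """Count successes and failures per stage."""
        counts = {stage: Counter() for stage in STAGES}
        for entries in self.records.values():
            for (_tile, stage), (ok, _msg) in entries.items():
                counts[stage]["success" if ok else "failure"] += 1
        return counts

    def failures(self):
        """Return (file, tile, stage, message) for every failed record, in insertion order."""
        return [
            (file_path, tile, stage, msg)
            for file_path, entries in self.records.items()
            for (tile, stage), (ok, msg) in entries.items()
            if not ok
        ]
```

Each workflow step would call `record(...)` once per file or tile it handles, and `summary()` and `failures()` would be called at the end of the run.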