Integrate recovered footprints into the IWP workflow on delta #12
Comments
@robyngit in the above comment, you say:
Does that mean that within the new py script I'm writing, the filepath called |
Actually, I believe I misinterpreted. I think what you meant is that in the new script, we should be pulling the "original" footprint shapefiles from the same location |
@julietcohen - you should copy the footprints from
Hopefully the footprints Elias sent you have a Name or ID property that you can use to match them to the correct shapefile, like I do in that |
Ok, thank you for clarifying, Robyn! Glad to know I don't have to re-process the shapefiles in Kastan's
Yes, the footprints Elias sent do have a
# Where is all the original input data, before any staging or processing
input_dir = Path('/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/')
And when the matching |
Copied the dir: |
@robyngit do the entire filepaths starting from |
@julietcohen No, the entire path does not have to match. It just needs to match relative to the
{
  "dir_footprints": "some/random/path/iwp_footprints/",
  "dir_input": "another/different/path/iwp_input_data/"
}
If the first sub directory under
In other words, if the workflow will process a shapefile found at |
Great, thanks! I'm making good progress on integrating the new footprints. |
Need clarification: why are there duplicate recovered footprints, and even more duplicate recovered footprint ID codes?
There is supposed to be one footprint file for each IWP shapefile. However, there are duplicate rows within the
count number of duplicate rows
import geopandas as gpd
# Read in the recovered footprints that we need to integrate with their geometries
recovered_fps = gpd.read_file('/u/julietcohen/integrate_rec_fp/recovered_footprints/recovered_footprints.shp')
# generate series of True or False values that represent if each row is a duplicate
boolean_dup_recovered_fps = recovered_fps.duplicated()
# loop through that boolean series,
# and separate True and False values into separate lists
true = []
false = []
for i in boolean_dup_recovered_fps:
    if i:
        true.append(i)
    else:
        false.append(i)
print(f'Within the geodataframe of recovered_footprints, {len(true)} are duplicates.')
Output:
There are even more duplicate footprint ID codes: 293 duplicates. That means that there are different geometries/dates that are identified as the same footprint ID code.
count number of duplicate footprint ID codes
import geopandas as gpd
# Read in the recovered footprints that we need to integrate with their geometries
recovered_fps = gpd.read_file('/u/julietcohen/integrate_rec_fp/recovered_footprints/recovered_footprints.shp')
# convert column of names of recovered footprints into a list without geometries
recovered_fps_names = recovered_fps['Name'].to_list()
len(recovered_fps_names) # 1242 total values
# check for duplicates within the footprint ID codes
recovered_fps_names_unique = list(set(recovered_fps_names))
len(recovered_fps_names_unique) # 949 unique values
These numbers, 1242 and 949, match up with the values I derived in the other footprints issue #13. Considering that the footprint ID codes only represent a portion of the IWP filename, does this mean they have already been processed and now it is harder to tell which footprint ID belongs to which IWP file? Looks like the scene IDs have already been fed into Robyn's function
def id_to_name(id):
    """Convert the scene ID from input filename to 'Name' attribute in footprint shapefiles"""
    return '_'.join(id.split('_')[:-2])
print example of a random footprint ID code
import geopandas as gpd
# Read in the recovered footprints that we need to integrate with their geometries
recovered_fps = gpd.read_file('/u/julietcohen/integrate_rec_fp/recovered_footprints/recovered_footprints.shp')
# example of a recovered footprint ID code
# convert column of names of recovered footprints into a list without date or geometries
recovered_fps_names = recovered_fps['Name'].to_list()
recovered_fps_names[0]
print example of a random IWP file basename
import os
from pathlib import Path
# Where is all the original input data, before any staging or processing
iwp_dir = Path('/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/')
# example of a filename of an IWP file, without leading filepath
# create list of all IWP files
input_file_list = sorted(iwp_dir.rglob('*.shp'))
iwp_filepath = str(input_file_list[0])
iwp_basename = os.path.basename(iwp_filepath).split('.')[0]
print(iwp_basename)
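The checks in this comment can be combined into one small sketch: derive each footprint 'Name' from an IWP file basename with `id_to_name`, group the IWP files by that name, and list the names that occur more than once. The file paths and names below are made up for illustration and are not taken from the real dataset; `group_paths_by_name` and `duplicated_names` are hypothetical helpers, and only `id_to_name` comes from the thread.

```python
from collections import Counter, defaultdict
from pathlib import Path


def id_to_name(id):
    """Convert the scene ID from input filename to 'Name' attribute in footprint shapefiles"""
    return '_'.join(id.split('_')[:-2])


def group_paths_by_name(paths):
    """Hypothetical helper: map each derived footprint name to the IWP files that share it."""
    groups = defaultdict(list)
    for p in paths:
        # Path.stem is the basename without the final extension
        groups[id_to_name(Path(p).stem)].append(Path(p))
    return dict(groups)


def duplicated_names(names):
    """Hypothetical helper: return {name: count} for names that occur more than once."""
    return {name: n for name, n in Counter(names).items() if n > 1}


# Made-up IWP shapefile paths: same basename in two different subdirectories
iwp_paths = [
    "high/alaska/batch_a/WV02_20100709_ABC123_u16rf3413_pansh.shp",
    "high/alaska/batch_b/WV02_20100709_ABC123_u16rf3413_pansh.shp",
    "high/canada/batch_c/WV03_20150801_DEF456_u16rf3413_pansh.shp",
]
# Made-up 'Name' values, standing in for recovered_fps['Name'].to_list()
footprint_names = [
    "WV02_20100709_ABC123",
    "WV02_20100709_ABC123",
    "WV03_20150801_DEF456",
]

groups = group_paths_by_name(iwp_paths)
for name, count in duplicated_names(footprint_names).items():
    print(name, "appears", count, "times and matches", len(groups.get(name, [])), "IWP files")
```

Grouping by derived name makes it easy to see whether a repeated footprint name corresponds to several IWP files with the same basename in different subdirectories.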
|
@julietcohen my first thought is that these might be cases where a footprint for a file is a non-continuous area. (i.e. the footprint is made of two or more polygons that do not touch). I suggest selecting one duplicated example, and plot the geometries along with the matching shapefile. |
IWP files that all share the same footprint ID that is one of the recovered footprints
Note that all 3 of these .shp file basenames are identical, but they differ in the subdirs of
The associated recovered footprint is found at both index 2 and index 117 of the
The dates and geometries of these recovered footprints are also identical:
str(fp_of_interest['geometry'].iloc[0]) == str(fp_of_interest['geometry'].iloc[1])
Output:
I'm confused about the Python syntax for plotting these IWP shapefiles on top of the polygon geometry for this footprint. I am able to yield this, which is not helpful for our purposes:
I'll keep trying with |
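One possible way to overlay the two layers with geopandas is sketched below. It assumes both inputs are already loaded as GeoDataFrames; `plot_footprint_over_iwp` is a hypothetical helper, not part of the workflow, and the colours and figure size are arbitrary choices. The key point is that `GeoDataFrame.plot()` returns a matplotlib Axes, and passing that Axes to the second `.plot()` call draws both layers in the same frame.

```python
import geopandas as gpd


def plot_footprint_over_iwp(footprint_gdf, iwp_gdf):
    """Hypothetical helper: draw IWP polygons with the footprint outline on top."""
    # Reproject the footprint so both layers share the IWP coordinate system
    if footprint_gdf.crs is not None and iwp_gdf.crs is not None:
        footprint_gdf = footprint_gdf.to_crs(iwp_gdf.crs)
    # First layer: the IWP polygons; .plot() returns a matplotlib Axes
    ax = iwp_gdf.plot(color="steelblue", figsize=(8, 8))
    # Second layer: reuse the same Axes and draw only the footprint outline
    footprint_gdf.plot(ax=ax, facecolor="none", edgecolor="red", linewidth=2)
    return ax


# Usage sketch (paths and the name value are placeholders, not real files):
# recovered_fps = gpd.read_file("recovered_footprints.shp")
# fp_of_interest = recovered_fps[recovered_fps["Name"] == "some_footprint_name"]
# iwp = gpd.read_file("path/to/matching_iwp_file.shp")
# ax = plot_footprint_over_iwp(fp_of_interest, iwp)
# ax.figure.savefig("footprint_check.png")
```

Drawing the footprint with no fill keeps the IWP polygons visible underneath, which should make a non-contiguous or misplaced footprint easier to spot.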
Closing this issue because other team members are going to re-process the IWP files and re-structure footprints directory to match those new files. This issue is only applicable to the current IWP dataset, which will likely not be used in the future. |
We need to be able to reliably match footprints to shapefiles in order to deduplicate. The workflow is designed to look for footprints with the same name and same relative directory structure as their corresponding input shapefiles. This convention is illustrated and detailed in the docs.
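As a rough illustration of that convention, the sketch below computes where a footprint would be expected for a given input shapefile. The config keys `dir_input` and `dir_footprints` are the ones quoted in the comments on this issue; the paths are placeholders, `expected_footprint_path` is a hypothetical helper rather than a function from the workflow, and it assumes the footprint file keeps exactly the same file name and extension as the input file.

```python
from pathlib import Path

# Placeholder config; only the key names come from the thread
config = {
    "dir_footprints": "some/random/path/iwp_footprints/",
    "dir_input": "another/different/path/iwp_input_data/",
}


def expected_footprint_path(input_file, config):
    """Hypothetical helper: mirror an input file's path, relative to dir_input, under dir_footprints."""
    # Part of the path below dir_input, e.g. high/alaska/scene_a.shp
    relative = Path(input_file).relative_to(config["dir_input"])
    # Same relative path, rooted at dir_footprints instead
    return Path(config["dir_footprints"]) / relative


# Made-up input shapefile path for illustration
example_input = "another/different/path/iwp_input_data/high/alaska/scene_a.shp"
print(expected_footprint_path(example_input, config))
```

Only the part of the path below `dir_input` has to match, so the two root directories can live anywhere on disk.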
Elias recently recovered some missing footprints from the IWP workflow. He put all of these footprints into this single shapefile: 🗂 recovered_footprints.zip
We need to incorporate these footprints into our staged_footprints directory on delta. They need to be matched to their shapefiles so that we can create the correct filenames and directory hierarchy.
Previously, I restructured the footprint files provided by the IWP team using this script (also saved on delta: /u/thiessenbock/git/debug-footprints/prepare-footprints.py): prepare-footprints.py
The recovered footprints file structure differs from the original "footprints_new" file structure, so we can't use the same matching pattern that we do in prepare-footprints.py. We will need to write a new script.
The footprints we use for the workflow now are saved to /scratch/bbki/thiessenbock/pdg/staged_footprints and also copied to a directory in /scratch/bbki/kastanday/.
Related to PermafrostDiscoveryGateway/pdg-portal#24