
Integrate recovered footprints into the IWP workflow on delta #12

Closed · robyngit opened this issue Dec 8, 2022 · 12 comments

robyngit (Member) commented Dec 8, 2022

We need to be able to reliably match footprints to shapefiles in order to deduplicate. The workflow is designed to look for footprints with the same name and same relative directory structure as their corresponding input shapefiles. This convention is illustrated and detailed in the docs.

Elias recently recovered some missing footprints from the IWP workflow. He put all of these footprints into a single shapefile: 🗂 recovered_footprints.zip

We need to incorporate these footprints into our staged_footprints directory on delta. They need to be matched to their shapefiles so that we can create the correct filenames and directory hierarchy.

Previously, I restructured the footprint files provided by the IWP team using this script (also saved on delta: /u/thiessenbock/git/debug-footprints/prepare-footprints.py):

prepare-footprints.py
from pathlib import Path
import json
import geopandas as gpd
import warnings

# Where is all the original input data, before any staging or processing
input_dir = Path('/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/')
# Where are the original footprints?
footprint_dir = '/scratch/bbki/kastanday/maple_data_xsede_bridges2/outputs/footprints/footprints_new'
# Where should we save the prepared footprints?
footprint_staged_dir = '/scratch/bbki/thiessenbock/pdg/staged_footprints'
# Where should we save the JSON list of input files missing footprints?
missing_footprints_path = '/scratch/bbki/thiessenbock/pdg/files-missing-footprints.json'

def get_fp_path(input_path):
    """Given the input path, get the staged footprint path"""
    # Remove the base directory from the input path
    input_path_rel = str(input_path).removeprefix(str(input_dir)).removeprefix('/')
    fp_path = Path(footprint_staged_dir).joinpath(input_path_rel).with_suffix('.gpkg')
    return fp_path

def dir_to_filename(dir):
    """Change the scene directory from input path to file name for footprints"""
    return 'selection_' + dir.removesuffix('_iwp') + '.shp'

def id_to_name(id):
    """Convert the scene ID from input filename to 'Name' attribute in footprint shapefiles"""
    return '_'.join(id.split('_')[:-2])

def contains_multipolygons(gdf):
    types = list(set(gdf.geom_type))
    return 'MultiPolygon' in types

# Find input files
print('Finding input files')
input_file_list = input_dir.rglob('*.shp')
print(f'Input files found')

inputs_missing_footprints = []
for i, path in enumerate(input_file_list):
    print(f'🏁 Looking for footprint for file {i+1}')
    path_parts = path.parts
    scene_id = path_parts[-2]
    scene_dir = path_parts[-3]
    region = path_parts[-4]
    # footprint files follow the format FOOTPRINT_DIR/REGION/selection_<SCENE_DIR without '_iwp'>.shp
    footprint_file = Path(footprint_dir).joinpath(region).joinpath(dir_to_filename(scene_dir))
    if not (footprint_file.exists()):
        print('🤷‍♀️ No footprint file found')
        inputs_missing_footprints.append(str(path))
    else:
        # each footprint shapefile contains footprints for multiple inputs
        all_fps = gpd.read_file(footprint_file)
        fp = all_fps[all_fps['Name'] == id_to_name(scene_id)]
        if len(fp) == 0:
            print('🤷‍♀️ No footprint ID found')
            inputs_missing_footprints.append(str(path))
        else:
            print('✅ Found a footprint')
            if contains_multipolygons(fp):
                fp = fp.explode()
            # Create the path where the footprint should be saved
            footprint_staged_path = get_fp_path(path)
            # Make any necessary parent directories
            footprint_staged_path.parents[0].mkdir(parents=True, exist_ok=True)
            # Save the single-polygon vector file to the filepath
            with warnings.catch_warnings():
                warnings.simplefilter('ignore', FutureWarning)
                fp.to_file(footprint_staged_path)

# Save a list of all the files that do not have footprints
with open(missing_footprints_path, 'w') as f:
    json.dump(inputs_missing_footprints, f)

The recovered footprints' file structure differs from the original "footprints_new" file structure, so we can't use the same matching pattern that we use in prepare-footprints.py. We will need to write a new script.

The footprints we use for the workflow now are saved to /scratch/bbki/thiessenbock/pdg/staged_footprints and also copied to a directory in /scratch/bbki/kastanday/.

Related to PermafrostDiscoveryGateway/pdg-portal#24

julietcohen (Collaborator) commented:

@robyngit in the above comment, you say:

The footprints we use for the workflow now are saved to /scratch/bbki/thiessenbock/pdg/staged_footprints and also copied to a directory in /scratch/bbki/kastanday/.

Does that mean that, in the new Python script I'm writing, the path called footprint_dir (where we pull the "original" footprints from) should now be /scratch/bbki/thiessenbock/pdg/staged_footprints, and, since I am going to take the lead on the workflow runs, that I should save the prepared footprints to my user dir (the path called footprint_staged_dir)?

julietcohen (Collaborator) commented Jan 10, 2023

Actually, I believe I misinterpreted. I think what you meant is that, in the new script, we should still pull the "original" footprint shapefiles from the same location, /scratch/bbki/kastanday/maple_data_xsede_bridges2/outputs/footprints/footprints_new (called footprint_dir), which is divided into alaska, canada, and russia subdirs and subdivided into selection shapefiles. What needs to change is the file hierarchy of the recovered_footprints, so that those shapefiles are also divided into alaska, canada, and russia subdirs and then subdivided into selection shapefiles. There is currently no file hierarchy in the file Elias provided: when I unzipped it, it was a single .shp file with all the accompanying files (such as .sbx, .dbf, etc.).

robyngit (Member, Author) commented:

@julietcohen - you should copy the footprints from /scratch/bbki/thiessenbock/pdg/staged_footprints to your own scratch directory, then add Elias's footprints to your staged_footprints directory with the appropriate file hierarchy. You can then use these newly merged footprints for the next IWP run.

Hopefully the footprints Elias sent you have a Name or ID property that you can use to match them to the correct shapefile, like I do in that prepare-footprints.py script. Once you match a footprint to a shapefile, you can create a new file with the same name and same directory structure as the shapefile. So one footprint == one file. This is what I am trying to illustrate here. Let me know if I should clarify this better!

julietcohen (Collaborator) commented:

Ok, thank you for clarifying, Robyn! Glad to know I don't have to re-process the shapefiles in Kastan's footprints_new dir.

Yes, the footprints Elias sent do have a Name attribute that I can use to match them to their shapefiles; each row of the geodataframe has a Name and a geometry attribute. So for each Name, I will search in the directory that contains all the input data for the workflow:

# Where is all the original input data, before any staging or processing
input_dir = Path('/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/')

And when the matching .shp file is found, I will create a new file in my directory that I copied from you, /scratch/bbou/julietcohen/staged_footprints, with the same base name but as a .gpkg, using the geometry from that row.
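
A rough, untested sketch of that matching step (it assumes the recovered footprints' Name values line up with id_to_name() applied to the scene directory name, and uses the paths mentioned in this thread):

from pathlib import Path
import geopandas as gpd

# Recovered footprints from Elias (a single shapefile, one row per footprint)
recovered_fps = gpd.read_file('/u/julietcohen/integrate_rec_fp/recovered_footprints/recovered_footprints.shp')

# Original IWP input data, before any staging or processing
input_dir = Path('/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/')
# My copy of the staged footprints, where the new .gpkg files should land
staged_fp_dir = Path('/scratch/bbou/julietcohen/staged_footprints')

def id_to_name(id):
    """Convert the scene ID from input filename to the 'Name' attribute in footprint shapefiles"""
    return '_'.join(id.split('_')[:-2])

for shp_path in input_dir.rglob('*.shp'):
    scene_id = shp_path.parts[-2]
    match = recovered_fps[recovered_fps['Name'] == id_to_name(scene_id)]
    if len(match) == 0:
        continue  # no recovered footprint for this input file
    # Mirror the input's relative path under staged_fp_dir, but as a .gpkg
    # (the original prepare-footprints.py also explodes MultiPolygons before saving)
    out_path = (staged_fp_dir / shp_path.relative_to(input_dir)).with_suffix('.gpkg')
    out_path.parent.mkdir(parents=True, exist_ok=True)
    match.to_file(out_path, driver='GPKG')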

julietcohen (Collaborator) commented:

Copied the dir /scratch/bbki/thiessenbock/staged_footprints/, not /scratch/bbki/thiessenbock/pdg/staged_footprints, because the staged_footprints dir was not within your pdg dir; that pdg dir seems to contain the staged output of past workflow runs.

julietcohen (Collaborator) commented:

@robyngit do the entire filepaths starting from /scratch need to match between the IWP .shp file and that file's .gpkg footprint, or just the part starting at the high_ice folder? Based on your illustration, I think it just needs to match starting at high_ice. So we would be able to keep the IWP shapefiles in Kastan's scratch dir in bbki while their footprints live in my scratch dir in bbou, as long as high_ice and everything downstream matches.

robyngit (Member, Author) commented:

@julietcohen No, the entire path does not have to match. It just needs to match relative to the dirs that you will eventually configure in the IWP config JSON. For example, let's say that you have these two paths set for dir_footprints and dir_input:

{
   "dir_footprints": "some/random/path/iwp_footprints/",
   "dir_input": "another/different/path/iwp_input_data/"
}

If the first sub directory under iwp_input_data is high_ice, then the first sub directory under iwp_footprints must also be high_ice.

In other words, if the workflow will process a shapefile found at another/different/path/iwp_input_data/high_ice/alaska/a/b/c.shp, then the matching footprint should be saved to: some/random/path/iwp_footprints/high_ice/alaska/a/b/c.gpkg
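
A minimal sketch of that rule in Python, using the hypothetical paths from the example above (the variable names input_shp and footprint_path are mine):

from pathlib import Path

# Hypothetical config values from the example above
dir_footprints = Path('some/random/path/iwp_footprints/')
dir_input = Path('another/different/path/iwp_input_data/')

input_shp = Path('another/different/path/iwp_input_data/high_ice/alaska/a/b/c.shp')

# The footprint keeps the same path relative to its base dir, but as a .gpkg
footprint_path = (dir_footprints / input_shp.relative_to(dir_input)).with_suffix('.gpkg')
print(footprint_path)  # some/random/path/iwp_footprints/high_ice/alaska/a/b/c.gpkg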

julietcohen (Collaborator) commented:

Great, thanks! I'm making good progress on integrating the new footprints.

julietcohen (Collaborator) commented Jan 11, 2023

Need clarification: why are there duplicate recovered footprints, and even more duplicate recovered footprint ID codes?

There is supposed to be one footprint file for each IWP shapefile. However, there are duplicate rows within the recovered_footprints geodataframe, which contains three columns: the footprint ID code (Name), the date, and the geometry.

count number of duplicate rows
import geopandas as gpd

# Read in the recovered footprints that we need to integrate with their geometries
recovered_fps = gpd.read_file('/u/julietcohen/integrate_rec_fp/recovered_footprints/recovered_footprints.shp')

# Boolean series: True where the entire row duplicates an earlier row
boolean_dup_recovered_fps = recovered_fps.duplicated()

# Count the rows flagged as duplicates
num_dups = boolean_dup_recovered_fps.sum()

print(f'Within the geodataframe of recovered_footprints, {num_dups} are duplicates.')

Output: Within the geodataframe of recovered_footprints, 23 are duplicates.

There are even more duplicate footprint ID codes: 293 of the Name values are duplicates. That means that some rows with different geometries/dates are identified by the same footprint ID code.

count number of duplicate footprint ID codes
import geopandas as gpd

# Read in the recovered footprints that we need to integrate with their geometries
recovered_fps = gpd.read_file('/u/julietcohen/integrate_rec_fp/recovered_footprints/recovered_footprints.shp')

# convert column of names of recovered footprints into a list without geometries
recovered_fps_names = recovered_fps['Name'].to_list()

len(recovered_fps_names) # 1242 total values

# check for duplicates within the footprint ID codes
recovered_fps_names_unique = list(set(recovered_fps_names))
len(recovered_fps_names_unique) # 949 unique values

These numbers, 1242 and 949, match up with the values I derived in the other footprints issue #13.
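
(For what it's worth, the duplicate-Name count can also be read directly off the series; this one-liner of mine should return 1242 - 949 = 293:)

import geopandas as gpd

recovered_fps = gpd.read_file('/u/julietcohen/integrate_rec_fp/recovered_footprints/recovered_footprints.shp')

# Rows whose Name already appeared earlier in the column
num_dup_names = recovered_fps['Name'].duplicated().sum()
print(num_dup_names)  # 293, i.e. 1242 total rows minus 949 unique Names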

Considering that the footprint ID codes only represent a portion of the IWP filename, does this mean they have already been processed, making it harder to tell which footprint ID belongs to which IWP file? It looks like the scene IDs have already been fed through Robyn's function id_to_name():

def id_to_name(id):
    """Convert the scene ID from input filename to 'Name' attribute in footprint shapefiles"""
    return '_'.join(id.split('_')[:-2])
print example of a random footprint ID code
import geopandas as gpd

# Read in the recovered footprints that we need to integrate with their geometries
recovered_fps = gpd.read_file('/u/julietcohen/integrate_rec_fp/recovered_footprints/recovered_footprints.shp')

# example of a recovered footprint ID code
# convert column of names of recovered footprints into a list without date or geometries
recovered_fps_names = recovered_fps['Name'].to_list()
recovered_fps_names[0]

'WV03_20160803221654_1040010021643200_16AUG03221654-M1BS-500854877100_01_P002'

print example of a random IWP file basename
import os
from pathlib import Path

# Where is all the original input data, before any staging or processing
iwp_dir = Path('/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/')

# example of a filename of an IWP file, without leading filepath
# create list of all IWP files
input_file_list = sorted(iwp_dir.rglob('*.shp'))

iwp_filepath = str(input_file_list[0])
iwp_basename = os.path.basename(iwp_filepath).split('.')[0]
print(iwp_basename)

GE01_20110826213903_10504100013F3800_11AUG26213903-M1BS-054019163020_01_P001_u16rf3413_pansh
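
As a quick sanity check (my own snippet), running that IWP basename through id_to_name() drops the trailing _u16rf3413_pansh parts and yields a string in the same format as the recovered footprint Names:

def id_to_name(id):
    """Convert the scene ID from input filename to 'Name' attribute in footprint shapefiles"""
    return '_'.join(id.split('_')[:-2])

iwp_basename = 'GE01_20110826213903_10504100013F3800_11AUG26213903-M1BS-054019163020_01_P001_u16rf3413_pansh'
print(id_to_name(iwp_basename))
# GE01_20110826213903_10504100013F3800_11AUG26213903-M1BS-054019163020_01_P001
# same pattern as a recovered footprint Name, e.g.
# WV03_20160803221654_1040010021643200_16AUG03221654-M1BS-500854877100_01_P002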

robyngit (Member, Author) commented:

@julietcohen my first thought is that these might be cases where the footprint for a file is a non-contiguous area (i.e. the footprint is made of two or more polygons that do not touch). I suggest selecting one duplicated example and plotting the geometries along with the matching shapefile.

julietcohen (Collaborator) commented:

IWP files that all share the same footprint ID, which is one of the recovered footprints:

  1. '/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/alaska/167_168_iwp/WV02_20140726005256_1030010034C98F00_14JUL26005256-M1BS-504637027060_01_P003_u16rf3413_pansh/WV02_20140726005256_1030010034C98F00_14JUL26005256-M1BS-504637027060_01_P003_u16rf3413_pansh.shp'

  2. '/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/alaska/178_179_180_iwp/WV02_20140726005256_1030010034C98F00_14JUL26005256-M1BS-504637027060_01_P003_u16rf3413_pansh/WV02_20140726005256_1030010034C98F00_14JUL26005256-M1BS-504637027060_01_P003_u16rf3413_pansh.shp'

  3. '/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/high_ice/alaska/207_208_209_223_224_iwp/WV02_20140726005256_1030010034C98F00_14JUL26005256-M1BS-504637027060_01_P003_u16rf3413_pansh/WV02_20140726005256_1030010034C98F00_14JUL26005256-M1BS-504637027060_01_P003_u16rf3413_pansh.shp'

Note that all 3 of these .shp file basenames are identical, but they differ in which subdir of the alaska dir they live in. I already noted this in this comment in issue #13.

The associated recovered footprint is found at both index 2 and index 117 of the recovered_footprints dataframe.

The dates and geometries of these recovered footprints are also identical:

str(fp_of_interest['geometry'].iloc[0]) == str(fp_of_interest['geometry'].iloc[1])

Output: True

I'm confused about the Python syntax for plotting these IWP shapefiles on top of the polygon geometry for this footprint. I am able to produce a basic plot, but it is not helpful for our purposes.

I'll keep trying with matplotlib.
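
In case it helps, a minimal overlay sketch with geopandas/matplotlib (my own, untested; it assumes fp_of_interest holds the duplicated footprint rows and uses the first IWP path listed above; colors and figure size are arbitrary):

import geopandas as gpd
import matplotlib.pyplot as plt

iwp_path = ('/scratch/bbki/kastanday/maple_data_xsede_bridges2/glacier_water_cleaned_shp/'
            'high_ice/alaska/167_168_iwp/'
            'WV02_20140726005256_1030010034C98F00_14JUL26005256-M1BS-504637027060_01_P003_u16rf3413_pansh/'
            'WV02_20140726005256_1030010034C98F00_14JUL26005256-M1BS-504637027060_01_P003_u16rf3413_pansh.shp')
iwp = gpd.read_file(iwp_path)

# Reproject the IWP detections to the footprint's CRS so they overlay correctly
iwp = iwp.to_crs(fp_of_interest.crs)

# Plot the footprint outline first, then the IWP polygons on the same axes
ax = fp_of_interest.plot(facecolor='none', edgecolor='blue', figsize=(10, 10))
iwp.plot(ax=ax, facecolor='none', edgecolor='red', linewidth=0.2)
plt.show()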

julietcohen (Collaborator) commented:

Closing this issue because other team members are going to re-process the IWP files and re-structure the footprints directory to match those new files. This issue is only applicable to the current IWP dataset, which will likely not be used in the future.
