
add L2 #80

Merged: 27 commits merged into main from l2data on Dec 13, 2023

Conversation

Geet-George
Owner

No description provided.

convert to SI while going from L1 to L2 (also introduces the concept of `_interim_l2_ds`)

This function will use the helper functions to convert the variables provided as an argument to SI units.

The concept of an attribute `_interim_l2_ds` also starts with this commit. The purpose of this attribute is to carry forward changes from L1 to L2 as an interim product, such that none of the previous products are affected. This same product keeps getting updated as it goes through the different steps necessary to go from L1 to L2. The prefixed underscore indicates that this attribute is for internal use only and not for users to access. The attribute is subject to change depending on the different steps taken. Hence, any step from L1 to L2 will first check whether the attribute already exists. If it does, the attribute is overwritten after the change. If it does not, the starting point for the attribute is the `aspen_ds` attribute, i.e. `_interim_l2_ds` will be set to its starting value, `self.aspen_ds`.
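As a rough illustration, here is a minimal sketch of that pattern. The method name `convert_to_si`, the `si_converter` mapping, and the unit conversions are assumptions; only `aspen_ds` and `_interim_l2_ds` come from the commit itself.

```python
import xarray as xr

# Stand-ins for the helper functions mentioned in the commit message;
# the variable names and conversions are assumptions.
si_converter = {
    "rh": lambda da: da / 100.0,      # % -> fraction (assumed)
    "pres": lambda da: da * 100.0,    # hPa -> Pa (assumed)
    "tdry": lambda da: da + 273.15,   # degC -> K (assumed)
}


class Sonde:
    def __init__(self, aspen_ds: xr.Dataset) -> None:
        self.aspen_ds = aspen_ds  # the L1 product, never modified

    def convert_to_si(self, variables=("rh", "pres", "tdry")):
        # Start from the interim product if an earlier L1->L2 step already
        # created it; otherwise fall back to the untouched L1 dataset.
        ds = getattr(self, "_interim_l2_ds", self.aspen_ds)
        for var in variables:
            if var in ds.variables:
                ds = ds.assign({var: si_converter[var](ds[var])})
        # Overwrite (or create) the interim attribute so the next step picks
        # up these changes without affecting any previous product.
        self._interim_l2_ds = ds
        return self
```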
@Geet-George linked an issue on Dec 11, 2023 that may be closed by this pull request: Create and save L2 data
add only l2 vars from aspen to interim_l2_ds

- A default dict is read from the helper module.
- If the config provides the dict as a string, it is parsed with `ast.literal_eval`.
- If the provided dict has a `rename_to` key, the variable is renamed (the default dict contains `rename_to` for some variables).
- Docstrings are also changed to reflect the above changes.
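A sketch of that selection logic, under the assumption that the default dict lives in the helper module with entries like the illustrative ones below (the actual variable names and function signatures may differ):

```python
import ast

# Assumed default mapping in the helper module: ASPEN variable name ->
# per-variable options, optionally including a 'rename_to' key.
l2_variables = {
    "u_wind": {"rename_to": "u"},
    "v_wind": {"rename_to": "v"},
    "tdry": {"rename_to": "ta"},
    "pres": {},
}


def get_l2_variables(config_value=None):
    """Return the L2 variable dict, letting the config override the default."""
    if config_value is None:
        return l2_variables
    # The config file can only store the dict as a string, so parse it safely.
    return ast.literal_eval(config_value)


def add_l2_variables(sonde, variables):
    """Keep only the L2 variables, renaming where 'rename_to' is given."""
    ds = getattr(sonde, "_interim_l2_ds", sonde.aspen_ds)
    ds = ds[[var for var in variables if var in ds.variables]]
    renames = {
        var: opts["rename_to"]
        for var, opts in variables.items()
        if "rename_to" in opts and var in ds.variables
    }
    sonde._interim_l2_ds = ds.rename(renames)
    return sonde
```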
The Data_Directory should be of the structure where each directory in it stands for a platform, and the directories within a platform's directory are individual flight directories. This will be made mandatory. The package will then auto-infer platform names (the `platforms` attribute) based on the platform directories' names (with the `get_platforms` function in the `pipeline` module). This value will go into the dataset attributes (`platform_id`) along with `flight_id`.

For each `Platform` object, the `create_and_populate_flight_object` function in the `pipeline` module will now get all corresponding `flight_id` values by looping over all directory names in a platform's directory and will process all sondes in flight-wise batches. The function now also has only one output, i.e. "sondes"; the "flight" output is no longer relevant.

The following is the logic behind the commits in this PR:

## Data directory structure

The following is taken from the current documentation:

> The Data_Directory is a directory that includes all data from a single campaign. Therein, data from individual flights are stored in their respective folders with their name in the format of YYYYMMDD, indicating the flight-date. In case of flying through midnight, consider the date of take-off. In case there are multiple flights in a day, consider adding alphabetical suffixes to distinguish chronologically between flights, e.g. 20200202-A and 20200202-B would be two flights on the same day in the same order that they were flown.

This system excludes the possibility of having multiple platforms in a single campaign. Batching by campaign can mean one of two things: for a single-platform campaign, batching across all flights in the campaign; for a multi-platform campaign, batching across all flights of each platform and then, in turn, across all platforms of the campaign. Currently, batching is only possible for all sondes in a single flight, which is done by providing a mandatory `flight_id` in the config file.

## Suggested changes:

The Data_Directory should be of the structure where each directory in it stands for a platform, and the directories within a platform's directory are individual flight directories. This will be made mandatory. The package will then auto-infer platform names (the `platforms` attribute) based on the platform directories' names. This value will go into the dataset attributes (e.g. `platform_id`) and, if the user wishes, also into the filenames of the dataset.
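For concreteness, such a data directory might look like this (the platform and flight names are placeholders, following the YYYYMMDD convention quoted above):

```
data_directory/
├── HALO/              # platform directory -> platform_id "HALO"
│   ├── 20200119/      # flight directory   -> flight_id "20200119"
│   └── 20200124/
└── WP3D/
    ├── 20200209-A/    # two flights on the same day, suffixed chronologically
    └── 20200209-B/
```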

If the user wishes to provide custom `platforms` values, they can be provided as an attribute under the `MANDATORY` section of the config file, but then a separate `platform_directory_names` attribute must also be provided, giving the platforms' data directory names in the same sequence as the names in `platforms`. If there are multiple platforms in the campaign, the `platforms` values provided by the user must be comma-separated, e.g. `halo,wp3d` (preceding and succeeding spaces will become part of the platform name, e.g. when setting the `platform_id`). If there is only one platform, provide a name with no commas.
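A hypothetical excerpt of such a config file, assuming the INI-style format the package reads (the path and names are placeholders):

```ini
[MANDATORY]
data_directory = /path/to/campaign_data
platforms = halo,wp3d
platform_directory_names = HALO,WP3D
```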

Now, the only way to batch process will be to process all sondes of a campaign, i.e. all sondes from all flights of all platforms in the campaign. If the user wants a subset of the batching, they can choose to include only a limited set of directories in the `data_directory` they provide in the config file. However, considering that the processing is not compute-heavy, no use-cases come to my mind that warrant a separate mode of batch processing.

## Now how to go about doing this?

The function `create_and_populate_flight_object` in the `pipeline` module processes all sondes of a flight.

A new function in the `pipeline` module, `get_platforms`, will get the `platforms` value(s) either from the directory names in `data_directory` or from the user-provided `platforms` values matched to their directory names (`platform_directory_names`). For each platform, a `Platform` object will be created, with its `platform_id` attribute coming from the `platforms` attribute.
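A minimal sketch of what `get_platforms` could look like; the signature, the `Platform` fields, and the matching-by-position logic are assumptions based on the description above:

```python
import os
from dataclasses import dataclass


@dataclass
class Platform:
    platform_id: str
    directory: str


def get_platforms(data_directory, platforms=None, platform_directory_names=None):
    """Create one Platform object per platform directory in data_directory."""
    if platforms is None:
        # Auto-infer platform names from the platform directories' names.
        pairs = [
            (d, d)
            for d in sorted(os.listdir(data_directory))
            if os.path.isdir(os.path.join(data_directory, d))
        ]
    else:
        # User-provided names, matched to directory names by position.
        pairs = list(zip(platforms.split(","), platform_directory_names.split(",")))
    return [
        Platform(platform_id=pid, directory=os.path.join(data_directory, dirname))
        for pid, dirname in pairs
    ]
```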

For each `Platform` object, another function in the pipeline module will get all corresponding `flight_id` values by looping over all directory names in a platform's directory and will process all sondes in flight-wise batches.
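A sketch of that flight-wise batching, with `create_and_populate_flight_object` stubbed out since only its role, not its signature, is described above:

```python
import os


def create_and_populate_flight_object(flight_dir, platform_id, flight_id):
    # Stand-in for the pipeline function described above; it would create
    # Sonde objects for every sonde file in `flight_dir` and return them.
    ...


def process_platform(platform):
    """Process all sondes of a platform in flight-wise batches."""
    for flight_id in sorted(os.listdir(platform.directory)):
        flight_dir = os.path.join(platform.directory, flight_id)
        if not os.path.isdir(flight_dir):
            continue
        # One batch per flight; only "sondes" are returned, no "flight" object.
        sondes = create_and_populate_flight_object(
            flight_dir, platform.platform_id, flight_id
        )
```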

After the flight-wise batch processing is done, all L2 files in the corresponding `flight_id` directories will be populated with L2 datasets that contain the corresponding `platform_id` and `flight_id` attributes. For creating L3 and onwards, the script will just look for all L2 files in the `data_directory` and get the flight and platform information from the `platform_id` and `flight_id` attributes of the L2 files.
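A sketch of that discovery step for L3, assuming the L2 files are NetCDF files somewhere under the platform/flight directories (the glob pattern is a placeholder):

```python
import glob

import xarray as xr


def collect_l2_files(data_directory):
    """Map each L2 file to the (platform_id, flight_id) stored in its attributes."""
    info = {}
    for path in sorted(glob.glob(f"{data_directory}/*/*/*L2*.nc")):
        with xr.open_dataset(path) as ds:
            info[path] = (ds.attrs["platform_id"], ds.attrs["flight_id"])
    return info
```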
This commit introduces a new method, `get_flight_attributes`, to the `Sonde` class. This method reads flight attributes from the A-file (`self.afile`) and adds them as attributes to the sonde object. The flight attributes to be read, and their corresponding renamed attribute names for the L2 file, are defined in a dictionary, `l2_flight_attributes_map`.

The `l2_flight_attributes_map` dictionary has been added to the helper module. It maps the original attribute names in the A-file to the new attribute names to be used in the L2 file.
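A minimal sketch of the idea, assuming an `Attribute: value` line format in the A-file; the mapping entries shown are illustrative, not the actual contents of `l2_flight_attributes_map`:

```python
# Illustrative entries only; the real dict lives in the helper module.
l2_flight_attributes_map = {
    "True Air Speed (m/s)": "true_air_speed_(ms-1)",
    "Ground Speed (m/s)": "ground_speed_(ms-1)",
    "Aircraft Geopotential Altitude (m)": "aircraft_geopotential_altitude_(m)",
}


class Sonde:
    def __init__(self, afile):
        self.afile = afile  # path to the A-file, as in the commit message

    def get_flight_attributes(self, attrs_map=l2_flight_attributes_map):
        """Read flight attributes from the A-file and attach them to the sonde."""
        flight_attrs = {}
        with open(self.afile) as f:
            for line in f:
                for a_name, l2_name in attrs_map.items():
                    # Assumed A-file line format: "<attribute>: <value>"
                    if line.startswith(a_name):
                        flight_attrs[l2_name] = line.split(":", 1)[1].strip()
        self.flight_attrs = flight_attrs
        return self
```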
For some of the global attributes, the attributes from which the values are taken change name depending on which ASPEN version was used to create the L1 products. Therefore, the different attribute names have been kept as options.

For older versions of ASPEN, the `SondeId` attribute is not created when L1 is produced. Hence, a workaround for such older files is to retrieve the serial ID from the `SoundingDescription` attribute. Ideally, the L1 files would be reprocessed with a newer ASPEN version.
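A sketch of how such version-tolerant lookups could work; apart from `SondeId` and `SoundingDescription`, the attribute names and the regex pattern are assumptions:

```python
import re


def get_global_attr(ds, candidates):
    """Return the first global attribute found among several possible names."""
    for name in candidates:
        if name in ds.attrs:
            return ds.attrs[name]
    return None


def get_sonde_id(ds):
    sonde_id = ds.attrs.get("SondeId")
    if sonde_id is None:
        # Older ASPEN versions omit SondeId; recover the serial ID from the
        # SoundingDescription attribute instead (the pattern is an assumption).
        match = re.search(r"sonde (\w+)", ds.attrs["SoundingDescription"])
        sonde_id = match.group(1) if match else None
    return sonde_id


# Usage, with assumed candidate names for a version-dependent attribute:
# aspen_version = get_global_attr(ds, ["AspenVersion", "AvapsEditorVersion"])
```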
@Geet-George marked this pull request as ready for review on December 13, 2023, 11:28
@Geet-George merged commit 2621df7 into main on Dec 13, 2023 (1 check passed)
@Geet-George deleted the l2data branch on December 13, 2023, 11:29