
Read scripts

  • Lead/prep: Understand the data model.
  • Do: Transform the dataset into a standardized id-variable-type-entry tuple.
  • Measure: Are the transformations that the data went through clear (good comments)?

The intent of this phase of the workflow is to access the data and transform it into a common data model (or data structure). While each read script is unique, there are three general phases:

  1. download the data and associated documentation,
  2. read the data into the environment,
  3. merge all the data into a single long table.
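As a rough sketch of that overall structure (the function, file, and helper names here are hypothetical, and the phase bodies are spelled out in the sections below), a read script might be organized as a single function that walks through these three phases:

    # Hypothetical read script skeleton; downloadExampleData and shoestringExampleData
    # stand in for the download and pivot/join code described in the sections below.
    readExampleData <- function(dataDir) {
      # 1) Download the data and associated documentation into dataDir
      downloadExampleData(dataDir)

      # 2) Read the data into the environment as character strings
      primaryData <- readr::read_csv(file.path(dataDir, 'example_layers.csv'),
                                     col_types = readr::cols(.default = readr::col_character()))

      # 3) Merge all the data into a single long id-variable-type-entry table
      shoestringExampleData(primaryData)
    }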

Downloads

Data should be sourced from a publicly available archive with permanent URLs. In the download phase the data is downloaded from the archive (generally using the download.file function in R) to a local folder identified by the user. All associated tables and documentation in that data package should be downloaded if possible (with reasons for any deviations noted in the comments). For some particularly large files it is advisable to print a message before the download asking the user to manually download the data to the desired folder, but a download command should generally still be included for clarity. Some files are hidden behind splash pages (i.e. pop-ups that ask you to affirm a specific usage or log in for access) and are not accessible via a public download URL. When a public download URL is not available and the expected files are not found locally, throw an error using stop and instruct the user how to download the data manually.
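As a minimal sketch of this phase (the folder, URL, and file names below are placeholders, not a real data package):

    # Placeholder folder, URL, and file names for illustration only.
    dataDir <- 'temp/exampleData'
    dir.create(dataDir, recursive = TRUE, showWarnings = FALSE)

    # Publicly archived file: download it if it is not already in the local folder.
    layerFile <- file.path(dataDir, 'example_layers.csv')
    if (!file.exists(layerFile)) {
      download.file(url = 'https://example-archive.org/package/example_layers.csv',
                    destfile = layerFile)
    }

    # File behind a splash page (no public download URL): check for it locally and
    # stop with manual download instructions if it is missing.
    siteFile <- file.path(dataDir, 'example_sites.csv')
    if (!file.exists(siteFile)) {
      stop('Please download example_sites.csv manually from the archive and place it in: ', dataDir)
    }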

Loading the data

Data should then be loaded into the R environment. Currently this project uses the readr::read_* family of functions, and columns should be read in as text strings whenever possible to avoid numerical error, mangled date strings, and other issues that will be taken care of in the QA/QC phase. In addition, the data annotations for the dataset should be read in as well.
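Continuing the sketch above, the primary data and its annotations might be read in with every column forced to a character string (file names are placeholders):

    # Read all columns as character strings; type conversion is left to the QA/QC phase.
    primaryData <- readr::read_csv(file.path(dataDir, 'example_layers.csv'),
                                   col_types = readr::cols(.default = readr::col_character()))

    # Read the matching data annotations the same way.
    dataAnnotation <- readr::read_csv(file.path(dataDir, 'example_annotations.csv'),
                                      col_types = readr::cols(.default = readr::col_character()))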

This collection of data frames is sometimes referred to as Level 0 data (following the Environmental Data Initiative's Thematic Standardization approach: https://edirepository.org/resources/thematic-standardization).

Shoestring the data

Finally the data tables need to be pivoted to create id-variable-type-entry tables and then joined with the other tables. This can be done in a few ways and we will walk through one in detail below:

  1. Join the data tables on their common columns and foreign keys using dplyr::full_join. The exact column matches and order of the joins matter and will vary for each data package, but the goal here is to create a single wide data table with data grouped by row (i.e. one row corresponds to a single layer, with the associated geospatial information repeated for each layer in the site).
    • If a column has the same name but different information across two tables, you will need to rename those columns (a suggestion is to prepend the table name to the column name). You will need to mirror this renaming in the annotations table in the codebase.
  2. Identify the identifying columns for the table. This is often a (site_id, layer_id) set that uniquely identifies individual rows, but it could be other identifying columns.
    • If a column is both an identifier and information-holding (e.g. a place name or depth interval), you will need to duplicate the column, creating one column for identification and a second column for the information.
    • If there is no unique identifier it may be necessary to add one at this point.
  3. Shoestring or tidyr::pivot_longer the columns of each data table that are not identifying columns. This should result in a table with the columns (identifying columns, column name, column entry).
  4. Join the data annotations with the primary data to get the of_variable and is_type notations (see Annotations for how to structure this table).
    1. Filter the annotations to the primary data reference rows (e.g. dplyr::filter(.data = dataAnnotation, with_entry == '--')).
    2. Select the column name, variable, and type columns (e.g. dplyr::select(.data = shortAnnotation, column_id, of_variable, is_type)).
    3. Join the resulting table with the primary data that you just shoestringed above on column_id.
  5. Rename the 'column entry' column to with_entry to match the structure of the annotations table.

The final result should be a single primary data table that looks similar to the data annotations structure with the headers (column_id, of_variable, is_type, with_entry).
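Putting steps 1 through 5 together, a rough sketch might look like the following. The table names (siteData, layerData, dataAnnotation) stand in for the data frames loaded earlier, the column names are placeholders, and the exact joins will differ for each data package:

    # 1) Join the site-level and layer-level tables into one wide table.
    wideData <- dplyr::full_join(siteData, layerData, by = 'site_id')

    # 2-3) Pivot everything except the identifying columns into long format.
    #      Naming the value column with_entry here also covers step 5.
    longData <- tidyr::pivot_longer(wideData,
                                    cols = -c(site_id, layer_id),
                                    names_to = 'column_id',
                                    values_to = 'with_entry')

    # 4) Pull the variable and type notations for the primary data reference rows...
    shortAnnotation <- dplyr::filter(.data = dataAnnotation, with_entry == '--')
    shortAnnotation <- dplyr::select(.data = shortAnnotation, column_id, of_variable, is_type)

    # ...and join them onto the shoestringed primary data.
    primaryLong <- dplyr::full_join(longData, shortAnnotation, by = 'column_id')

    # primaryLong now carries (site_id, layer_id, column_id, of_variable, is_type, with_entry).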

FAQ

What is a data model?

A data model refers to the structure of the data. Generally this includes the tables and their associated columns, as well as any associated metadata that is not in the primary tables.

Why work with archived data?

Generally we strive to work only with data sets that are publicly available on an archive. Data on public archives is stable and not going to change. This stability means that work invested in annotations and in transforming the data tables will not need to be repeated to keep up with an evolving data model. If the data provider (or someone else) wants to work with non-archived data, we recommend forking this repository and developing the scripts in the forked repository.

But what about the units?

Units for some measurements are tied to the use of the data. At the read script stage we are looking to preserve the original data model. Unit conversions are part of the integration step, since the target units may change for each data collection purpose.