From e1f2209de8732899d05c4166cef5855408c6e620 Mon Sep 17 00:00:00 2001 From: "Luke W. Johnston" Date: Wed, 9 Oct 2024 20:59:53 +0200 Subject: [PATCH 01/12] docs: :construction: draft of creating and managing resources --- docs/guide/resources.qmd | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) create mode 100644 docs/guide/resources.qmd diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd new file mode 100644 index 000000000..4fdb12882 --- /dev/null +++ b/docs/guide/resources.qmd @@ -0,0 +1,25 @@ +--- +title: "Creating and managing data resources" +order: 2 +--- + +In each [data package](/docs/design/implementation/outputs.qmd) are [data resources](/docs/design/implementation/outputs.qmd), +which contain a conceptually standalone set of data. This page shows +you how to create and manage data resources inside a data package using Sprout. +We assume that a data package has already been [created](packages.qmd). + + + +## Creating a data resource + +::::: panel-tabset +### Python + +### CLI + +### Web App + +::: callout-warning +In development. +::: +::::: From 43cc868623697faaddaf5d92ae2a7b842c173e62 Mon Sep 17 00:00:00 2001 From: "Luke W. Johnston" Date: Tue, 18 Feb 2025 10:06:55 +0100 Subject: [PATCH 02/12] docs: :memo: draft of resource guide --- docs/guide/resources.qmd | 313 +++++++++++++++++++++++++++++++++++++-- 1 file changed, 301 insertions(+), 12 deletions(-) diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd index 4fdb12882..82eb18fc7 100644 --- a/docs/guide/resources.qmd +++ b/docs/guide/resources.qmd @@ -1,25 +1,314 @@ --- title: "Creating and managing data resources" order: 2 +jupyter: python3 +execute: + eval: false --- -In each [data package](/docs/design/implementation/outputs.qmd) are [data resources](/docs/design/implementation/outputs.qmd), -which contain a conceptually standalone set of data. This page shows -you how to create and manage data resources inside a data package using Sprout. 
-We assume that a data package has already been [created](packages.qmd). +In each [data package](/docs/design/interface/outputs.qmd) are [data +resources](/docs/design/interface/outputs.qmd), which contain a +conceptually standalone set of data. This page shows you how to create +and manage data resources inside a data package using Sprout. We assume +that a data package has already been [created](packages.qmd). - +{{< include _preamble.qmd >}} + +::: callout-important +Data resources can only be created from [tidy +data](https://design.seedcase-project.org/data/). Before you can store +it, you need to process it into a tidy format, ideally using Python so +that you have a record of the steps taken to clean and transform the +data. +::: + +```{python setup} +#| include: false +# This `setup` code chunk loads packages and prepares the data. +import seedcase_sprout.core as sp +import tempfile +from urllib.request import urlretrieve + +temp_path = tempfile.TemporaryDirectory() +package_path = sp.create_package_properties( + properties=sp.example_package_properties(), + path=temp_path / "diabetes-study" +) +readme = sp.build_readme_text(sp.example_package_properties()) +sp.write_text(readme, package_path.parent) + +# Since the path leads to the datapackage.json file, for later functions we need the folder instead. +package_path = package_path.parent + +# TODO: Maybe eventually move this over into Sprout as an example dataset, rather than via a URL. +# Download the example data and save to a data-raw folder in the temp path. +url = "https://raw.githubusercontent.com/seedcase-project/data/refs/heads/main/patients/patients.csv" +raw_data_path = temp_path / "patients.csv" +urlretrieve( + url, + raw_data_path +) +``` + +Making a data resource requires that you actually have data that can be +a resource in the first place. Generated or collected data always starts +out in a bit of a "raw" shape that needs some working. 
For this guide, +we have a raw (but fake) data file that we've already made tidy and that +looks like: + +```{python} +#| echo: false +with open(raw_data_path, "r") as f: + print(f.read()) +``` + +We've saved this data file in a path object called `raw_data_path`: + +```{python} +print(raw_data_path) +``` + +The thing you want to do to your raw data is get it into the data +package to make it easier for yourself and others to use later one. So +the steps we'll take to get this raw data into the structure offered by +Sprout are: + +1. Create the properties for the resource, using the original raw data + as a starting point and edit as needed. +2. Create a folder to store the (processed) data resource in our + package, as well as having a folder for the (tidy) raw data. +3. Save the properties information and path to the new data resource + into the `datapackage.json` file. +4. Re-build the data package's `README.md` file from the updated + `datapackage.json` file. +5. If you need to edit the properties at a later point, ideally you do + it using `edit_resource_properties()` and then re-build the + `datapackage.json` file. + +Before we start, we need to import the Sprout package as well as other +helper packages: + +```{python} +import seedcase_sprout.core as sp + +# For pretty printing of output +from pprint import pprint + +# TODO: This could be a wrapper helper function instead +# To be able to write multiline strings without indentation +from textwrap import dedent +``` + +## Extract, then edit resource properties from raw data + +Because the resource's properties are useful for many later functions, +let's first get that created and ready to go. While you can create a +resource properties object manually using `ResourceProperties`, it is +quite intensive and time-consuming. The better and easier approach is to +extract as much information as possible from the raw data to create an +initial resource properties object by using +`extract_resource_properties()`. 
Then, you can edit the properties as +needed. + +Let's start with extracting the resource properties from the raw data. +The function is fairly good at getting and guessing the right +information, but it is very far from perfect and it just cannot guess +certain things from the data. + +```{python} +# Location where the raw data is stored: +print(raw_data_path) +resource_properties = extract_resource_properties( + data_path=raw_data_path +) +pprint(resource_properties) +``` + +You may be able to see that some things are missing, for instance, the +individual columns (called fields) don't have any descriptions. This we +have to manually add ourselves. Let's fill in the description for all +the fields in the resource: + +```{python} +# TODO: Need to consider/design how editing can be done in an easier, user-friendly way. +# TODO: Add more detail when we know what can and can't be extracted. +``` ## Creating a data resource -::::: panel-tabset -### Python +Now that we have the properties for the resource, we can create the +resource itself within the data package. What this means is that we want +a folder for the specific resource (since we may have many more data +resources to add). + +Our package has already been created (using the steps from the [package +guide](packages.qmd)), with the path set as the variable: + +```{python} +print(package_path) +``` + +We can look inside that path to see: + +```{python} +print(list(package_path.glob("*"))) +``` + +Which shows that the `datapackage.json` file and the `README.md` file +have already been created. So we will create the resource structure in +this package. -### CLI +```{python} +resource_paths=sp.create_resource_structure( + path=package_path +) +print(resource_paths) +``` -### Web App +This now has created the folder structure needed and is ready to fill it +with our raw data! Next, we'll setup the resource properties so that it +is ready to be saved into the `datapackage.json` file. 
We can use the +`path_resource()` helper function to always give us the correct location +to the specific resource's folder path. In this case, our resource is +the first one in the package, so we can use `path_resource(1)`. -::: callout-warning -In development. +```{python} +resource_properties = sp.create_resource_properties( + properties=resource_properties, + path=package_path / sp.path_resource(1) +) +pprint(resource_properties) +``` + +::: callout-tip +If you want to see the list of resources available in your data package +via Python code (rather than looking at it directly in the file system), +you can use the `list_resources()` function. + +```{python} +#| eval: false +print(sp.list_resources()) +``` ::: -::::: + +This has setup the properties to be ready to add to the +`datapackage.json` file. Next, we save that properties file into the +`datapackage.json` file by writing it to the `datapackage.json` file: + +```{python} +sp.write_resource_properties( + properties=resource_properties, + path=sp.path_properties() +) +``` + +We can check the contents of the `datapackage.json` file to see that the +resource properties have been added: + +```{python} +pprint(sp.read_properties(package_path / sp.path_properties())) +``` + +## Storing a backup of the raw data + +Before we start processing the raw data into a Parquet file, it is a +good idea to store a backup of the raw data. This is useful if you need +to re-process the data at a later point, troubleshoot any issues, update +incorrect values, or if you need to compare the stored raw data to your +original raw data. + +As we showed above, the data is stored in the path that we've set as +`raw_data_path`. 
We can store this data in the resource's folder by +using: + +```{python} +sp.write_resource_data_to_raw( + data_path=raw_data_path, + resource_properties=resource_properties +) +``` + +This function uses the properties object to determine where to store the +raw data, which is in the `raw/` folder of the resource's folder. We can +check the newly added file by using: + +```{python} +print(sp.path_resource_raw_files(1)) +``` + +## Building the Parquet data resource file + +Now that we've stored the raw data file, we can build the Parquet file +that will be used as the data resource. This Parquet file is built from +the raw data file that we've stored in the resource's folder. + +```{python} +parquet_path = sp.build_resource_parquet( + raw_files=sp.path_resource_raw_files(1), + path=sp.path_resource_data(1) +) +print(parquet_path) +``` + +## Re-building the README file + +One of the last steps to finish with adding a new data resource is to +re-build the `README.md` file for the data package. To allow some +flexibility with what gets added to the README text, this next function +will only *build the text*, but not write it to the file. This allows +you to add additional information to the README text before writing it +to the file. + +```{python} +readme_text = sp.build_readme_text( + properties=sp.read_properties(package_path / sp.path_properties()) +) +``` + +In this case, we don't want to add anything else, so we'll write the +README text to the `README.md` file: + +```{python} +sp.write_text( + text=readme_text, + # TODO: Make a helper function for this path? + path=package_path / "README.md" +) +``` + +## Edit resource properties + +After having created a resource, you may need to make edits to the +properties. While technically you can do this manually by opening up the +`datapackage.json` file and editing it, we've made these functions to +help do it in an easier and correct way. 
Using the +`edit_resource_properties()` function, you give it the path to the +current properties and then create a new `ResourceProperties` object +with any changes you want to make. Anything in the new properties object +will overwrite fields in the old properties object. This function does +not write back, it only returns the new properties object. + +```{python} +resource_properties = sp.edit_resource_properties( + # Helper function + path=sp.path_properties(), + properties=sp.ResourceProperties( + title="Basic characteristics of patients" + ) +) +pprint(resource_properties) +``` + +To write back, you use the `write_resource_properties()` function: + +```{python} +sp.write_resource_properties( + properties=resource_properties, + path=sp.path_properties() +) +``` + +```{python} +#| include: false +temp_path.cleanup() +``` From a9450005d4af2aa5bc778e57b857c86a39c7d9f6 Mon Sep 17 00:00:00 2001 From: "Luke W. Johnston" Date: Thu, 20 Feb 2025 09:19:14 +0100 Subject: [PATCH 03/12] docs: :pencil2: clarifications from review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Signe Kirk Brødbæk <40836345+signekb@users.noreply.github.com> --- docs/guide/resources.qmd | 47 +++++++++++++++++++++------------------- 1 file changed, 25 insertions(+), 22 deletions(-) diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd index 82eb18fc7..54f210c5d 100644 --- a/docs/guide/resources.qmd +++ b/docs/guide/resources.qmd @@ -68,8 +68,8 @@ We've saved this data file in a path object called `raw_data_path`: print(raw_data_path) ``` -The thing you want to do to your raw data is get it into the data -package to make it easier for yourself and others to use later one. So +Putting your raw data into a data +package makes it easier for yourself and others to use later one. 
So the steps we'll take to get this raw data into the structure offered by Sprout are: @@ -77,15 +77,15 @@ Sprout are: as a starting point and edit as needed. 2. Create a folder to store the (processed) data resource in our package, as well as having a folder for the (tidy) raw data. -3. Save the properties information and path to the new data resource +3. Save the properties of and path to the new data resource into the `datapackage.json` file. 4. Re-build the data package's `README.md` file from the updated `datapackage.json` file. -5. If you need to edit the properties at a later point, ideally you do - it using `edit_resource_properties()` and then re-build the +5. If you need to edit the properties at a later point, you can use + `edit_resource_properties()` and then re-build the `datapackage.json` file. -Before we start, we need to import the Sprout package as well as other +Before we start, we need to import Sprout as well as other helper packages: ```{python} @@ -116,8 +116,6 @@ information, but it is very far from perfect and it just cannot guess certain things from the data. ```{python} -# Location where the raw data is stored: -print(raw_data_path) resource_properties = extract_resource_properties( data_path=raw_data_path ) @@ -125,8 +123,10 @@ pprint(resource_properties) ``` You may be able to see that some things are missing, for instance, the -individual columns (called fields) don't have any descriptions. This we -have to manually add ourselves. Let's fill in the description for all +individual columns (called fields) don't have any descriptions. We have to +manually add this ourselves. + +Let's fill in the description for all the fields in the resource: ```{python} @@ -138,25 +138,24 @@ the fields in the resource: Now that we have the properties for the resource, we can create the resource itself within the data package. 
What this means is that we want -a folder for the specific resource (since we may have many more data +a folder for the specific resource (since we may have more data resources to add). Our package has already been created (using the steps from the [package -guide](packages.qmd)), with the path set as the variable: +guide](packages.qmd)), with the path set as the variable `package_path`: ```{python} print(package_path) ``` -We can look inside that path to see: +We can look inside that path to see the current files and folders: ```{python} print(list(package_path.glob("*"))) ``` -Which shows that the `datapackage.json` file and the `README.md` file -have already been created. So we will create the resource structure in -this package. +This shows that the data package already includes a `datapackage.json` file and a `README.md` file. Now, we will create the resource structure in +this package: ```{python} resource_paths=sp.create_resource_structure( @@ -165,8 +164,8 @@ resource_paths=sp.create_resource_structure( print(resource_paths) ``` -This now has created the folder structure needed and is ready to fill it -with our raw data! Next, we'll setup the resource properties so that it +With the the resource folder structure created, we are now ready to fill it +with our raw data! Next, we'll set up the resource properties so that it is ready to be saved into the `datapackage.json` file. We can use the `path_resource()` helper function to always give us the correct location to the specific resource's folder path. In this case, our resource is @@ -191,7 +190,7 @@ print(sp.list_resources()) ``` ::: -This has setup the properties to be ready to add to the +This has set up the properties to be ready to add to the `datapackage.json` file. 
Next, we save that properties file into the `datapackage.json` file by writing it to the `datapackage.json` file: @@ -250,9 +249,13 @@ parquet_path = sp.build_resource_parquet( print(parquet_path) ``` +::: callout-tip +If you add more raw data to the resource later on, you can update this Parquet file to include all data in the raw folder using the `build_resource_parquet()` function like shown above. +::: + ## Re-building the README file -One of the last steps to finish with adding a new data resource is to +One of the last steps to finish adding a new data resource is to re-build the `README.md` file for the data package. To allow some flexibility with what gets added to the README text, this next function will only *build the text*, but not write it to the file. This allows @@ -266,7 +269,7 @@ readme_text = sp.build_readme_text( ``` In this case, we don't want to add anything else, so we'll write the -README text to the `README.md` file: +text to the `README.md` file: ```{python} sp.write_text( @@ -281,7 +284,7 @@ sp.write_text( After having created a resource, you may need to make edits to the properties. While technically you can do this manually by opening up the `datapackage.json` file and editing it, we've made these functions to -help do it in an easier and correct way. Using the +help do it in an easier way that ensures that the `datapackage.json` is still in a correct json format. Using the `edit_resource_properties()` function, you give it the path to the current properties and then create a new `ResourceProperties` object with any changes you want to make. 
Anything in the new properties object From 426a19aac17e8f79f688bbebb76d27ab932edd7f Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Thu, 20 Feb 2025 08:20:45 +0000 Subject: [PATCH 04/12] chore(pre-commit): :pencil2: automatic fixes --- docs/guide/resources.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd index 54f210c5d..aef193850 100644 --- a/docs/guide/resources.qmd +++ b/docs/guide/resources.qmd @@ -124,7 +124,7 @@ pprint(resource_properties) You may be able to see that some things are missing, for instance, the individual columns (called fields) don't have any descriptions. We have to -manually add this ourselves. +manually add this ourselves. Let's fill in the description for all the fields in the resource: From e34527a11e3f19e3b2b673ee32c8a8386cba2217 Mon Sep 17 00:00:00 2001 From: "Luke W. Johnston" Date: Thu, 20 Feb 2025 09:25:31 +0100 Subject: [PATCH 05/12] docs: :pencil2: clarifications from review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Signe Kirk Brødbæk <40836345+signekb@users.noreply.github.com> --- docs/guide/resources.qmd | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd index aef193850..e02ad6605 100644 --- a/docs/guide/resources.qmd +++ b/docs/guide/resources.qmd @@ -103,17 +103,16 @@ from textwrap import dedent Because the resource's properties are useful for many later functions, let's first get that created and ready to go. While you can create a -resource properties object manually using `ResourceProperties`, it is -quite intensive and time-consuming. The better and easier approach is to +resource properties object manually using `ResourceProperties`, it can be +quite intensive and time-consuming if you for example have many columns in your data. 
The better and easier approach is to extract as much information as possible from the raw data to create an -initial resource properties object by using +initial resource properties object with `extract_resource_properties()`. Then, you can edit the properties as needed. Let's start with extracting the resource properties from the raw data. The function is fairly good at getting and guessing the right -information, but it is very far from perfect and it just cannot guess -certain things from the data. +information, but it is very far from perfect and it cannot guess things that are not in the data itself, like a description of what the data contains or the unit of the data. ```{python} resource_properties = extract_resource_properties( From 741edcc159f7066c201486a1a843c3f37b56e0f4 Mon Sep 17 00:00:00 2001 From: "Luke W. Johnston" Date: Thu, 20 Feb 2025 09:29:27 +0100 Subject: [PATCH 06/12] docs: :memo: small updates to resource guide --- docs/guide/resources.qmd | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd index e02ad6605..2f2868e72 100644 --- a/docs/guide/resources.qmd +++ b/docs/guide/resources.qmd @@ -99,7 +99,7 @@ from pprint import pprint from textwrap import dedent ``` -## Extract, then edit resource properties from raw data +## Extract resource properties from raw data Because the resource's properties are useful for many later functions, let's first get that created and ready to go. While you can create a @@ -124,6 +124,12 @@ pprint(resource_properties) You may be able to see that some things are missing, for instance, the individual columns (called fields) don't have any descriptions. We have to manually add this ourselves. 
+We can run a check on the properties to confirm what is missing: + +```{python} +#| error: true +print(sp.check_resource_properties(resource_properties)) +``` Let's fill in the description for all the fields in the resource: From b5f487d8d4c958306f09910f4525700ef587290c Mon Sep 17 00:00:00 2001 From: "Luke W. Johnston" Date: Mon, 24 Feb 2025 18:41:05 +0100 Subject: [PATCH 07/12] docs: :pencil2: need to use mkdir to make the folder --- docs/guide/resources.qmd | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd index 2f2868e72..2aed17ff0 100644 --- a/docs/guide/resources.qmd +++ b/docs/guide/resources.qmd @@ -30,9 +30,11 @@ import tempfile from urllib.request import urlretrieve temp_path = tempfile.TemporaryDirectory() +package_path = temp_path / "diabetes-study" +package_path.mkdir() package_path = sp.create_package_properties( properties=sp.example_package_properties(), - path=temp_path / "diabetes-study" + path=package_path ) readme = sp.build_readme_text(sp.example_package_properties()) sp.write_text(readme, package_path.parent) From 6a6df8bb6810817b1e882d44169a8fbb56d4dcda Mon Sep 17 00:00:00 2001 From: "Luke W. Johnston" Date: Thu, 6 Mar 2025 19:59:32 +0100 Subject: [PATCH 08/12] docs: :pencil2: suggestions from review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Signe Kirk Brødbæk <40836345+signekb@users.noreply.github.com> --- docs/guide/resources.qmd | 56 +++++++++++++++++++--------------------- 1 file changed, 27 insertions(+), 29 deletions(-) diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd index 2aed17ff0..4bbcba56a 100644 --- a/docs/guide/resources.qmd +++ b/docs/guide/resources.qmd @@ -10,7 +10,7 @@ In each [data package](/docs/design/interface/outputs.qmd) are [data resources](/docs/design/interface/outputs.qmd), which contain a conceptually standalone set of data. 
This page shows you how to create and manage data resources inside a data package using Sprout. We assume -that a data package has already been [created](packages.qmd). +that you've already [created a data package](packages.qmd). {{< include _preamble.qmd >}} @@ -26,13 +26,14 @@ data. #| include: false # This `setup` code chunk loads packages and prepares the data. import seedcase_sprout.core as sp -import tempfile +from tempfile import mkdtemp +from pathlib import Path from urllib.request import urlretrieve -temp_path = tempfile.TemporaryDirectory() +temp_path = Path(mkdtemp()) package_path = temp_path / "diabetes-study" package_path.mkdir() -package_path = sp.create_package_properties( +sp.create_package_properties( properties=sp.example_package_properties(), path=package_path ) @@ -52,9 +53,10 @@ urlretrieve( ) ``` -Making a data resource requires that you actually have data that can be -a resource in the first place. Generated or collected data always starts -out in a bit of a "raw" shape that needs some working. For this guide, +```suggestion +Making a data resource requires that you have data that can be made into +a resource in the first place. Usually, generated or collected data always starts +out in a bit of a "raw" shape that needs some working. This work needs to be done before adding the data as a data package, since Sprout assumes that the data is already [tidy](https://design.seedcase-project.org/data/). For this guide, we have a raw (but fake) data file that we've already made tidy and that looks like: @@ -64,7 +66,8 @@ with open(raw_data_path, "r") as f: print(f.read()) ``` -We've saved this data file in a path object called `raw_data_path`: +If you want to follow this guide with the same data, you can find it [here](). 
+We'll save this data in a path called `raw_data_path`: ```{python} print(raw_data_path) @@ -103,18 +106,16 @@ from textwrap import dedent ## Extract resource properties from raw data -Because the resource's properties are useful for many later functions, -let's first get that created and ready to go. While you can create a +We'll start by creating the resource's properties. This is because the resource's properties will be used for checking that the actual data matches the properties later on. While you can create a resource properties object manually using `ResourceProperties`, it can be -quite intensive and time-consuming if you for example have many columns in your data. The better and easier approach is to +quite intensive and time-consuming if you for example have many columns in your data. To ease this process, you can extract as much information as possible from the raw data to create an initial resource properties object with `extract_resource_properties()`. Then, you can edit the properties as needed. Let's start with extracting the resource properties from the raw data. -The function is fairly good at getting and guessing the right -information, but it is very far from perfect and it cannot guess things that are not in the data itself, like a description of what the data contains or the unit of the data. +While this function tries to infer the data types in the raw data, it might not get it right. So, be sure to check the properties after using this function. It can also not infer things that are not in the data itself, like a description of what the data contains or the unit of the data. ```{python} resource_properties = extract_resource_properties( @@ -124,7 +125,7 @@ pprint(resource_properties) ``` You may be able to see that some things are missing, for instance, the -individual columns (called fields) don't have any descriptions. We have to +individual columns (called `fields`) don't have any descriptions. We have to manually add this ourselves. 
We can run a check on the properties to confirm what is missing: @@ -144,18 +145,18 @@ the fields in the resource: ## Creating a data resource Now that we have the properties for the resource, we can create the -resource itself within the data package. What this means is that we want +resource itself within the data package. What this means is that we will create a folder for the specific resource (since we may have more data resources to add). -Our package has already been created (using the steps from the [package +We've already create a package (using the steps from the [package guide](packages.qmd)), with the path set as the variable `package_path`: ```{python} print(package_path) ``` -We can look inside that path to see the current files and folders: +Let's take a look at the current files and folders in the data package: ```{python} print(list(package_path.glob("*"))) @@ -165,14 +166,12 @@ This shows that the data package already includes a `datapackage.json` file and this package: ```{python} -resource_paths=sp.create_resource_structure( +sp.create_resource_structure( path=package_path ) -print(resource_paths) ``` -With the the resource folder structure created, we are now ready to fill it -with our raw data! Next, we'll set up the resource properties so that it +Next, we'll set up the resource properties so that it is ready to be saved into the `datapackage.json` file. We can use the `path_resource()` helper function to always give us the correct location to the specific resource's folder path. In this case, our resource is @@ -197,9 +196,8 @@ print(sp.list_resources()) ``` ::: -This has set up the properties to be ready to add to the -`datapackage.json` file. Next, we save that properties file into the -`datapackage.json` file by writing it to the `datapackage.json` file: +Now, the resource properties are ready to be added to the +`datapackage.json` file. 
To do this, we can use the `write_resource_properties()` function: ```{python} sp.write_resource_properties( @@ -208,7 +206,7 @@ sp.write_resource_properties( ) ``` -We can check the contents of the `datapackage.json` file to see that the +Let's check the contents of the `datapackage.json` file to see that the resource properties have been added: ```{python} @@ -246,7 +244,7 @@ print(sp.path_resource_raw_files(1)) Now that we've stored the raw data file, we can build the Parquet file that will be used as the data resource. This Parquet file is built from -the raw data file that we've stored in the resource's folder. +the all the data in the `raw/` folder. Since we only have one raw data file stored in the resource's folder, only this will be used to build the data resource's parquet file: ```{python} parquet_path = sp.build_resource_parquet( @@ -291,10 +289,10 @@ sp.write_text( After having created a resource, you may need to make edits to the properties. While technically you can do this manually by opening up the `datapackage.json` file and editing it, we've made these functions to -help do it in an easier way that ensures that the `datapackage.json` is still in a correct json format. Using the +help do it in a way that ensures that the `datapackage.json` is still in a correct json format. Using the `edit_resource_properties()` function, you give it the path to the current properties and then create a new `ResourceProperties` object -with any changes you want to make. Anything in the new properties object +with the changes you want to make. Anything in the new properties object will overwrite fields in the old properties object. This function does not write back, it only returns the new properties object. 
@@ -309,7 +307,7 @@ resource_properties = sp.edit_resource_properties( pprint(resource_properties) ``` -To write back, you use the `write_resource_properties()` function: +To write back, use the `write_resource_properties()` function: ```{python} sp.write_resource_properties( From 3aa3bdd0027f54ca00ad16034f985d5a26343814 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Thu, 6 Mar 2025 18:59:45 +0000 Subject: [PATCH 09/12] chore(pre-commit): :pencil2: automatic fixes --- docs/guide/resources.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd index 4bbcba56a..f696e26bb 100644 --- a/docs/guide/resources.qmd +++ b/docs/guide/resources.qmd @@ -66,7 +66,7 @@ with open(raw_data_path, "r") as f: print(f.read()) ``` -If you want to follow this guide with the same data, you can find it [here](). +If you want to follow this guide with the same data, you can find it [here](). We'll save this data in a path called `raw_data_path`: ```{python} From e9a4d9e2fa4b809b0ebdd2c960062b5a4b5bb55a Mon Sep 17 00:00:00 2001 From: "Luke W. Johnston" Date: Thu, 6 Mar 2025 20:40:36 +0100 Subject: [PATCH 10/12] docs: :memo: address comments from review --- docs/guide/resources.qmd | 242 +++++++++++++++++++++------------------ 1 file changed, 128 insertions(+), 114 deletions(-) diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd index f696e26bb..85b98035a 100644 --- a/docs/guide/resources.qmd +++ b/docs/guide/resources.qmd @@ -9,8 +9,8 @@ execute: In each [data package](/docs/design/interface/outputs.qmd) are [data resources](/docs/design/interface/outputs.qmd), which contain a conceptually standalone set of data. This page shows you how to create -and manage data resources inside a data package using Sprout. We assume -that you've already [created a data package](packages.qmd). +and manage data resources inside a data package using Sprout. 
You will +need to have already [created a data package](packages.qmd). {{< include _preamble.qmd >}} @@ -22,6 +22,22 @@ that you have a record of the steps taken to clean and transform the data. ::: +Putting your raw data into a data package makes it easier for yourself +and others to use later one. So the steps you'll take to get this raw +data into the structure offered by Sprout are: + +1. Create the properties for the resource, using the original raw data + as a starting point and edit as needed. +2. Create a folder to store the (processed) data resource in your + package, as well as having a folder for the (tidy) raw data. +3. Save the properties of and path to the new data resource into the + `datapackage.json` file. +4. Re-build the data package's `README.md` file from the updated + `datapackage.json` file. +5. If you need to edit the properties at a later point, you can use + `edit_resource_properties()` and then re-build the + `datapackage.json` file. + ```{python setup} #| include: false # This `setup` code chunk loads packages and prepares the data. @@ -38,7 +54,7 @@ sp.create_package_properties( path=package_path ) readme = sp.build_readme_text(sp.example_package_properties()) -sp.write_text(readme, package_path.parent) +sp.write_file(readme, package_path.parent) # Since the path leads to the datapackage.json file, for later functions we need the folder instead. package_path = package_path.parent @@ -53,12 +69,13 @@ urlretrieve( ) ``` -```suggestion Making a data resource requires that you have data that can be made into -a resource in the first place. Usually, generated or collected data always starts -out in a bit of a "raw" shape that needs some working. This work needs to be done before adding the data as a data package, since Sprout assumes that the data is already [tidy](https://design.seedcase-project.org/data/). For this guide, -we have a raw (but fake) data file that we've already made tidy and that -looks like: +a resource in the first place. 
Usually, generated or collected data +always starts out in a bit of a "raw" shape that needs some working. +This work needs to be done before adding the data as a data package, +since Sprout assumes that the data is already +[tidy](https://design.seedcase-project.org/data/). For this guide, we +use a raw (but fake) data file that is already tidy and that looks like: ```{python} #| echo: false @@ -66,32 +83,18 @@ with open(raw_data_path, "r") as f: print(f.read()) ``` -If you want to follow this guide with the same data, you can find it [here](). -We'll save this data in a path called `raw_data_path`: +If you want to follow this guide with the same data, you can find it +[here](https://raw.githubusercontent.com/seedcase-project/data/refs/heads/main/patients/patients.csv). +The path to this data, which is stored in a variable called +`raw_data_path`, is: ```{python} +#| echo: false print(raw_data_path) ``` -Putting your raw data into a data -package makes it easier for yourself and others to use later one. So -the steps we'll take to get this raw data into the structure offered by -Sprout are: - -1. Create the properties for the resource, using the original raw data - as a starting point and edit as needed. -2. Create a folder to store the (processed) data resource in our - package, as well as having a folder for the (tidy) raw data. -3. Save the properties of and path to the new data resource - into the `datapackage.json` file. -4. Re-build the data package's `README.md` file from the updated - `datapackage.json` file. -5. If you need to edit the properties at a later point, you can use - `edit_resource_properties()` and then re-build the - `datapackage.json` file. 
- -Before we start, we need to import Sprout as well as other -helper packages: +Before we start, you need to import Sprout as well as other helper +packages: ```{python} import seedcase_sprout.core as sp @@ -106,16 +109,23 @@ from textwrap import dedent ## Extract resource properties from raw data -We'll start by creating the resource's properties. This is because the resource's properties will be used for checking that the actual data matches the properties later on. While you can create a -resource properties object manually using `ResourceProperties`, it can be -quite intensive and time-consuming if you for example have many columns in your data. To ease this process, you can -extract as much information as possible from the raw data to create an -initial resource properties object with -`extract_resource_properties()`. Then, you can edit the properties as -needed. - -Let's start with extracting the resource properties from the raw data. -While this function tries to infer the data types in the raw data, it might not get it right. So, be sure to check the properties after using this function. It can also not infer things that are not in the data itself, like a description of what the data contains or the unit of the data. +You'll start by creating the resource's properties. Before you can have +data stored in a data package, it needs metadata (called properties) on +the data. The resource's properties are what allow people to more easily +use the data in the data package, as well as being used to check that +the actual data matches the properties. While you can create a resource +properties object manually using `ResourceProperties`, it can be quite +intensive and time-consuming if you, for example, have many columns in +your data. To ease this process, you can extract as much information as +possible from the raw data to create an initial resource properties +object with `extract_resource_properties()`. Then, you can edit the +properties as needed. 
+ +Start with extracting the resource properties from the raw data. While +this function tries to infer the data types in the raw data, it might +not get it right. So, be sure to check the properties after using this +function. It can also not infer things that are not in the data itself, +like a description of what the data contains or the unit of the data. ```{python} resource_properties = extract_resource_properties( @@ -125,32 +135,33 @@ pprint(resource_properties) ``` You may be able to see that some things are missing, for instance, the -individual columns (called `fields`) don't have any descriptions. We have to -manually add this ourselves. -We can run a check on the properties to confirm what is missing: +individual columns (called `fields`) don't have any descriptions. You +will have to manually add this yourself. You can run a check on the +properties to confirm what is missing: ```{python} #| error: true print(sp.check_resource_properties(resource_properties)) ``` -Let's fill in the description for all -the fields in the resource: +Time to fill in the description for all the fields in the resource: ```{python} # TODO: Need to consider/design how editing can be done in an easier, user-friendly way. # TODO: Add more detail when we know what can and can't be extracted. +# TODO: Set the path field here? ``` ## Creating a data resource -Now that we have the properties for the resource, we can create the -resource itself within the data package. What this means is that we will create -a folder for the specific resource (since we may have more data -resources to add). +Now that you have the properties for the resource, you can create the +resource itself within the data package. What this means is that you +will create a folder for the specific resource (since you may have more +data resources to add). 
-We've already create a package (using the steps from the [package -guide](packages.qmd)), with the path set as the variable `package_path`: +We assume you've already create a package (either by using the steps +from the [package guide](packages.qmd) or started making one for your +own data), with the path set as the variable `package_path`: ```{python} print(package_path) @@ -159,30 +170,38 @@ print(package_path) Let's take a look at the current files and folders in the data package: ```{python} -print(list(package_path.glob("*"))) +#| echo: false +print(package_path.glob("**/*")) ``` -This shows that the data package already includes a `datapackage.json` file and a `README.md` file. Now, we will create the resource structure in -this package: +This shows that the data package already includes a `datapackage.json` +file and a `README.md` file. Now to create the resource structure in +this package, using the helper `path_resources()` function to give the +correct path to the resources folder. The default behaviour of +`path_resources()` is to use the current working directory, but for this +guide you'll have to use the `path` argument to point to where the +package is stored in the temporary folder. ```{python} +# TODO: This doesn't work exactly as expected here. We need to create +# the `resources/` folder first. sp.create_resource_structure( - path=package_path + path=sp.path_resources(path=package_path) ) ``` -Next, we'll set up the resource properties so that it -is ready to be saved into the `datapackage.json` file. We can use the -`path_resource()` helper function to always give us the correct location -to the specific resource's folder path. In this case, our resource is -the first one in the package, so we can use `path_resource(1)`. +Next step is to set up the resource properties so that it gets checked +and saved into the `datapackage.json` file. 
You can use the +`path_properties()` helper function to always give you the correct +location to the `datapackage.json` path. Here, the `path` argument +points to the temporary folder where the package is stored. ```{python} -resource_properties = sp.create_resource_properties( - properties=resource_properties, - path=package_path / sp.path_resource(1) +# TODO: This function needs to be updated to write to data package. +sp.create_resource_properties( + path=sp.path_properties(path=package_path), + properties=resource_properties ) -pprint(resource_properties) ``` ::: callout-tip @@ -192,37 +211,30 @@ you can use the `list_resources()` function. ```{python} #| eval: false -print(sp.list_resources()) +# TODO: This could also be the `path_resources()` which lists the resources. +print(sp.list_resources(path=package_path)) ``` ::: -Now, the resource properties are ready to be added to the -`datapackage.json` file. To do this, we can use the `write_resource_properties()` function: - -```{python} -sp.write_resource_properties( - properties=resource_properties, - path=sp.path_properties() -) -``` - Let's check the contents of the `datapackage.json` file to see that the resource properties have been added: ```{python} -pprint(sp.read_properties(package_path / sp.path_properties())) +pprint(sp.read_properties(sp.path_properties(path=package_path)) ``` ## Storing a backup of the raw data -Before we start processing the raw data into a Parquet file, it is a -good idea to store a backup of the raw data. This is useful if you need -to re-process the data at a later point, troubleshoot any issues, update -incorrect values, or if you need to compare the stored raw data to your -original raw data. +When you create a new data resource, or add data to an existing one, +Sprout has been designed to always store a backup of each added raw +data. 
All the raw data is stored in a folder called `raw/` within the +resource's folder and is processed into the final Parquet data resource +file. This can be useful if you ever need to re-process the data at a +later point, troubleshoot any issues, update incorrect values, or if you +need to compare the stored raw data to your original raw data. -As we showed above, the data is stored in the path that we've set as -`raw_data_path`. We can store this data in the resource's folder by +As shown above, the data is stored in the path that we've set as +`raw_data_path`. Time to store this data in the resource's folder by using: ```{python} @@ -233,54 +245,55 @@ sp.write_resource_data_to_raw( ``` This function uses the properties object to determine where to store the -raw data, which is in the `raw/` folder of the resource's folder. We can -check the newly added file by using: +raw data, which is in the `raw/` folder of the resource's folder. You +can check the newly added file by using: ```{python} -print(sp.path_resource_raw_files(1)) +print(sp.path_resource_raw_files(1, path=package_path)) ``` ## Building the Parquet data resource file -Now that we've stored the raw data file, we can build the Parquet file +Now that you've stored the raw data file, you can build the Parquet file that will be used as the data resource. This Parquet file is built from -the all the data in the `raw/` folder. Since we only have one raw data file stored in the resource's folder, only this will be used to build the data resource's parquet file: +the all the data in the `raw/` folder. 
Since there is only one raw data +file stored in the resource's folder, only this one will be used to +build the data resource's parquet file: ```{python} -parquet_path = sp.build_resource_parquet( - raw_files=sp.path_resource_raw_files(1), - path=sp.path_resource_data(1) +sp.build_resource_parquet( + raw_files=sp.path_resource_raw_files(1, path=package_path), + path=sp.path_resource_data(1, path=package_path) ) -print(parquet_path) ``` ::: callout-tip -If you add more raw data to the resource later on, you can update this Parquet file to include all data in the raw folder using the `build_resource_parquet()` function like shown above. +If you add more raw data to the resource later on, you can update this +Parquet file to include all data in the raw folder using the +`build_resource_parquet()` function like shown above. ::: ## Re-building the README file -One of the last steps to finish adding a new data resource is to -re-build the `README.md` file for the data package. To allow some -flexibility with what gets added to the README text, this next function -will only *build the text*, but not write it to the file. This allows -you to add additional information to the README text before writing it -to the file. +One of the last steps to adding a new data resource is to re-build the +`README.md` file for the data package. To allow some flexibility with +what gets added to the README text, this next function will only *build +the text*, but not write it to the file. This allows you to add +additional information to the README text before writing it to the file. ```{python} readme_text = sp.build_readme_text( - properties=sp.read_properties(package_path / sp.path_properties()) + properties=sp.read_properties(sp.path_properties(path=package_path)) ) ``` -In this case, we don't want to add anything else, so we'll write the -text to the `README.md` file: +For this guide, you'll only use the default text and not add anything +else to it. 
Next you write the text to the `README.md` file by: ```{python} -sp.write_text( +sp.write_file( text=readme_text, - # TODO: Make a helper function for this path? - path=package_path / "README.md" + path=sp.path_readme(path=package_path) ) ``` @@ -288,18 +301,19 @@ sp.write_text( After having created a resource, you may need to make edits to the properties. While technically you can do this manually by opening up the -`datapackage.json` file and editing it, we've made these functions to -help do it in a way that ensures that the `datapackage.json` is still in a correct json format. Using the -`edit_resource_properties()` function, you give it the path to the -current properties and then create a new `ResourceProperties` object -with the changes you want to make. Anything in the new properties object -will overwrite fields in the old properties object. This function does -not write back, it only returns the new properties object. +`datapackage.json` file and editing it, we strongly recommend you use +the functions to do this. These functions help to ensure that the +`datapackage.json` is still in a correct JSON format and has the +correct fields filled in. Using the `edit_resource_properties()` +function, you give it the path to the current properties and then create +a new `ResourceProperties` object with the changes you want to make. +Anything in the new properties object will overwrite fields in the old +properties object. This function does not write back; it only returns +the new properties object. 
```{python} resource_properties = sp.edit_resource_properties( - # Helper function - path=sp.path_properties(), + path=sp.path_properties(path=package_path), properties=sp.ResourceProperties( title="Basic characteristics of patients" ) @@ -312,7 +326,7 @@ To write back, use the `write_resource_properties()` function: ```{python} sp.write_resource_properties( properties=resource_properties, - path=sp.path_properties() + path=sp.path_properties(path=package_path) ) ``` From 3132b61eb5c66ac03cf21025f041776a2962a795 Mon Sep 17 00:00:00 2001 From: "Luke W. Johnston" Date: Wed, 12 Mar 2025 18:38:15 +0100 Subject: [PATCH 11/12] docs: :pencil2: suggestions from review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Signe Kirk Brødbæk <40836345+signekb@users.noreply.github.com> --- docs/guide/resources.qmd | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd index 85b98035a..d049f8384 100644 --- a/docs/guide/resources.qmd +++ b/docs/guide/resources.qmd @@ -23,13 +23,13 @@ data. ::: Putting your raw data into a data package makes it easier for yourself -and others to use later one. So the steps you'll take to get this raw +and others to use later one. So the steps you'll take to get your data into the structure offered by Sprout are: 1. Create the properties for the resource, using the original raw data as a starting point and edit as needed. 2. Create a folder to store the (processed) data resource in your - package, as well as having a folder for the (tidy) raw data. + package, as well as having a folder for the (tidy) batch data. 3. Save the properties of and path to the new data resource into the `datapackage.json` file. 4. 
Re-build the data package's `README.md` file from the updated @@ -110,7 +110,7 @@ from textwrap import dedent ## Extract resource properties from raw data You'll start by creating the resource's properties. Before you can have -data stored in a data package, it needs metadata (called properties) on +data stored in a data package, it needs properties (i.e., metadata) on the data. The resource's properties are what allow people to more easily use the data in the data package, as well as being used to check that the actual data matches the properties. While you can create a resource @@ -159,9 +159,9 @@ resource itself within the data package. What this means is that you will create a folder for the specific resource (since you may have more data resources to add). -We assume you've already create a package (either by using the steps +We assume you've already created a package (either by using the steps from the [package guide](packages.qmd) or started making one for your -own data), with the path set as the variable `package_path`: +own data), with the path to the data package set as the variable `package_path`: ```{python} print(package_path) @@ -175,8 +175,8 @@ print(package_path.glob("**/*")) ``` This shows that the data package already includes a `datapackage.json` -file and a `README.md` file. Now to create the resource structure in -this package, using the helper `path_resources()` function to give the +file and a `README.md` file. Now you'll add the resource structure to +the package, using the helper function `path_resources()` to give the correct path to the resources folder. The default behaviour of `path_resources()` is to use the current working directory, but for this guide you'll have to use the `path` argument to point to where the @@ -190,14 +190,15 @@ sp.create_resource_structure( ) ``` -Next step is to set up the resource properties so that it gets checked -and saved into the `datapackage.json` file. 
You can use the -`path_properties()` helper function to always give you the correct +The next step is to add the resource properties to the `datapackage.json` file. +Before they are added, they will be checked to confirm that they are correctly +filled in and that no required fields are missing. You can use the +`path_properties()` helper function to give you the location to the `datapackage.json` path. Here, the `path` argument points to the temporary folder where the package is stored. ```{python} -# TODO: This function needs to be updated to write to data package. +# TODO: This function needs to be updated to write to datapackage.json sp.create_resource_properties( path=sp.path_properties(path=package_path), properties=resource_properties From d02c4fa07403a5da4af8994d603b5894724cf695 Mon Sep 17 00:00:00 2001 From: "Luke W. Johnston" Date: Wed, 12 Mar 2025 18:43:51 +0100 Subject: [PATCH 12/12] docs: :memo: revise the "raw" section to use "batch" --- docs/guide/resources.qmd | 51 +++++++++++++++++++++------------------- 1 file changed, 27 insertions(+), 24 deletions(-) diff --git a/docs/guide/resources.qmd b/docs/guide/resources.qmd index d049f8384..378e15d98 100644 --- a/docs/guide/resources.qmd +++ b/docs/guide/resources.qmd @@ -23,8 +23,8 @@ data. ::: Putting your raw data into a data package makes it easier for yourself -and others to use later one. So the steps you'll take to get your -data into the structure offered by Sprout are: +and others to use later one. So the steps you'll take to get your data +into the structure offered by Sprout are: 1. Create the properties for the resource, using the original raw data as a starting point and edit as needed. @@ -161,7 +161,8 @@ data resources to add). 
We assume you've already created a package (either by using the steps from the [package guide](packages.qmd) or started making one for your -own data), with the path to the data package set as the variable `package_path`: +own data), with the path to the data package set as the variable +`package_path`: ```{python} print(package_path) @@ -190,12 +191,12 @@ sp.create_resource_structure( ) ``` -The next step is to add the resource properties to the `datapackage.json` file. -Before they are added, they will be checked to confirm that they are correctly -filled in and that no required fields are missing. You can use the -`path_properties()` helper function to give you the -location to the `datapackage.json` path. Here, the `path` argument -points to the temporary folder where the package is stored. +The next step is to add the resource properties to the +`datapackage.json` file. Before they are added, they will be checked to +confirm that they are correctly filled in and that no required fields +are missing. You can use the `path_properties()` helper function to give +you the location to the `datapackage.json` path. Here, the `path` +argument points to the temporary folder where the package is stored. ```{python} # TODO: This function needs to be updated to write to datapackage.json @@ -224,53 +225,55 @@ resource properties have been added: pprint(sp.read_properties(sp.path_properties(path=package_path)) ``` -## Storing a backup of the raw data +## Storing a backup of the raw data as a "batch" file When you create a new data resource, or add data to an existing one, -Sprout has been designed to always store a backup of each added raw -data. All the raw data is stored in a folder called `raw/` within the +Sprout has been designed to always store a backup of each time you add +new (or modified) data to a specific resource as a "batch" file. 
All the +batch data files are stored in a folder called `batch/` within the resource's folder and is processed into the final Parquet data resource file. This can be useful if you ever need to re-process the data at a later point, troubleshoot any issues, update incorrect values, or if you -need to compare the stored raw data to your original raw data. +need to compare the stored batch data to your original raw data (before +it enters into the data resource). As shown above, the data is stored in the path that we've set as `raw_data_path`. Time to store this data in the resource's folder by using: ```{python} -sp.write_resource_data_to_raw( +sp.write_resource_batch_data( data_path=raw_data_path, resource_properties=resource_properties ) ``` This function uses the properties object to determine where to store the -raw data, which is in the `raw/` folder of the resource's folder. You -can check the newly added file by using: +data as a batch file, which is in the `batch/` folder of the resource's +folder. You can check the newly added file by using: ```{python} -print(sp.path_resource_raw_files(1, path=package_path)) +print(sp.path_resource_batch_files(1, path=package_path)) ``` ## Building the Parquet data resource file -Now that you've stored the raw data file, you can build the Parquet file -that will be used as the data resource. This Parquet file is built from -the all the data in the `raw/` folder. Since there is only one raw data -file stored in the resource's folder, only this one will be used to -build the data resource's parquet file: +Now that you've stored the data as a batch file, you can build the +Parquet file that will be used as the data resource. This Parquet file +is built from the all the data in the `batch/` folder. 
Since there is +only one batch data file stored in the resource's folder, only this one +will be used to build the data resource's parquet file: ```{python} sp.build_resource_parquet( - raw_files=sp.path_resource_raw_files(1, path=package_path), + raw_files=sp.path_resource_batch_files(1, path=package_path), path=sp.path_resource_data(1, path=package_path) ) ``` ::: callout-tip If you add more raw data to the resource later on, you can update this -Parquet file to include all data in the raw folder using the +Parquet file to include all data in the batch folder using the `build_resource_parquet()` function like shown above. :::