docs: 📝 add start of the creating resources guide #810

Open
wants to merge 20 commits into
base: main
Changes from 14 commits
Commits
e1f2209
docs: :construction: draft of creating and managing resources
lwjohnst86 Oct 9, 2024
8a309c3
Merge branch 'main' into docs/guide-for-managing-resources
signekb Oct 24, 2024
b1fe1c8
Merge branch 'main' of https://github.com/seedcase-project/seedcase-s…
lwjohnst86 Nov 4, 2024
f748490
Merge branch 'main' of https://github.com/seedcase-project/seedcase-s…
lwjohnst86 Nov 8, 2024
54e61be
Merge branch 'docs/guide-for-managing-resources' of https://github.co…
lwjohnst86 Nov 11, 2024
f2cab3e
Merge branch 'docs/guide-for-managing-resources' of https://github.co…
lwjohnst86 Dec 4, 2024
3562449
Merge branch 'main' of https://github.com/seedcase-project/seedcase-s…
lwjohnst86 Feb 17, 2025
43cc868
docs: :memo: draft of resource guide
lwjohnst86 Feb 18, 2025
a945000
docs: :pencil2: clarifications from review
lwjohnst86 Feb 20, 2025
426a19a
chore(pre-commit): :pencil2: automatic fixes
pre-commit-ci[bot] Feb 20, 2025
1f0bceb
Merge branch 'main' of https://github.com/seedcase-project/seedcase-s…
lwjohnst86 Feb 20, 2025
e34527a
docs: :pencil2: clarifications from review
lwjohnst86 Feb 20, 2025
397796a
Merge branch 'docs/guide-for-managing-resources' of https://github.co…
lwjohnst86 Feb 20, 2025
741edcc
docs: :memo: small updates to resource guide
lwjohnst86 Feb 20, 2025
b5f487d
docs: :pencil2: need to use mkdir to make the folder
lwjohnst86 Feb 24, 2025
1fb3537
Merge branch 'main' into docs/guide-for-managing-resources
signekb Feb 28, 2025
6a6df8b
docs: :pencil2: suggestions from review
lwjohnst86 Mar 6, 2025
3aa3bdd
chore(pre-commit): :pencil2: automatic fixes
pre-commit-ci[bot] Mar 6, 2025
dd79dd7
Merge branch 'main' of https://github.com/seedcase-project/seedcase-s…
lwjohnst86 Mar 6, 2025
e9a4d9e
docs: :memo: address comments from review
lwjohnst86 Mar 6, 2025
322 changes: 322 additions & 0 deletions docs/guide/resources.qmd
@@ -0,0 +1,322 @@
---
title: "Creating and managing data resources"
order: 2
jupyter: python3
execute:
eval: false
---

Each [data package](/docs/design/interface/outputs.qmd) contains [data
resources](/docs/design/interface/outputs.qmd), each of which holds a
conceptually standalone set of data. This page shows you how to create
and manage data resources inside a data package using Sprout. We assume
that a data package has already been [created](packages.qmd).

{{< include _preamble.qmd >}}

::: callout-important
Data resources can only be created from [tidy
data](https://design.seedcase-project.org/data/). Before you can store
your data, you need to process it into a tidy format, ideally using Python so
that you have a record of the steps taken to clean and transform the
data.
:::
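
As a rough illustration of what "tidy" means here, the sketch below reshapes a small wide table (one column per visit) into a long one (one row per observation) using only the standard library. The column names and values are made up for illustration and are not part of the example data used in this guide.

```python
import csv
import io

# Hypothetical wide data: one column per visit (made up for illustration).
wide_csv = """id,weight_visit1,weight_visit2
1,80.5,79.0
2,65.2,66.1
"""

tidy_rows = []
for row in csv.DictReader(io.StringIO(wide_csv)):
    for column, value in row.items():
        if column == "id":
            continue
        # Split "weight_visit1" into the variable name and the visit number.
        variable, visit = column.split("_visit")
        tidy_rows.append(
            {"id": int(row["id"]), "visit": int(visit), variable: float(value)}
        )

for tidy_row in tidy_rows:
    print(tidy_row)
```

Each row now describes one participant at one visit, which is the shape that is easiest to describe with field-level properties later on.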

```{python setup}
#| include: false
# This `setup` code chunk loads packages and prepares the data.
import seedcase_sprout.core as sp
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# `TemporaryDirectory()` returns an object, not a path, so build paths from its `.name`.
temp_path = tempfile.TemporaryDirectory()
package_path = sp.create_package_properties(
    properties=sp.example_package_properties(),
    path=Path(temp_path.name) / "diabetes-study"
)
readme = sp.build_readme_text(sp.example_package_properties())
# Write to the README file itself, matching how `write_text()` is used later in this guide.
sp.write_text(readme, package_path.parent / "README.md")

# Since the path leads to the datapackage.json file, for later functions we need the folder instead.
package_path = package_path.parent

# TODO: Maybe eventually move this over into Sprout as an example dataset, rather than via a URL.
# Download the example data and save it in the temp path.
url = "https://raw.githubusercontent.com/seedcase-project/data/refs/heads/main/patients/patients.csv"
raw_data_path = Path(temp_path.name) / "patients.csv"
urlretrieve(
    url,
    raw_data_path
)
```

To make a data resource, you first need data that can become a
resource. Generated or collected data usually starts
out in a "raw" shape that needs some work. For this guide,
we use a raw (but fake) data file that we've already made tidy. It
looks like this:

```{python}
#| echo: false
with open(raw_data_path, "r") as f:
    print(f.read())
```

We've saved this data file in a path object called `raw_data_path`:

```{python}
print(raw_data_path)
```

Putting your raw data into a data
package makes it easier for yourself and others to use later on. The
steps we'll take to get this raw data into the structure offered by
Sprout are:

1. Create the properties for the resource, using the original raw data
   as a starting point and editing them as needed.
2. Create a folder to store the (processed) data resource in our
   package, as well as a folder for the (tidy) raw data.
3. Save the properties of and path to the new data resource
   into the `datapackage.json` file.
4. Re-build the data package's `README.md` file from the updated
   `datapackage.json` file.
5. If you need to edit the properties at a later point, you can use
   `edit_resource_properties()` and then write the updated properties
   back to the `datapackage.json` file.

Before we start, we need to import Sprout as well as other
helper packages:

```{python}
import seedcase_sprout.core as sp

# For pretty printing of output
from pprint import pprint

# TODO: This could be a wrapper helper function instead
# To be able to write multiline strings without indentation
from textwrap import dedent
```

## Extract resource properties from raw data

Because the resource's properties are used by many later functions,
let's create them first. While you can create a
resource properties object manually using `ResourceProperties`, doing so
can be time-consuming if, for example, your data has many columns. The
easier approach is to
extract as much information as possible from the raw data to create an
initial resource properties object with
`extract_resource_properties()`. Then, you can edit the properties as
needed.

Let's start with extracting the resource properties from the raw data.
The function is fairly good at guessing the right
information, but it is far from perfect and it cannot guess things that
are not in the data itself, like a description of what the data contains
or the unit of the data.

```{python}
resource_properties = sp.extract_resource_properties(
    data_path=raw_data_path
)
pprint(resource_properties)
```
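
To give a feel for the kind of guessing involved, here is a minimal sketch that infers a type for each column of a CSV file using the standard library. It is not how `extract_resource_properties()` is implemented; the `guess_type()` helper, the column names, and the type labels are assumptions made for illustration.

```python
import csv
import io


def guess_type(values: list[str]) -> str:
    """Guess a simple type label for a column from its string values."""
    for label, converter in (("integer", int), ("number", float)):
        try:
            for value in values:
                converter(value)
            return label
        except ValueError:
            continue
    return "string"


# Made-up data, not the example file used in this guide.
raw_csv = """id,height,city
1,172.5,Aarhus
2,180.0,Odense
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))
fields = [
    {"name": name, "type": guess_type([row[name] for row in rows])}
    for name in rows[0]
]
print(fields)
```

Note that nothing in the file says what `height` is measured in or what the table is about, which is why descriptions and units still have to be added by hand.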

You may notice that some things are missing. For instance, the
individual columns (called fields) don't have any descriptions, so we
have to add these manually.
We can run a check on the properties to confirm what is missing:

```{python}
#| error: true
print(sp.check_resource_properties(resource_properties))
```

Let's fill in the description for all
the fields in the resource:

```{python}
# TODO: Need to consider/design how editing can be done in an easier, user-friendly way.
# TODO: Add more detail when we know what can and can't be extracted.
```
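
Until the editing workflow is settled, the general idea can be sketched with plain dictionaries instead of Sprout objects. The field names and descriptions below are invented, and the dictionary layout only mirrors the `name`, `type`, and `description` keys used for fields in Data Package descriptors; it is not Sprout's API.

```python
# Hypothetical fields, as they might look after extraction (no descriptions yet).
fields = [
    {"name": "id", "type": "integer"},
    {"name": "height", "type": "number"},
]

# Descriptions written by hand, keyed by field name.
descriptions = {
    "id": "Unique identifier for each participant.",
    "height": "Height of the participant in centimetres.",
}

# Build new field dictionaries rather than modifying the originals in place.
described_fields = [
    {**field, "description": descriptions[field["name"]]} for field in fields
]

# Confirm that every field now has a non-empty description.
missing = [f["name"] for f in described_fields if not f.get("description")]
print(missing)
```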

## Creating a data resource

Now that we have the properties for the resource, we can create the
resource itself within the data package. What this means is that we want
a folder for the specific resource (since we may have more data
resources to add).

Our package has already been created (using the steps from the [package
guide](packages.qmd)), with the path set as the variable `package_path`:

```{python}
print(package_path)
```

We can look inside that path to see the current files and folders:

```{python}
print(list(package_path.glob("*")))
```

This shows that the data package already includes a `datapackage.json` file and a `README.md` file. Now, we will create the resource structure in
this package:

```{python}
resource_paths = sp.create_resource_structure(
    path=package_path
)
print(resource_paths)
```

With the resource folder structure created, we are now ready to fill it
with our raw data! Next, we'll set up the resource properties so that they
are ready to be saved into the `datapackage.json` file. We can use the
`path_resource()` helper function to always give us the correct path
to the specific resource's folder. In this case, our resource is
the first one in the package, so we can use `path_resource(1)`.

Review comment (Member Author): Thinking about this, it might make sense to have the extract_resource_properties() function not save a path. Otherwise, this could be a step that is easily forgotten - and the checks won’t catch it bc it’s not missing What do you think?

Reply (Member): Hmm, I'm not sure what you mean here. Could you expand?

```{python}
resource_properties = sp.create_resource_properties(
    properties=resource_properties,
    path=package_path / sp.path_resource(1)
)
pprint(resource_properties)
```

Review comment (Member Author): Since the extract_resource_properties() function has already created a ResourceProperties object, I feel like the need for this function has become a bit blurry to me. Couldn't we use the edit function to add the path instead?

Reply (Member): Yea, I'm not super happy with the language. edit_resource_properties() edits properties *that already exist* in the datapackage.json file. But those properties haven't been added to the datapackage.json file yet, we've only extracted the properties from the raw data.

::: callout-tip
If you want to see the list of resources available in your data package
via Python code (rather than looking at it directly in the file system),
you can use the `list_resources()` function.

```{python}
#| eval: false
print(sp.list_resources())
```
:::

This has set up the properties so they are ready to be added to the
`datapackage.json` file. Next, we write the properties to the
`datapackage.json` file:

```{python}
sp.write_resource_properties(
    properties=resource_properties,
    path=sp.path_properties()
)
```

We can check the contents of the `datapackage.json` file to see that the
resource properties have been added:

```{python}
pprint(sp.read_properties(package_path / sp.path_properties()))
```
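
If you prefer to inspect the file without Sprout, a `datapackage.json` file is ordinary JSON and can be read with the standard library. The sketch below writes a tiny, made-up descriptor to a temporary folder and lists the name and path of each resource; the contents are placeholders and not what Sprout would actually produce.

```python
import json
import tempfile
from pathlib import Path

# A made-up, minimal descriptor for illustration only.
descriptor = {
    "name": "example-package",
    "resources": [{"name": "patients", "path": "resources/1/data.parquet"}],
}

with tempfile.TemporaryDirectory() as folder:
    properties_path = Path(folder) / "datapackage.json"
    properties_path.write_text(json.dumps(descriptor, indent=2))

    # Read the file back and pull out each resource's name and path.
    loaded = json.loads(properties_path.read_text())
    resources = [(r["name"], r["path"]) for r in loaded["resources"]]

print(resources)
```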

## Storing a backup of the raw data

Review comment (Member Author): Just a reminder that based on #1078, this section needs an update :)

Review comment (Member Author): raw -> batch


Before we start processing the raw data into a Parquet file, it is a
good idea to store a backup of the raw data. This is useful if you need
to re-process the data at a later point, troubleshoot any issues, update
incorrect values, or compare the stored raw data to your
original raw data.

Review comment (Member Author): In relation to Sprout this isn’t really just “a good idea” but necessary, right?

Reply (Member): I've updated this.

As we showed above, the data is stored in the path that we've set as
`raw_data_path`. We can store this data in the resource's folder by
using:

```{python}
sp.write_resource_data_to_raw(
    data_path=raw_data_path,
    resource_properties=resource_properties
)
```

This function uses the properties object to determine where to store the
raw data, which is in the `raw/` folder of the resource's folder. We can
check the newly added file by using:

```{python}
print(sp.path_resource_raw_files(1))
```
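
One way to do the comparison mentioned above is to check that the stored copy and your original file have the same checksum. This is a generic sketch using the standard library, with temporary files standing in for the real ones; the `file_sha256()` helper is hypothetical and not part of Sprout.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 checksum of a file as a hex string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as folder:
    # Stand-ins for your original file and the copy stored in the package.
    original = Path(folder) / "patients.csv"
    original.write_text("id,height\n1,172.5\n")
    stored = Path(folder) / "stored-copy.csv"
    shutil.copy(original, stored)

    # Identical checksums mean the stored copy matches the original.
    copies_match = file_sha256(original) == file_sha256(stored)

print(copies_match)
```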

## Building the Parquet data resource file

Now that we've stored the raw data file, we can build the Parquet file
that will be used as the data resource. This Parquet file is built from
the raw data file that we've stored in the resource's folder.

```{python}
parquet_path = sp.build_resource_parquet(
    raw_files=sp.path_resource_raw_files(1),
    path=sp.path_resource_data(1)
)
print(parquet_path)
```

::: callout-tip
If you add more raw data to the resource later on, you can update this Parquet file to include all data in the raw folder using the `build_resource_parquet()` function as shown above.
:::

## Re-building the README file

One of the last steps to finish adding a new data resource is to
re-build the `README.md` file for the data package. To allow some
flexibility with what gets added to the README text, this next function
will only *build the text*, but not write it to the file. This allows
you to add additional information to the README text before writing it
to the file.

```{python}
readme_text = sp.build_readme_text(
    properties=sp.read_properties(package_path / sp.path_properties())
)
```

In this case, we don't want to add anything else, so we'll write the
text to the `README.md` file:

Review comment (Member Author): Couldn’t this be a nice to show how this would be done, actually?

Reply (Member): I'm not sure what you mean. Could you expand?


```{python}
sp.write_text(
    text=readme_text,
    # TODO: Make a helper function for this path?
    path=package_path / "README.md"
)
```
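
If you did want to add something, and assuming the built text is an ordinary string, you could append to it before writing. The sketch below uses a placeholder string in place of the real output of `build_readme_text()` and writes with `pathlib` to a temporary folder; the added section is made up.

```python
import tempfile
from pathlib import Path
from textwrap import dedent

# Placeholder for the text returned by `build_readme_text()`.
readme_text = "# Example package\n\nGenerated description.\n"

# A made-up extra section to append to the generated text.
extra_section = dedent(
    """
    ## Contact

    Questions about this data package can be sent to the maintainers.
    """
)
readme_text = readme_text + extra_section

with tempfile.TemporaryDirectory() as folder:
    readme_path = Path(folder) / "README.md"
    readme_path.write_text(readme_text)
    written = readme_path.read_text()

print(written)
```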

## Edit resource properties

After having created a resource, you may need to make edits to the
properties. While technically you can do this manually by opening up the
`datapackage.json` file and editing it, we've made these functions to
make it easier and to ensure that the `datapackage.json` file remains
valid JSON. To use the
`edit_resource_properties()` function, you give it the path to the
current properties and a new `ResourceProperties` object
with any changes you want to make. Anything in the new properties object
will overwrite the matching fields in the old properties object. This
function does not write back to the file; it only returns the new
properties object.

```{python}
resource_properties = sp.edit_resource_properties(
    # Helper function
    path=sp.path_properties(),
    properties=sp.ResourceProperties(
        title="Basic characteristics of patients"
    )
)
pprint(resource_properties)
```
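
The overwrite behaviour described above can be pictured with plain dictionaries: any value set in the new properties replaces the old one, and everything else is kept. This is only an illustration of the idea, assuming unset values are represented as `None`; it is not how `edit_resource_properties()` is implemented, and the property values are made up.

```python
# Made-up current properties, as if read from `datapackage.json`.
current = {
    "name": "patients",
    "title": "Patients",
    "description": "Patient data from the study.",
}

# Only the title is set in the new properties; `None` means "not set" here.
updates = {"name": None, "title": "Basic characteristics of patients"}

# Keep the current values and replace only those that are set in `updates`.
edited = {
    **current,
    **{key: value for key, value in updates.items() if value is not None},
}
print(edited)
```

The original dictionary is left untouched, which mirrors the point that editing returns a new object and nothing changes on disk until you write it back.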

To write back, you use the `write_resource_properties()` function:

```{python}
sp.write_resource_properties(
    properties=resource_properties,
    path=sp.path_properties()
)
```

```{python}
#| include: false
temp_path.cleanup()
```