Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: 📝 add docs for data types #1098

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

docs: 📝 add docs for data types #1098

wants to merge 5 commits into from

Conversation

martonvago
Copy link
Contributor

@martonvago martonvago commented Feb 27, 2025

Description

This PR adds a document that maps Frictionless data types to Parquet data types. It also has a go at describing how / to what extent each data type is supported by Sprout.

When reading this, it's worth keeping in mind that the flow of data through Sprout is something like:

textual data in Frictionless format (e.g. CSV) > Polars ( > PyArrow) > Parquet

where PyArrow is a library for working with columnar data, including Parquet. Polars can either use PyArrow when working with Parquet files or its own engine. This is relevant because the engine used will determine what data types we can map to and how that mapping/conversion will be done. We can choose the engine. Generally, PyArrow data types seem to be closer to Parquet data types.

Closes #1090

(No line breaks in the markdown, alas.)

This PR needs an in-depth review.

Checklist

  • Added or updated tests
  • Updated documentation
  • Ran just run-all

@martonvago martonvago self-assigned this Feb 27, 2025
@martonvago martonvago marked this pull request as ready for review February 28, 2025 07:48
@martonvago martonvago requested a review from a team as a code owner February 28, 2025 07:48
Copy link
Member

@lwjohnst86 lwjohnst86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nice! Some small suggestions, mostly formatting.


A sequence of UTF-8 encoded characters.

Supported formats: `default`, `email`, `uri`, `binary`, `uuid`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for the properties?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the possible formats for a string here. Yeah, you would set it in the FieldProperties.
When Sprout supports these, that just means that we check that the given value looks like an email, uuid, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm still not clear on this. So if there was a column called user_email in a data frame, those values inside that column would be checked for being emails? So we could check all the values in that column to confirm that they are emails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, not exactly.
Let's say they give resource properties like:

ResourceProperties(
    schema=TableSchemaProperties(
        fields=[
            FieldProperties(name="contact_address", type="string", format="email"),
            FieldProperties(name="first_name", type="string"),
        ]
    )
)

Now if they add some data to this resource, values in the column called contact_address will be checked for being emails, because that field has format set to email in the properties. Values in the column first_name will not be checked against a specific format, because format is not set for this field in the properties.

So whether we do format checks for a column depends on what's in the properties and not on the name of the column or anything in the data itself. And the check is done completely independently of saving the values to Parquet. They will be saved as a string in any case. Without the properties it will not be possible to tell that a particular column should be in email format.

(I'm happy to drop this bit and say we don't support formats for string type values if it's too confusing. Especially because these formats are not always easy/possible to check. We can have a regex for checking that an email looks roughly like an email, but the full email regex is so long that we probably don't want to use it. Then, uri is such a broad category that even some validation libraries accept any string as an URI...)

Copy link
Contributor Author

@martonvago martonvago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to submit my comments, sorry!!

| `array` | `String` | `string` | `BYTE_ARRAY` (String) |
| `any` | `String` | `string` | `BYTE_ARRAY` (String) |

: Mappings of Frictionless data types in Sprout
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on if we use Polars' own parquet engine or PyArrow in the implementation, I might remove the other column later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Parquet column, I first give the primitive type, then the logical type in brackets, if relevant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martonvago could you put that comment in the table caption?


Example: `2022-12`.

The underlying representation of `yearmonth` is `date`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To save it as a date we can add a dummy day value like ...-01. Then yearmonths will at least be sortable. The other option is to save it as string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, so we won’t store yearmonth as YYYY-MM but add a day value?
And how come it’s not sortable? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can store it as YYYY-MM but then it has to be a string, because there is no date-based data type for yearmonth in Polars/Parquet. I thiiiink sorting that alphabetically would sort them in the correct chronological order, as long as the number of year digits is fixed and there are no negatives.

The advantage of storing it as a proper date is that we / whoever analyses the data can perform all date operations on the column.

I don't mind which way!

Copy link
Member

@signekb signekb Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, thanks for elaborating! I think it makes (the most) sense to add a date (like -01) then :)


**Restrictions:**

- The underlying representation of `duration` is `string`. Sprout does not attempt to parse `duration` values into a data type that is aware of the various time units contained within a `duration` value.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Both Arrow and Parquet have interval types that are quite close matches for Frictionless' duration. However, Arrow's interval is, strangely, not converted to Parquet's interval. It seems like it cannot be written to Parquet at all.
  • I don't think it's possible for us to choose Parquet's interval as a data type directly (using the tools we're using).
  • This leaves us with a type like duration in Polars or Arrow. This has the same name as the Frictionless type, but it's actually just a number, e.g. the number of seconds. There is no unambiguous conversion from Frictionless' duration to a number. E.g., if the duration is 1 month, even the number of days is not obvious.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now, we just say, we don't support dealing with duration values. And instead suggest making a "start" and "end" columns. Or as you suggest, keep as a string and have it in the documentation that we don't do actual duration data types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that's a good idea:

  • Allow Frictionless' duration, but keep it as a string -- just in case that's useful to someone
  • Suggest representing duration either with "start" and "end" columns or with a simple number, depending on the kind of data they have

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this as a tip here


A geographic point.

Expected format: `LAT, LONG`. The space is optional.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked this up and the convention seems to be first latitude then longitude, with lots of confusion in software.
Frictionless (of course) gives it as LONG, LAT.
Which way should we have it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwjohnst86 This is the only thing that needs a decision

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwjohnst86 And this is the last thing to iron out 😊


A JSON array. Must be well-formed [JSON](http://json.org/).

The underlying representation is `string`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow and Parquet have a JSON (logical) type, but using this doesn't seem to have any effect? It doesn't trigger any kind of JSON validation and saves bad JSON happily.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's totally fine to say we don't do JSON (at least not yet)

Co-authored-by: Luke W. Johnston <lwjohnst86@users.noreply.github.com>
Copy link
Member

@lwjohnst86 lwjohnst86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Just some more comments and very minor formatting fixes.

| `array` | `String` | `string` | `BYTE_ARRAY` (String) |
| `any` | `String` | `string` | `BYTE_ARRAY` (String) |

: Mappings of Frictionless data types in Sprout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martonvago could you put that comment in the table caption?


A sequence of UTF-8 encoded characters.

Supported formats: `default`, `email`, `uri`, `binary`, `uuid`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm still not clear on this. So if there was a column called user_email in a data frame, those values inside that column would be checked for being emails? So we could check all the values in that column to confirm that they are emails?


A JSON array. Must be well-formed [JSON](http://json.org/).

The underlying representation is `string`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's totally fine to say we don't do JSON (at least not yet)

martonvago and others added 2 commits March 3, 2025 13:02
Co-authored-by: Luke W. Johnston <lwjohnst86@users.noreply.github.com>
@martonvago martonvago requested a review from lwjohnst86 March 3, 2025 13:33
lwjohnst86
lwjohnst86 previously approved these changes Mar 4, 2025
Copy link
Member

@lwjohnst86 lwjohnst86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! We'll just get @signekb review then merge it in!

**Restrictions:**

- Setting a custom `date` pattern in the `format` property is not yet supported. The `any` format is not supported.
- Negative `date` values are not supported.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lwjohnst86 Just a further comment on negative dates:
It seems like whether negative dates are supported depends on the tool used to interact with Parquet files.
So Polars actually supports negative dates, but e.g. PyArrow and fastparquet don't.

In practice this means that you can write a negative date into a Parquet file using Polars, but there are no guarantees that that date will be read back correctly when not using Polars. So, e.g., you write -1001-01-01 with Polars, someone tries to read your Parquet file with PyArrow, and they get an error. Parquet files can be shared and used outside of Sprout, so we don't really know how people will read their data.

So if we expect Parquet files produced by Sprout to be read only with Polars, we can support negative dates. Otherwise we better not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fascinating! I think for now we can say we don't support negative dates... or that we don't guarantee anything 😛

@signekb
Copy link
Member

signekb commented Mar 6, 2025

I’m reviewing this currently 🌱

Copy link
Member

@signekb signekb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice - and it seems like a lot of work has gone into this 🎉
I just have some minor suggestions and a couple of questions for my own understanding. Nice work!

---


Sprout implements the Frictionless Data Package standard and aims to support the [data types](https://datapackage.org/standard/table-schema/#field-types) it defines. However, Sprout not only describes data with metadata but also transforms it into a tidy Parquet file, ready for querying (see [Outputs](/docs/design/interface/outputs.qmd#files) and [Why Parquet](https://decisions.seedcase-project.org/why-parquet/) for more details). As a result, Sprout supports only data types that are compatible (or can be made compatible) with Parquet storage.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the end of this, we could add a little sentence to give an overview of what comes next. Something like: “ […] In the following sections, we’ll unfold how the Frictionless data types are mapped to Polars (that uses (Py)Arrow as its engine) to Parquet as well as which of these data types are supported by Sprout.”
(feel free to rewrite it)


Sprout implements the Frictionless Data Package standard and aims to support the [data types](https://datapackage.org/standard/table-schema/#field-types) it defines. However, Sprout not only describes data with metadata but also transforms it into a tidy Parquet file, ready for querying (see [Outputs](/docs/design/interface/outputs.qmd#files) and [Why Parquet](https://decisions.seedcase-project.org/why-parquet/) for more details). As a result, Sprout supports only data types that are compatible (or can be made compatible) with Parquet storage.

Below, we list Frictionless data types as used in Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Below, we list Frictionless data types as used in Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted.
Below, we list Frictionless data types used Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted.

I’m not sure what is meant by “as used in”? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like, we discuss/define each Frictionless data type with an emphasis on how it's used in Sprout. So the definitions are not given for Frictionless data types exactly, but for their "Sprout-flavours".


## Boolean

One of two possible values: true or false. Sprout supports the default notation for truth values in Frictionless:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
One of two possible values: true or false. Sprout supports the default notation for truth values in Frictionless:
One of two possible values: `true` or `false`. Sprout supports the default notation for truth values in Frictionless:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the idea, but maybe double quotes? Because here I kinda mean the conceptual categories of truthiness and falsy-ness, not any concrete code values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Double quotes would be fine for me too 👍

Comment on lines +68 to +69
- All values in `["true", "True", "TRUE", "1"]` are interpreted as true.
- All values in `["false", "False", "FALSE", "0"]` are interpreted as false.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- All values in `["true", "True", "TRUE", "1"]` are interpreted as true.
- All values in `["false", "False", "FALSE", "0"]` are interpreted as false.
- All values in `["true", "True", "TRUE", "1"]` are interpreted as `true`.
- All values in `["false", "False", "FALSE", "0"]` are interpreted as `false`.

Comment on lines +129 to +131
- Expected format: `YYYY`. Negative `year` values are allowed.
- Examples: `2022`, `-1000`, `0005`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add table with examples. Then one of the examples would be a negative year as well.


Example: `2022-12`.

The underlying representation of `yearmonth` is `date`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, so we won’t store yearmonth as YYYY-MM but add a day value?
And how come it’s not sortable? 🤔

Comment on lines +176 to +177
- Expected format: `LAT, LONG`. The space is optional.
- Examples: `45.50, 90.50`, `45.50,90.50`, `-45.50, -90.50`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could be a table as well? :)

signekb pushed a commit that referenced this pull request Mar 6, 2025
…#1106)

## Description

This PR adds the mapping from Frictionless data types to Polars data
types described in
#1098.

I'm adding this now because I'm using it in the checks and presumably
Signe will use it when writing to Parquet.

<!-- Select quick/in-depth as necessary -->
This PR needs a quick review.

## Checklist

- [x] Added or updated tests
- [x] Updated documentation
- [x] Ran `just run-all`
Co-authored-by: Signe Kirk Brødbæk <40836345+signekb@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Define data types supported by Sprout
3 participants