-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: 📝 add docs for data types #1098
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super nice! Some small suggestions, mostly formatting.
docs/design/interface/data-types.qmd
Outdated
|
||
A sequence of UTF-8 encoded characters. | ||
|
||
Supported formats: `default`, `email`, `uri`, `binary`, `uuid`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for the properties?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the possible formats for a string
here. Yeah, you would set it in the FieldProperties
.
When Sprout supports these, that just means that we check that the given value looks like an email, uuid, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm still not clear on this. So if there was a column called user_email
in a data frame, those values inside that column would be checked for being emails? So we could check all the values in that column to confirm that they are emails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, not exactly.
Let's say they give resource properties like:
ResourceProperties(
schema=TableSchemaProperties(
fields=[
FieldProperties(name="contact_address", type="string", format="email"),
FieldProperties(name="first_name", type="string"),
]
)
)
Now if they add some data to this resource, values in the column called contact_address
will be checked for being emails, because that field has format
set to email
in the properties. Values in the column first_name
will not be checked against a specific format, because format
is not set for this field in the properties.
So whether we do format checks for a column depends on what's in the properties and not on the name of the column or anything in the data itself. And the check is done completely independently of saving the values to Parquet. They will be saved as a string in any case. Without the properties it will not be possible to tell that a particular column should be in email format.
(I'm happy to drop this bit and say we don't support formats for string
type values if it's too confusing. Especially because these formats are not always easy/possible to check. We can have a regex for checking that an email looks roughly like an email, but the full email regex is so long that we probably don't want to use it. Then, uri
is such a broad category that even some validation libraries accept any string as an URI...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to submit my comments, sorry!!
docs/design/interface/data-types.qmd
Outdated
| `array` | `String` | `string` | `BYTE_ARRAY` (String) | | ||
| `any` | `String` | `string` | `BYTE_ARRAY` (String) | | ||
|
||
: Mappings of Frictionless data types in Sprout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on if we use Polars' own parquet engine or PyArrow in the implementation, I might remove the other column later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the Parquet column, I first give the primitive type, then the logical type in brackets, if relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martonvago could you put that comment in the table caption?
|
||
Example: `2022-12`. | ||
|
||
The underlying representation of `yearmonth` is `date`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To save it as a date
we can add a dummy day value like ...-01
. Then yearmonth
s will at least be sortable. The other option is to save it as string
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, so we won’t store yearmonth
as YYYY-MM
but add a day value?
And how come it’s not sortable? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can store it as YYYY-MM but then it has to be a string, because there is no date-based data type for yearmonth in Polars/Parquet. I thiiiink sorting that alphabetically would sort them in the correct chronological order, as long as the number of year digits is fixed and there are no negatives.
The advantage of storing it as a proper date is that we / whoever analyses the data can perform all date operations on the column.
I don't mind which way!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, thanks for elaborating! I think it makes (the most) sense to add a date (like -01
) then :)
|
||
**Restrictions:** | ||
|
||
- The underlying representation of `duration` is `string`. Sprout does not attempt to parse `duration` values into a data type that is aware of the various time units contained within a `duration` value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Both Arrow and Parquet have
interval
types that are quite close matches for Frictionless'duration
. However, Arrow'sinterval
is, strangely, not converted to Parquet'sinterval
. It seems like it cannot be written to Parquet at all. - I don't think it's possible for us to choose Parquet's
interval
as a data type directly (using the tools we're using). - This leaves us with a type like
duration
in Polars or Arrow. This has the same name as the Frictionless type, but it's actually just a number, e.g. the number of seconds. There is no unambiguous conversion from Frictionless'duration
to a number. E.g., if the duration is 1 month, even the number of days is not obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for now, we just say, we don't support dealing with duration values. And instead suggest making a "start" and "end" columns. Or as you suggest, keep as a string and have it in the documentation that we don't do actual duration data types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that's a good idea:
- Allow Frictionless'
duration
, but keep it as a string -- just in case that's useful to someone - Suggest representing duration either with "start" and "end" columns or with a simple number, depending on the kind of data they have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this as a tip here
docs/design/interface/data-types.qmd
Outdated
|
||
A geographic point. | ||
|
||
Expected format: `LAT, LONG`. The space is optional. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked this up and the convention seems to be first latitude then longitude, with lots of confusion in software.
Frictionless (of course) gives it as LONG, LAT
.
Which way should we have it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lwjohnst86 This is the only thing that needs a decision
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lwjohnst86 And this is the last thing to iron out 😊
docs/design/interface/data-types.qmd
Outdated
|
||
A JSON array. Must be well-formed [JSON](http://json.org/). | ||
|
||
The underlying representation is `string`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow and Parquet have a JSON (logical) type, but using this doesn't seem to have any effect? It doesn't trigger any kind of JSON validation and saves bad JSON happily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's totally fine to say we don't do JSON (at least not yet)
Co-authored-by: Luke W. Johnston <lwjohnst86@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Just some more comments and very minor formatting fixes.
docs/design/interface/data-types.qmd
Outdated
| `array` | `String` | `string` | `BYTE_ARRAY` (String) | | ||
| `any` | `String` | `string` | `BYTE_ARRAY` (String) | | ||
|
||
: Mappings of Frictionless data types in Sprout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martonvago could you put that comment in the table caption?
docs/design/interface/data-types.qmd
Outdated
|
||
A sequence of UTF-8 encoded characters. | ||
|
||
Supported formats: `default`, `email`, `uri`, `binary`, `uuid`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm still not clear on this. So if there was a column called user_email
in a data frame, those values inside that column would be checked for being emails? So we could check all the values in that column to confirm that they are emails?
docs/design/interface/data-types.qmd
Outdated
|
||
A JSON array. Must be well-formed [JSON](http://json.org/). | ||
|
||
The underlying representation is `string`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's totally fine to say we don't do JSON (at least not yet)
Co-authored-by: Luke W. Johnston <lwjohnst86@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! We'll just get @signekb review then merge it in!
**Restrictions:** | ||
|
||
- Setting a custom `date` pattern in the `format` property is not yet supported. The `any` format is not supported. | ||
- Negative `date` values are not supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lwjohnst86 Just a further comment on negative dates:
It seems like whether negative dates are supported depends on the tool used to interact with Parquet files.
So Polars actually supports negative dates, but e.g. PyArrow and fastparquet don't.
In practice this means that you can write a negative date into a Parquet file using Polars, but there are no guarantees that that date will be read back correctly when not using Polars. So, e.g., you write -1001-01-01
with Polars, someone tries to read your Parquet file with PyArrow, and they get an error. Parquet files can be shared and used outside of Sprout, so we don't really know how people will read their data.
So if we expect Parquet files produced by Sprout to be read only with Polars, we can support negative dates. Otherwise we better not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fascinating! I think for now we can say we don't support negative dates... or that we don't guarantee anything 😛
I’m reviewing this currently 🌱 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice - and it seems like a lot of work has gone into this 🎉
I just have some minor suggestions and a couple of questions for my own understanding. Nice work!
docs/design/interface/data-types.qmd
Outdated
--- | ||
|
||
|
||
Sprout implements the Frictionless Data Package standard and aims to support the [data types](https://datapackage.org/standard/table-schema/#field-types) it defines. However, Sprout not only describes data with metadata but also transforms it into a tidy Parquet file, ready for querying (see [Outputs](/docs/design/interface/outputs.qmd#files) and [Why Parquet](https://decisions.seedcase-project.org/why-parquet/) for more details). As a result, Sprout supports only data types that are compatible (or can be made compatible) with Parquet storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the end of this, we could add a little sentence to give an overview of what comes next. Something like: “ […] In the following sections, we’ll unfold how the Frictionless data types are mapped to Polars (that uses (Py)Arrow as its engine) to Parquet as well as which of these data types are supported by Sprout.”
(feel free to rewrite it)
|
||
Sprout implements the Frictionless Data Package standard and aims to support the [data types](https://datapackage.org/standard/table-schema/#field-types) it defines. However, Sprout not only describes data with metadata but also transforms it into a tidy Parquet file, ready for querying (see [Outputs](/docs/design/interface/outputs.qmd#files) and [Why Parquet](https://decisions.seedcase-project.org/why-parquet/) for more details). As a result, Sprout supports only data types that are compatible (or can be made compatible) with Parquet storage. | ||
|
||
Below, we list Frictionless data types as used in Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below, we list Frictionless data types as used in Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted. | |
Below, we list Frictionless data types used Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted. |
I’m not sure what is meant by “as used in”? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like, we discuss/define each Frictionless data type with an emphasis on how it's used in Sprout. So the definitions are not given for Frictionless data types exactly, but for their "Sprout-flavours".
|
||
## Boolean | ||
|
||
One of two possible values: true or false. Sprout supports the default notation for truth values in Frictionless: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of two possible values: true or false. Sprout supports the default notation for truth values in Frictionless: | |
One of two possible values: `true` or `false`. Sprout supports the default notation for truth values in Frictionless: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the idea, but maybe double quotes? Because here I kinda mean the conceptual categories of truthiness and falsy-ness, not any concrete code values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Double quotes would be fine for me too 👍
- All values in `["true", "True", "TRUE", "1"]` are interpreted as true. | ||
- All values in `["false", "False", "FALSE", "0"]` are interpreted as false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- All values in `["true", "True", "TRUE", "1"]` are interpreted as true. | |
- All values in `["false", "False", "FALSE", "0"]` are interpreted as false. | |
- All values in `["true", "True", "TRUE", "1"]` are interpreted as `true`. | |
- All values in `["false", "False", "FALSE", "0"]` are interpreted as `false`. |
- Expected format: `YYYY`. Negative `year` values are allowed. | ||
- Examples: `2022`, `-1000`, `0005`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add table with examples. Then one of the examples would be a negative year as well.
|
||
Example: `2022-12`. | ||
|
||
The underlying representation of `yearmonth` is `date`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, so we won’t store yearmonth
as YYYY-MM
but add a day value?
And how come it’s not sortable? 🤔
- Expected format: `LAT, LONG`. The space is optional. | ||
- Examples: `45.50, 90.50`, `45.50,90.50`, `-45.50, -90.50`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could be a table as well? :)
…#1106) ## Description This PR adds the mapping from Frictionless data types to Polars data types described in #1098. I'm adding this now because I'm using it in the checks and presumably Signe will use it when writing to Parquet. <!-- Select quick/in-depth as necessary --> This PR needs a quick review. ## Checklist - [x] Added or updated tests - [x] Updated documentation - [x] Ran `just run-all`
Co-authored-by: Signe Kirk Brødbæk <40836345+signekb@users.noreply.github.com>
Description
This PR adds a document that maps Frictionless data types to Parquet data types. It also has a go at describing how / to what extent each data type is supported by Sprout.
When reading this, it's worth keeping in mind that the flow of data through Sprout is something like:
where PyArrow is a library for working with columnar data, including Parquet. Polars can either use PyArrow when working with Parquet files or its own engine. This is relevant because the engine used will determine what data types we can map to and how that mapping/conversion will be done. We can choose the engine. Generally, PyArrow data types seem to be closer to Parquet data types.
Closes #1090
(No line breaks in the markdown, alas.)
This PR needs an in-depth review.
Checklist
just run-all