Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: 📝 add docs for data types #1098
base: main
Are you sure you want to change the base?
docs: 📝 add docs for data types #1098
Changes from 1 commit
64f8e69
45bfaaa
3ba7826
26e6dc6
5a9bdc2
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the end of this, we could add a little sentence to give an overview of what comes next. Something like: “ […] In the following sections, we’ll unfold how the Frictionless data types are mapped to Polars (that uses (Py)Arrow as its engine) to Parquet as well as which of these data types are supported by Sprout.”
(feel free to rewrite it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Data Package actually calls the “data types” for “field types”, maybe that should be referenced here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not sure what is meant by “as used in”? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like, we discuss/define each Frictionless data type with an emphasis on how it's used in Sprout. So the definitions are not given for Frictionless data types exactly, but for their "Sprout-flavours".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on if we use Polars' own parquet engine or PyArrow in the implementation, I might remove the other column later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the Parquet column, I first give the primitive type, then the logical type in brackets, if relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martonvago could you put that comment in the table caption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for the properties?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the possible formats for a
string
here. Yeah, you would set it in theFieldProperties
.When Sprout supports these, that just means that we check that the given value looks like an email, uuid, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm still not clear on this. So if there was a column called
user_email
in a data frame, those values inside that column would be checked for being emails? So we could check all the values in that column to confirm that they are emails?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, not exactly.
Let's say they give resource properties like:
Now if they add some data to this resource, values in the column called
contact_address
will be checked for being emails, because that field hasformat
set toemail
in the properties. Values in the columnfirst_name
will not be checked against a specific format, becauseformat
is not set for this field in the properties.So whether we do format checks for a column depends on what's in the properties and not on the name of the column or anything in the data itself. And the check is done completely independently of saving the values to Parquet. They will be saved as a string in any case. Without the properties it will not be possible to tell that a particular column should be in email format.
(I'm happy to drop this bit and say we don't support formats for
string
type values if it's too confusing. Especially because these formats are not always easy/possible to check. We can have a regex for checking that an email looks roughly like an email, but the full email regex is so long that we probably don't want to use it. Then,uri
is such a broad category that even some validation libraries accept any string as an URI...)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mixing ... is not allowed
Subject is "mixing"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lwjohnst86 Just a further comment on negative dates:
It seems like whether negative dates are supported depends on the tool used to interact with Parquet files.
So Polars actually supports negative dates, but e.g. PyArrow and fastparquet don't.
In practice this means that you can write a negative date into a Parquet file using Polars, but there are no guarantees that that date will be read back correctly when not using Polars. So, e.g., you write
-1001-01-01
with Polars, someone tries to read your Parquet file with PyArrow, and they get an error. Parquet files can be shared and used outside of Sprout, so we don't really know how people will read their data.So if we expect Parquet files produced by Sprout to be read only with Polars, we can support negative dates. Otherwise we better not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fascinating! I think for now we can say we don't support negative dates... or that we don't guarantee anything 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To save it as a
date
we can add a dummy day value like...-01
. Thenyearmonth
s will at least be sortable. The other option is to save it asstring
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, so we won’t store
yearmonth
asYYYY-MM
but add a day value?And how come it’s not sortable? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can store it as YYYY-MM but then it has to be a string, because there is no date-based data type for yearmonth in Polars/Parquet. I thiiiink sorting that alphabetically would sort them in the correct chronological order, as long as the number of year digits is fixed and there are no negatives.
The advantage of storing it as a proper date is that we / whoever analyses the data can perform all date operations on the column.
I don't mind which way!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, thanks for elaborating! I think it makes (the most) sense to add a date (like
-01
) then :)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interval
types that are quite close matches for Frictionless'duration
. However, Arrow'sinterval
is, strangely, not converted to Parquet'sinterval
. It seems like it cannot be written to Parquet at all.interval
as a data type directly (using the tools we're using).duration
in Polars or Arrow. This has the same name as the Frictionless type, but it's actually just a number, e.g. the number of seconds. There is no unambiguous conversion from Frictionless'duration
to a number. E.g., if the duration is 1 month, even the number of days is not obvious.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for now, we just say, we don't support dealing with duration values. And instead suggest making a "start" and "end" columns. Or as you suggest, keep as a string and have it in the documentation that we don't do actual duration data types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that's a good idea:
duration
, but keep it as a string -- just in case that's useful to someoneThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this as a tip here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked this up and the convention seems to be first latitude then longitude, with lots of confusion in software.
Frictionless (of course) gives it as
LONG, LAT
.Which way should we have it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lwjohnst86 This is the only thing that needs a decision
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lwjohnst86 And this is the last thing to iron out 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow and Parquet have a JSON (logical) type, but using this doesn't seem to have any effect? It doesn't trigger any kind of JSON validation and saves bad JSON happily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's totally fine to say we don't do JSON (at least not yet)