docs: 📝 add docs for data types #1098

Open
wants to merge 5 commits into
base: main
Changes from 2 commits
214 changes: 214 additions & 0 deletions docs/design/interface/data-types.qmd
@@ -0,0 +1,214 @@
---
title: "Data types"
description: "The data types Sprout supports"
---


Sprout implements the Frictionless Data Package standard and aims to support the [data types](https://datapackage.org/standard/table-schema/#field-types) it defines. However, Sprout not only describes data with metadata but also transforms it into a tidy Parquet file, ready for querying (see [Outputs](/docs/design/interface/outputs.qmd#files) and [Why Parquet](https://decisions.seedcase-project.org/why-parquet/) for more details). As a result, Sprout supports only data types that are compatible (or can be made compatible) with Parquet storage.
Member

At the end of this, we could add a little sentence to give an overview of what comes next. Something like: “ […] In the following sections, we’ll unfold how the Frictionless data types are mapped to Polars (that uses (Py)Arrow as its engine) to Parquet as well as which of these data types are supported by Sprout.”
(feel free to rewrite it)

Member

Since Data Package actually calls “data types” “field types”, maybe that should be mentioned here?


Below, we list Frictionless data types as used in Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted.
Member

Suggested change
Below, we list Frictionless data types as used in Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted.
Below, we list Frictionless data types used in Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted.

I’m not sure what is meant by “as used in”? :)

Contributor Author

Like, we discuss/define each Frictionless data type with an emphasis on how it's used in Sprout. So the definitions are not given for Frictionless data types exactly, but for their "Sprout-flavours".


| Frictionless | Polars | Arrow | Parquet |
|--------------|---------------------|------------------------------|------------------------|
| `string` | `String` | `string` | `BYTE_ARRAY` (String) |
| `number` | `Float64` | `float64` | `DOUBLE` |
| `integer` | `Int64` | `int64` | `INT64` |
| `boolean` | `Boolean` | `bool_` | `BOOLEAN` |
| `datetime` | `Datetime` | `timestamp[ms, tz]` | `INT64` (Timestamp) |
| `date` | `Date` | `date32[day]` | `INT32` (Date) |
| `time` | `Time` | `time64[ns]` | `INT64` (Time) |
| `year` | `Int32` | `int32` | `INT32` |
| `yearmonth` | `Date` | `date32[day]` | `INT32` (Date) |
| `duration` | `String` | `string` | `BYTE_ARRAY` (String) |
| `geopoint` | `Array[Float64, 2]` | `fixed_size_list[double, 2]` | `FIXED_LEN_BYTE_ARRAY` |
| `geojson` | `String` | `string` | `BYTE_ARRAY` (String) |
| `object` | `String` | `string` | `BYTE_ARRAY` (String) |
| `array` | `String` | `string` | `BYTE_ARRAY` (String) |
| `any` | `String` | `string` | `BYTE_ARRAY` (String) |

: Mappings of Frictionless data types in Sprout
Contributor Author

Depending on whether we use Polars' own Parquet engine or PyArrow in the implementation, I might remove the other column later.

Contributor Author

In the Parquet column, I first give the primitive type, then the logical type in brackets, if relevant.

Member

@martonvago could you put that comment in the table caption?


## String

A sequence of UTF-8 encoded characters.

Supported formats: `default`, `email`, `uri`, `binary`, `uuid`.
Member

Is this for the properties?

Contributor Author

It's the possible formats for a string here. Yeah, you would set it in the FieldProperties.
When Sprout supports these, that just means that we check that the given value looks like an email, uuid, etc.

Member

Hmm, I'm still not clear on this. So if there was a column called user_email in a data frame, those values inside that column would be checked for being emails? So we could check all the values in that column to confirm that they are emails?

Contributor Author

Ah, not exactly.
Let's say they give resource properties like:

ResourceProperties(
    schema=TableSchemaProperties(
        fields=[
            FieldProperties(name="contact_address", type="string", format="email"),
            FieldProperties(name="first_name", type="string"),
        ]
    )
)

Now if they add some data to this resource, values in the column called contact_address will be checked for being emails, because that field has format set to email in the properties. Values in the column first_name will not be checked against a specific format, because format is not set for this field in the properties.

So whether we do format checks for a column depends on what's in the properties and not on the name of the column or anything in the data itself. And the check is done completely independently of saving the values to Parquet. They will be saved as a string in any case. Without the properties it will not be possible to tell that a particular column should be in email format.

(I'm happy to drop this bit and say we don't support formats for string type values if it's too confusing. Especially because these formats are not always easy/possible to check. We can have a regex for checking that an email looks roughly like an email, but the full email regex is so long that we probably don't want to use it. Then, uri is such a broad category that even some validation libraries accept any string as a URI...)
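
The property-driven check described in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: `FORMAT_CHECKS`, `check_column`, and the loose email pattern are hypothetical names and choices, not Sprout's implementation, and only `email` and `uuid` are covered.

```python
import re
import uuid
from typing import Optional

# Deliberately loose pattern (assumption): something@something.tld, no whitespace.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(value: str) -> bool:
    return EMAIL_PATTERN.fullmatch(value) is not None


def looks_like_uuid(value: str) -> bool:
    # uuid.UUID is lenient: it also accepts braces and a "urn:uuid:" prefix.
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


# Hypothetical registry: format name -> check function.
# Formats without an entry (including no format at all) are not checked.
FORMAT_CHECKS = {"email": looks_like_email, "uuid": looks_like_uuid}


def check_column(values: list[str], field_format: Optional[str]) -> list[str]:
    """Return the values that fail the check for the field's format."""
    check = FORMAT_CHECKS.get(field_format)
    if check is None:
        return []
    return [value for value in values if not check(value)]
```

Whether a column is checked depends only on the `format` passed in from the properties, never on the column name or the data; the values themselves stay plain strings either way.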


## Number

A number with or without a decimal part. Precision is not reflected in the number of digits specified in the decimal part (`2.0` is not distinct from `2.00`).

Values have a precision of up to 16 significant digits.

| Name | Allowed values | Examples |
|----------------------------|-----------------------------------------|----------------------------|
| decimal indicator | `.` | `12.56`, `50.000` |
| leading sign | `+` (default) or `-` | `12.56`, `+5.00`, `-12` |
| leading and trailing zeros | `0` (optional) | `12.56`, `0012`, `12.5600` |
| exponent | `E<sign><decimal digits>` | `5E10`, `12.56E-3` |
| special values | `NaN`, `inf`, `-inf` (case-insensitive) | |

: `number` format options

The configuration options `decimalChar`, `groupChar` and `bareNumber` are not yet supported.
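
As an illustration of the `number` rules above, here is a minimal parsing sketch. The function name `parse_number` is hypothetical, and the pattern assumes that at least one digit is required on each side of the decimal indicator and that the exponent marker is an uppercase `E`; neither detail is stated in the table.

```python
import re

# Optional sign, digits, optional decimal part, optional exponent.
NUMBER_PATTERN = re.compile(r"[+-]?[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?")

# Special values are matched case-insensitively.
SPECIAL_VALUES = {"nan", "inf", "-inf"}


def parse_number(value: str) -> float:
    """Parse a `number` value into a 64-bit float or raise ValueError."""
    if value.lower() in SPECIAL_VALUES or NUMBER_PATTERN.fullmatch(value):
        return float(value)
    raise ValueError(f"not a supported number: {value!r}")
```

Because the result is a float, `parse_number("2.0")` and `parse_number("2.00")` are indistinguishable, and a grouped value such as `1,000` is rejected since `groupChar` is not supported.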

## Integer

A whole number with no decimal part.

| Name | Allowed values | Examples |
|---------------|----------------|------------|
| leading sign | no sign or `-` | `5`, `-12` |
| leading zeros | `0` (optional) | `005` |

: `integer` format options

The configuration options `groupChar` and `bareNumber` are not yet supported.
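
The `integer` rules can be sketched in the same way. `parse_integer` is a hypothetical name; following the table, a leading `+` is rejected and leading zeros are dropped.

```python
import re

# No sign or a leading minus, followed by digits (leading zeros allowed).
INTEGER_PATTERN = re.compile(r"-?[0-9]+")


def parse_integer(value: str) -> int:
    """Parse an `integer` value or raise ValueError."""
    if INTEGER_PATTERN.fullmatch(value) is None:
        raise ValueError(f"not a supported integer: {value!r}")
    return int(value)
```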

## Boolean

One of two possible values: true or false.

Sprout supports the default notation for truth values in Frictionless:

- All values in `["true", "True", "TRUE", "1"]` are interpreted as true.
- All values in `["false", "False", "FALSE", "0"]` are interpreted as false.
Comment on lines +68 to +69
Member

Suggested change
- All values in `["true", "True", "TRUE", "1"]` are interpreted as true.
- All values in `["false", "False", "FALSE", "0"]` are interpreted as false.
- All values in `["true", "True", "TRUE", "1"]` are interpreted as `true`.
- All values in `["false", "False", "FALSE", "0"]` are interpreted as `false`.


Setting custom `trueValues` and `falseValues` is not yet supported.
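
A minimal sketch of the default truth-value notation, assuming that any value outside the two lists is an error. The names `TRUE_VALUES`, `FALSE_VALUES`, and `parse_boolean` are illustrative only.

```python
TRUE_VALUES = frozenset({"true", "True", "TRUE", "1"})
FALSE_VALUES = frozenset({"false", "False", "FALSE", "0"})


def parse_boolean(value: str) -> bool:
    """Map a default Frictionless truth value to a Python bool."""
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"not a supported boolean: {value!r}")
```

Note that the match is exact, so a mixed-case value such as `tRuE` is not accepted.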

## Datetime

A date with a time and optional timezone.

| Name | Allowed values | Examples |
|----------|----------------------------------------------|-----------------------------------------------|
| without milliseconds or timezone | `YYYY-MM-DDTHH:MM:SS` | `2002-10-12T12:04:15`, `0202-10-10T02:30:00` |
| with milliseconds | `YYYY-MM-DDTHH:MM:SS.sss` | `2002-10-12T12:04:15.3`, `0202-10-10T02:30:00.345` |
| with timezone | `YYYY-MM-DDTHH:MM:SS<sign>HH:MM` | `2002-10-12T12:04:15+05:00`, `0202-10-10T02:30:00-01:00` |
| with milliseconds and timezone | `YYYY-MM-DDTHH:MM:SS.sss<sign>HH:MM` | `2002-10-12T12:04:15.3+05:00`, `0202-10-10T02:30:00.345-01:00` |
| shorthand for UTC | `YYYY-MM-DDTHH:MM:SS(.sss)Z` | `2002-10-12T12:04:15Z`, `0202-10-10T02:30:00.345Z` |

: `datetime` format options

**Restrictions:**

- Setting a custom `datetime` pattern in the `format` property is not yet supported. The `any` format is not supported.
- Negative `datetime` values are not supported.
- Years with more than 4 digits are not supported.
- Mixing `datetime` values with and without a timezone in one column is not allowed.
Member

Suggested change
- Mixing `datetime` values with and without a timezone in one column is not allowed.
- Mixing `datetime` values with and without a timezone in one column are not allowed.

Contributor Author

Mixing ... is not allowed
Subject is "mixing"

- When `datetime` values in a column have a timezone, they must all have the same timezone.
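
The two column-level timezone restrictions can be illustrated with a short sketch. It only checks the shape of each value, not whether the date and time components are in range, and it assumes that `Z` counts as the same timezone as `+00:00`. The function names are hypothetical.

```python
import re
from typing import Optional

# Shape of a supported datetime: 4-digit year, optional milliseconds,
# optional timezone given as Z or <sign>HH:MM.
DATETIME_PATTERN = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
    r"(?:\.[0-9]{1,3})?"
    r"(?P<tz>Z|[+-][0-9]{2}:[0-9]{2})?"
)


def timezone_of(value: str) -> Optional[str]:
    """Return the timezone offset of a datetime string, or None if it has none."""
    match = DATETIME_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"not a supported datetime: {value!r}")
    tz = match.group("tz")
    # Assumption: the UTC shorthand is equivalent to an explicit +00:00 offset.
    return "+00:00" if tz == "Z" else tz


def has_consistent_timezone(values: list[str]) -> bool:
    """True if all values are naive or all share one timezone offset."""
    return len({timezone_of(value) for value in values}) <= 1
```

A column mixing naive and timezone-aware values yields both `None` and an offset, so it fails the same check as a column with two different offsets.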

## Date

A date without a time.

- Expected format: `YYYY-MM-DD`.
- Example: `2022-12-09`.
Comment on lines +99 to +100
Member

Make this a table like the other data types?

Member

Same for year and yearmonth below


**Restrictions:**

- Setting a custom `date` pattern in the `format` property is not yet supported. The `any` format is not supported.
- Negative `date` values are not supported.
Contributor Author

@lwjohnst86 Just a further comment on negative dates:
It seems like whether negative dates are supported depends on the tool used to interact with Parquet files.
So Polars actually supports negative dates, but e.g. PyArrow and fastparquet don't.

In practice this means that you can write a negative date into a Parquet file using Polars, but there are no guarantees that that date will be read back correctly when not using Polars. So, e.g., you write -1001-01-01 with Polars, someone tries to read your Parquet file with PyArrow, and they get an error. Parquet files can be shared and used outside of Sprout, so we don't really know how people will read their data.

So if we expect Parquet files produced by Sprout to be read only with Polars, we can support negative dates. Otherwise we'd better not.

Member

Fascinating! I think for now we can say we don't support negative dates... or that we don't guarantee anything 😛

- Years with more than 4 digits are not supported.
- `date` values with a timezone are not supported.

## Time

A time without a date.

| Name | Allowed values | Examples |
|----------------------|-------------------|---------------------------------|
| without microseconds | `HH:MM:SS` | `12:04:15` |
| with microseconds | `HH:MM:SS.ssssss` | `12:04:15.3`, `02:30:00.345345` |

: `time` format options

**Restrictions:**

- Setting a custom `time` pattern in the `format` property is not yet supported. The `any` format is not supported.
- `time` values with a timezone are not supported.

## Year

A calendar year without month or day.

- Expected format: `YYYY`. Negative `year` values are allowed.
- Examples: `2022`, `-1000`, `0005`.

Comment on lines +129 to +131
Member

Add table with examples. Then one of the examples would be a negative year as well.

**Restrictions:**

- `year` values with a timezone are not supported.

## Yearmonth

A specific month in a specific year.

- Expected format: `YYYY-MM`.
- Example: `2022-12`.

The underlying representation of `yearmonth` is `date`.
Contributor Author

To save it as a date we can add a dummy day value like ...-01. Then yearmonths will at least be sortable. The other option is to save it as string.

Member

Wait, so we won’t store yearmonth as YYYY-MM but add a day value?
And how come it’s not sortable? 🤔

Contributor Author

We can store it as YYYY-MM but then it has to be a string, because there is no date-based data type for yearmonth in Polars/Parquet. I thiiiink sorting that alphabetically would sort them in the correct chronological order, as long as the number of year digits is fixed and there are no negatives.

The advantage of storing it as a proper date is that we / whoever analyses the data can perform all date operations on the column.

I don't mind which way!

Member (@signekb, Mar 7, 2025)

Ahh, thanks for elaborating! I think it makes (the most) sense to add a date (like -01) then :)


**Restrictions:**

- Negative `yearmonth` values are not supported.
- Years with more than 4 digits are not supported.
- `yearmonth` values with a timezone are not supported.
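
Storing a `yearmonth` as a `date` could be sketched as below, using the dummy first-of-month day discussed in the review thread. The name `yearmonth_to_date` is hypothetical.

```python
import re
from datetime import date

# Exactly four year digits and two month digits; no sign, no timezone.
YEARMONTH_PATTERN = re.compile(r"([0-9]{4})-([0-9]{2})")


def yearmonth_to_date(value: str) -> date:
    """Convert YYYY-MM to a date on the first day of that month."""
    match = YEARMONTH_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"not a supported yearmonth: {value!r}")
    year, month = int(match.group(1)), int(match.group(2))
    # date() raises ValueError itself for an out-of-range month or year 0.
    return date(year, month, 1)
```

With this representation the column sorts chronologically and supports ordinary date operations, at the cost of a day component that carries no information.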

## Duration

A duration of time.

- Expected pattern: `PnYnMnDTnHnMnS`. See the [definition of the Frictionless type](https://datapackage.org/standard/table-schema/#duration) for more information.
- Example: `P1Y2M3DT10H30M45.343S`.

The number of seconds may include decimal digits to arbitrary precision.

**Restrictions:**

- The underlying representation of `duration` is `string`. Sprout does not attempt to parse `duration` values into a data type that is aware of the various time units contained within a `duration` value.
Contributor Author

  • Both Arrow and Parquet have interval types that are quite close matches for Frictionless' duration. However, Arrow's interval is, strangely, not converted to Parquet's interval. It seems like it cannot be written to Parquet at all.
  • I don't think it's possible for us to choose Parquet's interval as a data type directly (using the tools we're using).
  • This leaves us with a type like duration in Polars or Arrow. This has the same name as the Frictionless type, but it's actually just a number, e.g. the number of seconds. There is no unambiguous conversion from Frictionless' duration to a number. E.g., if the duration is 1 month, even the number of days is not obvious.

Member

I think for now we just say we don't support dealing with duration values, and instead suggest making "start" and "end" columns. Or, as you suggest, keep it as a string and note in the documentation that we don't do actual duration data types.

Contributor Author

Yeah, I think that's a good idea:

  • Allow Frictionless' duration, but keep it as a string -- just in case that's useful to someone
  • Suggest representing duration either with "start" and "end" columns or with a simple number, depending on the kind of data they have

Contributor Author

I added this as a tip here

- As a consequence, constraints relying on the numeric comparison of `duration` values are not supported. These constraints are: `minimum`, `maximum`, `exclusiveMinimum`, `exclusiveMaximum`.
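
Since `duration` values stay strings, the most that can be checked is that they follow the `PnYnMnDTnHnMnS` pattern. A minimal sketch of such a check follows; `is_duration` is a hypothetical name, and the regex assumes that at least one component is present and that only the seconds component may carry decimals.

```python
import re

DURATION_PATTERN = re.compile(
    r"P(?=[0-9]|T[0-9])"          # at least one component must follow
    r"(?:[0-9]+Y)?(?:[0-9]+M)?(?:[0-9]+D)?"
    r"(?:T(?=[0-9])"              # a T must be followed by a time component
    r"(?:[0-9]+H)?(?:[0-9]+M)?(?:[0-9]+(?:\.[0-9]+)?S)?)?"
)


def is_duration(value: str) -> bool:
    """Check that a string follows the PnYnMnDTnHnMnS pattern."""
    return DURATION_PATTERN.fullmatch(value) is not None
```

The check says nothing about how long the duration is, which is why comparisons such as `minimum` and `maximum` cannot be built on it.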

## Geopoint

A geographic point.

- Expected format: `LAT, LONG`. The space is optional.
- Examples: `45.50, 90.50`, `45.50,90.50`, `-45.50, -90.50`.
Comment on lines +176 to +177
Member

Maybe this could be a table as well? :)


The underlying representation is an array of two `number`s.

**Restrictions:**

- Other `geopoint` formats are not yet supported.
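
A sketch of parsing the format described above into an array of two numbers. It follows the coordinate order as this section states it, and it assumes each coordinate is a plain decimal with an optional minus sign and that at most one space follows the comma. `parse_geopoint` is a hypothetical name, and no range check is applied.

```python
import re

# Two decimals separated by a comma and an optional single space.
GEOPOINT_PATTERN = re.compile(
    r"(-?[0-9]+(?:\.[0-9]+)?), ?(-?[0-9]+(?:\.[0-9]+)?)"
)


def parse_geopoint(value: str) -> list[float]:
    """Parse a geopoint string into a two-element list of floats."""
    match = GEOPOINT_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"not a supported geopoint: {value!r}")
    return [float(match.group(1)), float(match.group(2))]
```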

## Array

A JSON array. Must be well-formed [JSON](http://json.org/).

The underlying representation is `string`.
Contributor Author

Arrow and Parquet have a JSON (logical) type, but using this doesn't seem to have any effect? It doesn't trigger any kind of JSON validation and saves bad JSON happily.

Member

I think it's totally fine to say we don't do JSON (at least not yet)


## Object

A JSON object. Must be well-formed [JSON](http://json.org/).

The underlying representation is `string`.
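
For both `array` and `object`, a well-formedness check on the stored string might look like the following sketch, built on the standard library's `json` module. The helper names are hypothetical, and rejecting `NaN` and `Infinity` (which `json.loads` accepts by default) is an assumption about what "well-formed" should mean here.

```python
import json


def _reject_constant(name: str):
    # json.loads accepts NaN and Infinity by default; strict JSON does not.
    raise ValueError(f"{name} is not valid JSON")


def _is_json_of_type(value: str, expected: type) -> bool:
    try:
        parsed = json.loads(value, parse_constant=_reject_constant)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(parsed, expected)


def is_json_array(value: str) -> bool:
    """True if the string is well-formed JSON whose top level is an array."""
    return _is_json_of_type(value, list)


def is_json_object(value: str) -> bool:
    """True if the string is well-formed JSON whose top level is an object."""
    return _is_json_of_type(value, dict)
```

The parsed result is discarded; the original string is what gets stored.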

## Geojson

A JSON object compliant with the [GeoJSON](http://geojson.org/) or [TopoJSON](https://github.com/topojson/topojson-specification/blob/master/README.md) specification.

The underlying representation is `string`.

**Restrictions:**

- `geojson` values are treated as plain `object`s. They are not checked against the GeoJSON or TopoJSON specification.

## Any

Unspecified or mixed values.

The underlying representation is `string`.

## List

Not supported.