docs: 📝 add docs for data types #1098

Open
wants to merge 5 commits into
base: main
207 changes: 207 additions & 0 deletions docs/design/interface/data-types.qmd
@@ -0,0 +1,207 @@
---
title: "Data types"
description: "The data types Sprout supports"
---


Sprout implements the Frictionless Data Package standard and aims to support the [data types](https://datapackage.org/standard/table-schema/#field-types) it defines. However, Sprout not only describes data with its properties (i.e., metadata) but also transforms it into a tidy Parquet file, ready for querying (see [Outputs](/docs/design/interface/outputs.qmd#files) and [Why Parquet](https://decisions.seedcase-project.org/why-parquet/) for more details). As a result, Sprout supports only data types that are compatible (or can be made compatible) with Parquet storage.

Below, we list Frictionless data types as used in Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted.
Member

Suggested change
Below, we list Frictionless data types as used in Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted.
Below, we list Frictionless data types used in Sprout and give a precise definition for each. Any differences from the Frictionless specification are noted.

I’m not sure what is meant by “as used in”? :)

Contributor Author

Like, we discuss/define each Frictionless data type with an emphasis on how it's used in Sprout. So the definitions are not given for Frictionless data types exactly, but for their "Sprout-flavours".


| Frictionless | Polars | Arrow | Parquet |
|--------------|---------------------|------------------------------|------------------------|
| `string` | `String` | `string` | `BYTE_ARRAY` (String) |
| `number` | `Float64` | `float64` | `DOUBLE` |
| `integer` | `Int64` | `int64` | `INT64` |
| `boolean` | `Boolean` | `bool_` | `BOOLEAN` |
| `datetime` | `Datetime` | `timestamp[ms, tz]` | `INT64` (Timestamp) |
| `date` | `Date` | `date32[day]` | `INT32` (Date) |
| `time` | `Time` | `time64[ns]` | `INT64` (Time) |
| `year` | `Int32` | `int32` | `INT32` |
| `yearmonth` | `Date` | `date32[day]` | `INT32` (Date) |
| `duration` | `String` | `string` | `BYTE_ARRAY` (String) |
| `geopoint` | `Array[Float64, 2]` | `fixed_size_list[double, 2]` | `FIXED_LEN_BYTE_ARRAY` |
| `geojson` | `String` | `string` | `BYTE_ARRAY` (String) |
| `object` | `String` | `string` | `BYTE_ARRAY` (String) |
| `array` | `String` | `string` | `BYTE_ARRAY` (String) |
| `any` | `String` | `string` | `BYTE_ARRAY` (String) |

: Mappings of Frictionless data types in Sprout. In the Parquet column, first the [primitive type](https://parquet.apache.org/docs/file-format/types/) is given, then the [logical type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) in brackets, if relevant.

## String

A sequence of UTF-8 encoded characters. Sprout supports all Frictionless Data Package string formats: `default`, `email`, `uri`, `binary`, `uuid`.

## Number

A number with or without a decimal part. Trailing zeros in the decimal part do not convey precision (`2.0` is not distinct from `2.00`). Precision goes up to 16 significant digits.

| Name | Allowed values | Examples |
|----------------------------|-----------------------------------------|----------------------------|
| decimal indicator | `.` | `12.56`, `50.000` |
| leading sign | `+` (default) or `-` | `12.56`, `+5.00`, `-12` |
| leading and trailing zeros | `0` (optional) | `12.56`, `0012`, `12.5600` |
| exponent | `E<sign><decimal digits>` | `5E10`, `12.56E-3` |
| special values | `NaN`, `inf`, `-inf` (case-insensitive) | |

: `number` format options.

The Frictionless Data Package configuration options `decimalChar`, `groupChar` and `bareNumber` are not yet supported by Sprout.
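
The `number` format options in the table above could be checked along these lines. This is a minimal sketch, not Sprout's implementation: the name `parse_number` is hypothetical, and accepting only an uppercase `E` in the exponent is an assumption based on the examples in the table.

```python
import re

# Optional sign, digits, optional decimal part, optional exponent.
# Assumption: only an uppercase `E` is accepted, as in the table's examples.
_NUMBER = re.compile(r"[+-]?\d+(\.\d+)?(E[+-]?\d+)?", re.ASCII)
_SPECIAL = {"nan", "inf", "-inf"}  # matched case-insensitively


def parse_number(value: str) -> float:
    """Convert a `number` string to a float, or raise ValueError."""
    if value.lower() in _SPECIAL or _NUMBER.fullmatch(value):
        return float(value)
    raise ValueError(f"not a valid number: {value!r}")
```

Because the result is a 64-bit float, `parse_number("2.0")` and `parse_number("2.00")` give the same value, and strings with a group character such as `1,000` are rejected.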

## Integer

A whole number with no decimal part.

| Name | Allowed values | Examples |
|---------------|----------------|------------|
| leading sign | no sign or `-` | `5`, `-12` |
| leading zeros | `0` (optional) | `005` |

: `integer` format options.

The Frictionless Data Package configuration options `groupChar` and `bareNumber` are not yet supported by Sprout.
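
A similar sketch for `integer`, again under assumptions: `parse_integer` is a hypothetical name, and the range check follows from the `Int64` mapping in the table at the top of this page rather than from anything stated in this section.

```python
import re

# No sign or a leading `-`, followed by digits (leading zeros allowed).
_INTEGER = re.compile(r"-?\d+", re.ASCII)


def parse_integer(value: str) -> int:
    """Convert an `integer` string to an int, or raise ValueError."""
    if not _INTEGER.fullmatch(value):
        raise ValueError(f"not a valid integer: {value!r}")
    number = int(value)
    # Assumption: values must fit the signed 64-bit range of Int64.
    if not -(2**63) <= number < 2**63:
        raise ValueError(f"integer out of 64-bit range: {value!r}")
    return number
```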

## Boolean

One of two possible values: true or false. Sprout supports the default notation for truth values in Frictionless:
Member

Suggested change
One of two possible values: true or false. Sprout supports the default notation for truth values in Frictionless:
One of two possible values: `true` or `false`. Sprout supports the default notation for truth values in Frictionless:

Contributor Author

I see the idea, but maybe double quotes? Because here I kinda mean the conceptual categories of truthiness and falsy-ness, not any concrete code values.

Member

Ah, I see. Double quotes would be fine for me too 👍


- All values in `["true", "True", "TRUE", "1"]` are interpreted as true.
- All values in `["false", "False", "FALSE", "0"]` are interpreted as false.
Comment on lines +68 to +69
Member

Suggested change
- All values in `["true", "True", "TRUE", "1"]` are interpreted as true.
- All values in `["false", "False", "FALSE", "0"]` are interpreted as false.
- All values in `["true", "True", "TRUE", "1"]` are interpreted as `true`.
- All values in `["false", "False", "FALSE", "0"]` are interpreted as `false`.


Setting custom `trueValues` and `falseValues` is not yet supported by Sprout.
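
The default truth-value notation listed above amounts to two fixed lookup sets. A minimal sketch, with hypothetical names (`TRUE_VALUES`, `FALSE_VALUES`, `parse_boolean`):

```python
# Default Frictionless notations, as listed in this section.
TRUE_VALUES = frozenset({"true", "True", "TRUE", "1"})
FALSE_VALUES = frozenset({"false", "False", "FALSE", "0"})


def parse_boolean(value: str) -> bool:
    """Convert a `boolean` string to a bool, or raise ValueError."""
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"not a valid boolean: {value!r}")
```

The match is exact rather than case-insensitive, so a value such as `tRuE` is rejected.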

## Datetime

A date with a time and optional timezone.

| Name | Allowed values | Examples |
|----------|----------------------------------------------|-----------------------------------------------|
| without milliseconds or timezone | `YYYY-MM-DDTHH:MM:SS` | `2002-10-12T12:04:15`, `0202-10-10T02:30:00` |
| with milliseconds | `YYYY-MM-DDTHH:MM:SS.sss` | `2002-10-12T12:04:15.3`, `0202-10-10T02:30:00.345` |
| with timezone | `YYYY-MM-DDTHH:MM:SS<sign>HH:MM` | `2002-10-12T12:04:15+05:00`, `0202-10-10T02:30:00-01:00` |
| with milliseconds and timezone | `YYYY-MM-DDTHH:MM:SS.sss<sign>HH:MM` | `2002-10-12T12:04:15.3+05:00`, `0202-10-10T02:30:00.345-01:00` |
| with shorthand for UTC | `YYYY-MM-DDTHH:MM:SS(.sss)Z` | `2002-10-12T12:04:15Z`, `0202-10-10T02:30:00.345Z` |

: `datetime` format options.

**Restrictions:**

- Setting a custom `datetime` pattern in the `format` property in `datapackage.json` is not yet supported. The `any` format is not supported.
- Negative `datetime` values are not supported.
- Years with more than 4 digits are not supported.
- Mixing `datetime` values with and without a timezone in one column is not allowed.
Member

Suggested change
- Mixing `datetime` values with and without a timezone in one column is not allowed.
- Mixing `datetime` values with and without a timezone in one column are not allowed.

Contributor Author

Mixing ... is not allowed
Subject is "mixing"

- When `datetime` values in a column have a timezone, they must all have the same timezone.
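
The timezone restrictions above could be checked per column roughly as follows. This is a sketch only: `column_timezone` is a hypothetical name, the pattern checks the shape of each value but not calendar validity, and treating `Z` and `+00:00` as the same timezone is an assumption.

```python
import re
from typing import Optional, Sequence

_DATETIME = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"  # YYYY-MM-DDTHH:MM:SS, 4-digit year
    r"(?:\.\d{1,3})?"                       # optional milliseconds
    r"(?P<tz>Z|[+-]\d{2}:\d{2})?",          # optional timezone
    re.ASCII,
)


def column_timezone(values: Sequence[str]) -> Optional[str]:
    """Return the timezone shared by all values, or None if none has one.

    Raises ValueError for unsupported values or inconsistent timezones.
    """
    timezones = set()
    for value in values:
        match = _DATETIME.fullmatch(value)
        if match is None:
            raise ValueError(f"not a supported datetime: {value!r}")
        tz = match.group("tz")
        # Assumption: `Z` and `+00:00` count as the same timezone.
        timezones.add("+00:00" if tz == "Z" else tz)
    if len(timezones) > 1:
        raise ValueError("column mixes timezones or mixes values with and without one")
    return timezones.pop() if timezones else None
```

Values with a negative year or a year of more than four digits do not match the pattern, so they are rejected together with malformed values.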

## Date

A date without a time.

- Expected format: `YYYY-MM-DD`.
- Example: `2022-12-09`.
Comment on lines +99 to +100
Member

Make this a table like the other data types?

Member

Same for year and yearmonth below


**Restrictions:**

- Setting a custom `date` pattern in the `format` property is not yet supported. The `any` format is not supported.
- Negative `date` values (for example, `-1001-01-01`) are not supported.
Contributor Author

@lwjohnst86 Just a further comment on negative dates:
It seems like whether negative dates are supported depends on the tool used to interact with Parquet files.
So Polars actually supports negative dates, but e.g. PyArrow and fastparquet don't.

In practice this means that you can write a negative date into a Parquet file using Polars, but there are no guarantees that that date will be read back correctly when not using Polars. So, e.g., you write -1001-01-01 with Polars, someone tries to read your Parquet file with PyArrow, and they get an error. Parquet files can be shared and used outside of Sprout, so we don't really know how people will read their data.

So if we expect Parquet files produced by Sprout to be read only with Polars, we can support negative dates. Otherwise we'd better not.

Member

Fascinating! I think for now we can say we don't support negative dates... or that we don't guarantee anything 😛

- Years with more than 4 digits are not supported.
- `date` values with a timezone are not supported.

## Time

A time without a date.

| Name | Allowed values | Examples |
|----------------------|-------------------|---------------------------------|
| without microseconds | `HH:MM:SS` | `12:04:15` |
| with microseconds | `HH:MM:SS.ssssss` | `12:04:15.3`, `02:30:00.345345` |

: `time` format options.

**Restrictions:**

- Setting a custom `time` pattern in the `format` property is not yet supported. The `any` format is not supported.
- `time` values with a timezone are not supported.

## Year

A calendar year without month or day.

- Expected format: `YYYY`. Negative `year` values are allowed.
- Examples: `2022`, `-1000`, `0005`.

Comment on lines +129 to +131
Member

Add table with examples. Then one of the examples would be a negative year as well.

**Restrictions:**

- `year` values with a timezone are not supported.

## Yearmonth

A specific month in a specific year.

- Expected format: `YYYY-MM`.
- Example: `2022-12`.

The underlying representation of `yearmonth` is `date`.
Contributor Author

To save it as a date we can add a dummy day value like ...-01. Then yearmonths will at least be sortable. The other option is to save it as string.

Member

Wait, so we won’t store yearmonth as YYYY-MM but add a day value?
And how come it’s not sortable? 🤔

Contributor Author

We can store it as YYYY-MM but then it has to be a string, because there is no date-based data type for yearmonth in Polars/Parquet. I thiiiink sorting that alphabetically would sort them in the correct chronological order, as long as the number of year digits is fixed and there are no negatives.

The advantage of storing it as a proper date is that we / whoever analyses the data can perform all date operations on the column.

I don't mind which way!

Member
@signekb, Mar 7, 2025

Ahh, thanks for elaborating! I think it makes (the most) sense to add a date (like -01) then :)


**Restrictions:**

- Negative `yearmonth` values are not supported.
- Years with more than 4 digits are not supported.
- `yearmonth` values with a timezone are not supported.
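
The dummy-day approach discussed for `yearmonth` (storing `YYYY-MM` as a `date` by appending `-01`) might look like the sketch below. The function name `yearmonth_to_date` is hypothetical, and the choice of the first day of the month follows the review discussion, not a confirmed implementation.

```python
import datetime
import re

# Exactly four year digits and two month digits, no sign, no timezone.
_YEARMONTH = re.compile(r"(\d{4})-(\d{2})", re.ASCII)


def yearmonth_to_date(value: str) -> datetime.date:
    """Convert a `YYYY-MM` string to the first day of that month."""
    match = _YEARMONTH.fullmatch(value)
    if match is None:
        raise ValueError(f"not a supported yearmonth: {value!r}")
    # datetime.date raises ValueError itself for an invalid month such as 13.
    return datetime.date(int(match.group(1)), int(match.group(2)), 1)
```

Stored this way, the column sorts chronologically and supports ordinary date operations, which is the advantage raised in the discussion over keeping the values as strings.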

## Duration

A duration of time.

- Expected pattern: `PnYnMnDTnHnMnS`. See the [definition of the Frictionless type](https://datapackage.org/standard/table-schema/#duration) for more information.
- Example: `P1Y2M3DT10H30M45.343S`.

The number of seconds may include decimal digits to arbitrary precision.

**Restrictions:**

- The underlying representation of `duration` is `string`. Sprout does not attempt to parse `duration` values into a data type that is aware of the various time units contained within a `duration` value.
Contributor Author

  • Both Arrow and Parquet have interval types that are quite close matches for Frictionless' duration. However, Arrow's interval is, strangely, not converted to Parquet's interval. It seems like it cannot be written to Parquet at all.
  • I don't think it's possible for us to choose Parquet's interval as a data type directly (using the tools we're using).
  • This leaves us with a type like duration in Polars or Arrow. This has the same name as the Frictionless type, but it's actually just a number, e.g. the number of seconds. There is no unambiguous conversion from Frictionless' duration to a number. E.g., if the duration is 1 month, even the number of days is not obvious.

Member

I think for now, we just say, we don't support dealing with duration values. And instead suggest making a "start" and "end" columns. Or as you suggest, keep as a string and have it in the documentation that we don't do actual duration data types.

Contributor Author

Yeah, I think that's a good idea:

  • Allow Frictionless' duration, but keep it as a string -- just in case that's useful to someone
  • Suggest representing duration either with "start" and "end" columns or with a simple number, depending on the kind of data they have

Contributor Author

I added this as a tip here

- As a consequence, constraints relying on the numeric comparison of `duration` values are not supported. These constraints are: `minimum`, `maximum`, `exclusiveMinimum`, `exclusiveMaximum`.

::: callout-tip
If you are working with duration or interval values, you could consider converting them to a form that Sprout can parse and compare numerically. Here are some suggestions:

- If your intervals have start and end points, you could represent them using two `date`, `time` or `datetime` columns. For example, a column called `<column_name>_start` for the beginning of the interval and a column called `<column_name>_end` for the end of the interval.
- If it doesn't make sense to represent your duration values as intervals between start and end points, you could represent them as plain `integer`s or `number`s. For example, by calculating the number of days, hours, seconds, milliseconds, etc. (depending on the level of precision you need) that make up your durations.
:::
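
As an illustration of the two suggestions in the tip, the sketch below keeps an interval as a start and an end column and also derives a plain number of seconds from them. The column names (`visit_start`, `visit_end`, `visit_seconds`) and the function name are made up for the example.

```python
from datetime import datetime


def duration_in_seconds(start: str, end: str) -> float:
    """Seconds elapsed between two `datetime` strings without a timezone."""
    elapsed = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return elapsed.total_seconds()


# One row with an interval stored as two `datetime` columns.
row = {"visit_start": "2024-03-01T08:00:00", "visit_end": "2024-03-01T10:30:45"}

# The same interval as a plain `number` column, which can be compared numerically.
row["visit_seconds"] = duration_in_seconds(row["visit_start"], row["visit_end"])
```

Unlike a value such as `P1M`, whose length in days depends on the month, the number of seconds between two fixed points in time is unambiguous.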

## Geopoint

A geographic point.

- Expected format: `LAT, LONG`. The space is optional.
- Examples: `45.50, 90.50`, `45.50,90.50`, `-45.50, -90.50`.
Comment on lines +176 to +177
Member

Maybe this could be a table as well? :)


The underlying representation is an array of two `number`s.

**Restrictions:**

- Other `geopoint` formats are not yet supported.
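
Parsing the `LAT, LONG` format into an array of two numbers could be sketched as below. The name `parse_geopoint` is hypothetical, and the sketch is more lenient than the documented format in one respect: it relies on `float`, which also accepts forms such as exponents.

```python
from typing import List


def parse_geopoint(value: str) -> List[float]:
    """Convert a `LAT, LONG` string to a two-element list of floats."""
    parts = value.split(",")
    if len(parts) != 2:
        raise ValueError(f"not a supported geopoint: {value!r}")
    # strip() makes the space after the comma optional;
    # float() raises ValueError for non-numeric parts.
    latitude, longitude = (float(part.strip()) for part in parts)
    return [latitude, longitude]
```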

## Array

A JSON array. Must be well-formed [JSON](http://json.org/). The underlying representation is `string`.

## Object

A JSON object. Must be well-formed [JSON](http://json.org/). The underlying representation is `string`.
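
Since `array` and `object` values are kept as strings, a check that a value is well-formed JSON of the expected kind could look like this sketch. The name `check_json_value` is hypothetical.

```python
import json


def check_json_value(value: str, expected: type) -> str:
    """Return `value` unchanged if it is well-formed JSON of the expected type.

    Use `list` for the `array` type and `dict` for the `object` type.
    """
    parsed = json.loads(value)  # raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(parsed, expected):
        raise ValueError(f"expected a JSON {expected.__name__}: {value!r}")
    return value
```

The value is validated but returned as the original string, matching the `string` representation described for these types.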

## Geojson

A JSON object compliant with the [GeoJSON](http://geojson.org/) or [TopoJSON](https://github.com/topojson/topojson-specification/blob/master/README.md) specification. The underlying representation is `string`.

**Restrictions:**

- `geojson` values are treated as plain `object`s. They are not checked against the GeoJSON or TopoJSON specification.

## Any

Unspecified or mixed values. The underlying representation is `string`.

## List

Not supported.