Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39466: [Go][Parquet] Align Arrow and Parquet Timestamp Instant/Local Semantics #39467

Merged
merged 4 commits into from
Jan 18, 2024

Conversation

joellubi
Copy link
Member

@joellubi joellubi commented Jan 5, 2024

Rationale for this change

Closes: #39466

What changes are included in this PR?

  • Update logic for determining whether an Arrow Timestamp should have isAdjustedToUTC=true on conversion to Parquet.
  • Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format backward-compatibilty rules.
  • Refactor Timestamp serialization methods to reduce duplicated code

Are these changes tested?

Yes,

  • Logical type mapping in existing test updated.
  • New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled.
  • New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps.

Are there any user-facing changes?

Yes, users of pqarrow.FileWriter will produce Parquet files in which the TIMESTAMP type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations.

The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format document. Rust already implements the spec as described and #39489 has been reported due to a mismatch in the handling of convertedTypes in C++.

Copy link

github-actions bot commented Jan 5, 2024

⚠️ GitHub issue #39466 has been automatically assigned in GitHub to PR creator.

@@ -125,7 +125,7 @@ func isDictionaryReadSupported(dt arrow.DataType) bool {
}

func arrowTimestampToLogical(typ *arrow.TimestampType, unit arrow.TimeUnit) schema.LogicalType {
utc := typ.TimeZone == "" || typ.TimeZone == "UTC"
isAdjustedToUTC := typ.TimeZone != ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we also be checking if the timezone is already explicitly "UTC" / "utc" ?

Copy link
Member Author

@joellubi joellubi Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that scenario should have the same logic (isAdjustedToUTC=true) as any non-empty timezone, so in that case it should already be covered by this.

@zeroshade
Copy link
Member

@joellubi This needs to get rebased since the other PR was merged (for the DATE type handling)

@joellubi
Copy link
Member Author

@joellubi This needs to get rebased since the other PR was merged (for the DATE type handling)

Thanks @zeroshade, just rebased

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 10, 2024
@zeroshade zeroshade merged commit 858574d into apache:main Jan 18, 2024
24 checks passed
@zeroshade zeroshade removed the awaiting committer review Awaiting committer review label Jan 18, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 858574d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
…nt/Local Semantics (apache#39467)

### Rationale for this change

Closes: apache#39466 

### What changes are included in this PR?

- Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet.
- Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format [backward-compatibilty](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules.
- Refactor Timestamp serialization methods to reduce duplicated code

### Are these changes tested?

Yes,
- Logical type mapping in existing test updated.
- New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled.
- New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps.

### Are there any user-facing changes?

Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations.

The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and apache#39489 has been reported due to a mismatch in the handling of convertedTypes in C++.

* Closes: apache#39466

Authored-by: Joel Lubinitsky <joel@cherre.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…nt/Local Semantics (apache#39467)

### Rationale for this change

Closes: apache#39466 

### What changes are included in this PR?

- Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet.
- Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format [backward-compatibilty](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules.
- Refactor Timestamp serialization methods to reduce duplicated code

### Are these changes tested?

Yes,
- Logical type mapping in existing test updated.
- New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled.
- New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps.

### Are there any user-facing changes?

Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations.

The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and apache#39489 has been reported due to a mismatch in the handling of convertedTypes in C++.

* Closes: apache#39466

Authored-by: Joel Lubinitsky <joel@cherre.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…nt/Local Semantics (apache#39467)

### Rationale for this change

Closes: apache#39466 

### What changes are included in this PR?

- Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet.
- Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format [backward-compatibilty](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules.
- Refactor Timestamp serialization methods to reduce duplicated code

### Are these changes tested?

Yes,
- Logical type mapping in existing test updated.
- New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled.
- New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps.

### Are there any user-facing changes?

Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations.

The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and apache#39489 has been reported due to a mismatch in the handling of convertedTypes in C++.

* Closes: apache#39466

Authored-by: Joel Lubinitsky <joel@cherre.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
…nt/Local Semantics (apache#39467)

### Rationale for this change

Closes: apache#39466 

### What changes are included in this PR?

- Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet.
- Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format [backward-compatibilty](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules.
- Refactor Timestamp serialization methods to reduce duplicated code

### Are these changes tested?

Yes,
- Logical type mapping in existing test updated.
- New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled.
- New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps.

### Are there any user-facing changes?

Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations.

The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and apache#39489 has been reported due to a mismatch in the handling of convertedTypes in C++.

* Closes: apache#39466

Authored-by: Joel Lubinitsky <joel@cherre.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Go][Parquet] Timestamp conversion from Arrow to Parquet does not match expected timezone semantics
2 participants