-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-39466: [Go][Parquet] Align Arrow and Parquet Timestamp Instant/Local Semantics #39467
Conversation
|
@@ -125,7 +125,7 @@ func isDictionaryReadSupported(dt arrow.DataType) bool { | |||
} | |||
|
|||
func arrowTimestampToLogical(typ *arrow.TimestampType, unit arrow.TimeUnit) schema.LogicalType { | |||
utc := typ.TimeZone == "" || typ.TimeZone == "UTC" | |||
isAdjustedToUTC := typ.TimeZone != "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we also be checking if the timezone is already explicitly "UTC" / "utc" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that scenario should have the same logic (isAdjustedToUTC=true
) as any non-empty timezone, so in that case it should already be covered by this.
@joellubi This needs to get rebased since the other PR was merged (for the DATE type handling) |
Thanks @zeroshade, just rebased |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 858574d. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…nt/Local Semantics (apache#39467) ### Rationale for this change Closes: apache#39466 ### What changes are included in this PR? - Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet. - Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format [backward-compatibilty](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules. - Refactor Timestamp serialization methods to reduce duplicated code ### Are these changes tested? Yes, - Logical type mapping in existing test updated. - New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled. - New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps. ### Are there any user-facing changes? Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations. The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and apache#39489 has been reported due to a mismatch in the handling of convertedTypes in C++. * Closes: apache#39466 Authored-by: Joel Lubinitsky <joel@cherre.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…nt/Local Semantics (apache#39467) ### Rationale for this change Closes: apache#39466 ### What changes are included in this PR? - Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet. - Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format [backward-compatibilty](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules. - Refactor Timestamp serialization methods to reduce duplicated code ### Are these changes tested? Yes, - Logical type mapping in existing test updated. - New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled. - New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps. ### Are there any user-facing changes? Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations. The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and apache#39489 has been reported due to a mismatch in the handling of convertedTypes in C++. * Closes: apache#39466 Authored-by: Joel Lubinitsky <joel@cherre.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…nt/Local Semantics (apache#39467) ### Rationale for this change Closes: apache#39466 ### What changes are included in this PR? - Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet. - Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format [backward-compatibilty](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules. - Refactor Timestamp serialization methods to reduce duplicated code ### Are these changes tested? Yes, - Logical type mapping in existing test updated. - New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled. - New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps. ### Are there any user-facing changes? Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations. The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and apache#39489 has been reported due to a mismatch in the handling of convertedTypes in C++. * Closes: apache#39466 Authored-by: Joel Lubinitsky <joel@cherre.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…nt/Local Semantics (apache#39467) ### Rationale for this change Closes: apache#39466 ### What changes are included in this PR? - Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet. - Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format [backward-compatibilty](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules. - Refactor Timestamp serialization methods to reduce duplicated code ### Are these changes tested? Yes, - Logical type mapping in existing test updated. - New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled. - New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps. ### Are there any user-facing changes? Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations. The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and apache#39489 has been reported due to a mismatch in the handling of convertedTypes in C++. * Closes: apache#39466 Authored-by: Joel Lubinitsky <joel@cherre.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Rationale for this change
Closes: #39466
What changes are included in this PR?
isAdjustedToUTC=true
on conversion to Parquet.Are these changes tested?
Yes,
Are there any user-facing changes?
Yes, users of
pqarrow.FileWriter
will produce Parquet files in which theTIMESTAMP
type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations.The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format document. Rust already implements the spec as described and #39489 has been reported due to a mismatch in the handling of convertedTypes in C++.