-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-39489: [C++][Parquet] ConvertedType TIME and TIMESTAMP should imply adjustedToUtc #39491
GH-39489: [C++][Parquet] ConvertedType TIME and TIMESTAMP should imply adjustedToUtc #39491
Conversation
|
Given that you change something, but not any tests, that seems to indicate that this isn't actually tested?
Can we try to be more explicit about the exact impact of this change? In which cases does it change behaviour for a user? (I suppose it will change in some cases, otherwise why are we changing it?) |
@jorisvandenbossche Sorry for being so short. I forgot to upload testing previously. I upload the description in this patch. It only happens when:
I'm going to enjoy my life on weekends, maybe I can update a file generate by parquet-mr with legacy type and testing it tomorrow. |
@@ -90,7 +90,7 @@ Result<std::shared_ptr<ArrowType>> MakeArrowTime64(const LogicalType& logical_ty | |||
Result<std::shared_ptr<ArrowType>> MakeArrowTimestamp(const LogicalType& logical_type) { | |||
const auto& timestamp = checked_cast<const TimestampLogicalType&>(logical_type); | |||
const bool utc_normalized = | |||
timestamp.is_from_converted_type() ? false : timestamp.is_adjusted_to_utc(); | |||
timestamp.is_from_converted_type() ? true : timestamp.is_adjusted_to_utc(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMIW, the change here is to align with cpp/src/parquet/types.cc where is_adjusted_to_utc of logical type is set to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, every caller except here is set to true. See: https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443
By the way, I didn't find Int96 in the doc, would Int96 should having is_adjusted_to_utc = true
? (Rust impl set it false)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just
const bool utc_normalized = timestamp.is_adjusted_to_utc();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the following would be equivalent to the updated ternary expression:
const bool utc_normalized = timestamp.is_from_converted_type() || timestamp.is_adjusted_to_utc();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed to:
const bool utc_normalized = timestamp.is_adjusted_to_utc();
I think this is enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, every caller except here is set to true. See: https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443
By the way, I didn't find Int96 in the doc, would Int96 should having
is_adjusted_to_utc = true
? (Rust impl set it false)
IMO, INT96 is a physical type and orthogonal to logical/converted type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
INT96 is a physical type and orthogonal to logical/converted type.
Yeah, currently when we meet int96, we will bindly casting it to time-related, !adjustedToUtc
value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443 The thrift schema would be casted here, and then goes to LogicalType::FromConvertedType
. It would set adjustedToUtc = true
: https://github.com/apache/arrow/blob/main/cpp/src/parquet/types.cc#L347-L358
So I think just timestamp.is_adjusted_to_utc()
is ok
@jorisvandenbossche @pitrou @wgtmac Would you mind take a look? |
319592a
to
4ab6336
Compare
4ab6336
to
dea3fe6
Compare
The top post of the PR is not very explicit about it, so just mentioning it here as well (I posted an example on the issue): we need to be aware this changes the compatibility with legacy files written by Arrow. The fact that we ignored Above it speaks about reading legacy files not written by Arrow:
While this indeed "fixes" reading of non-Arrow legacy files (following the Parquet spec), it does change behaviour also for files written by Arrow. And for those files we knew that the original Arrow data used for writing could have timestamps with or without timezone, and at the time we decided that the safe option was to "refuse to guess", and thus default to not attaching the UTC timezone to the read data. Nowadays it is maybe OK to make a different trade-off, and to prioritize the Parquet spec / non-Arrow legacy files above compatibility of reading Arrow legacy files, given that it is about files written by an Arrow version of several years old (Arrow < 0.15). But it should be a conscious decision (and labeled as a breaking change?). |
I think should we add a config in |
If we really care about this, we should perhaps detect the offending Arrow version, since the Parquet file records the producer name and version. But we would also need an actual file to test with. |
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit c752bdb. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
…behavior to doc
…behavior to doc
…behavior to doc
…d imply adjustedToUtc (apache#39491) ### Rationale for this change TLDR: When reading from a Parquet file with legacy timestamp not written by arrow, `isAdjustedToUTC` would be ignored during read. And when filtering a file like this, filtering would not work. When casting from a "deprecated" parquet 1.0 ConvertedType, a timestamp should be force adjustedToUtc. For the parquet standard part. Parquet has a ConvertedType for legacy timestamp, the legacy timestamp **do not** having a `adjustedToUtc` flag. So, for forward compatible, when reading it we need to regard it as `adjustedToUtc` ( A UTC Timestamp). See [1] [2]. In our implementation, we use a `(LogicalType, PhysicalType)` pair, `LogicalType` it's a wrapper for parquet thrift `ConvertedType` and `LogicalType`, the code is listed in [3] : 1. During write, we always use `FieldToNode` and `TimestampLogicalTypeFromArrowTimestamp` [4] to generate a `TimestampLogicalType`, and generate `isAdjustedUtc` by `::arrow::TimestampType::timezone()`. Also, the arrow type will be a `ARROW:schema` extended key-value, and written in parquet file. During write, we'll both set logicaltype and convertType. `arrow_properties.coerce_timestamps_enabled()` and `arrow_properties.support_deprecated_int96_timestamps()` control the behavior. 2. During read, we'll first parse parquet thrift and generate a `LogicalType`[5]. And then, `FromParquetSchema`[6] would be called for arrow type. in 1-5, code works well, because LogicalType parsing works well. However, [6] might causing some problem: 1. If `ARROW:schema` extended is provided, nothing is wrong 2. **Otherwise, timestamp would be wrong because it ignore isAdjustedUtc!**. So when it read legacy Parquet file not written by arrow, it would not have right type. Further more, this could happening when dataset filtering happens. [1] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485 [2] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308 [3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L128-L283 [4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L140 [5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443 [6] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1038 ### What changes are included in this PR? ConvertedType TIME and TIMESTAMP should imply adjustedToUtc when casting to arrow ### Are these changes tested? Already ### Are there any user-facing changes? User might got some behavior change during read the legacy timestamp. But most of time, arrow would casting it. * Closes: apache#39489 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
…d imply adjustedToUtc (apache#39491) ### Rationale for this change TLDR: When reading from a Parquet file with legacy timestamp not written by arrow, `isAdjustedToUTC` would be ignored during read. And when filtering a file like this, filtering would not work. When casting from a "deprecated" parquet 1.0 ConvertedType, a timestamp should be force adjustedToUtc. For the parquet standard part. Parquet has a ConvertedType for legacy timestamp, the legacy timestamp **do not** having a `adjustedToUtc` flag. So, for forward compatible, when reading it we need to regard it as `adjustedToUtc` ( A UTC Timestamp). See [1] [2]. In our implementation, we use a `(LogicalType, PhysicalType)` pair, `LogicalType` it's a wrapper for parquet thrift `ConvertedType` and `LogicalType`, the code is listed in [3] : 1. During write, we always use `FieldToNode` and `TimestampLogicalTypeFromArrowTimestamp` [4] to generate a `TimestampLogicalType`, and generate `isAdjustedUtc` by `::arrow::TimestampType::timezone()`. Also, the arrow type will be a `ARROW:schema` extended key-value, and written in parquet file. During write, we'll both set logicaltype and convertType. `arrow_properties.coerce_timestamps_enabled()` and `arrow_properties.support_deprecated_int96_timestamps()` control the behavior. 2. During read, we'll first parse parquet thrift and generate a `LogicalType`[5]. And then, `FromParquetSchema`[6] would be called for arrow type. in 1-5, code works well, because LogicalType parsing works well. However, [6] might causing some problem: 1. If `ARROW:schema` extended is provided, nothing is wrong 2. **Otherwise, timestamp would be wrong because it ignore isAdjustedUtc!**. So when it read legacy Parquet file not written by arrow, it would not have right type. Further more, this could happening when dataset filtering happens. [1] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485 [2] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308 [3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L128-L283 [4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L140 [5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443 [6] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1038 ### What changes are included in this PR? ConvertedType TIME and TIMESTAMP should imply adjustedToUtc when casting to arrow ### Are these changes tested? Already ### Are there any user-facing changes? User might got some behavior change during read the legacy timestamp. But most of time, arrow would casting it. * Closes: apache#39489 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
…d imply adjustedToUtc (apache#39491) ### Rationale for this change TLDR: When reading from a Parquet file with legacy timestamp not written by arrow, `isAdjustedToUTC` would be ignored during read. And when filtering a file like this, filtering would not work. When casting from a "deprecated" parquet 1.0 ConvertedType, a timestamp should be force adjustedToUtc. For the parquet standard part. Parquet has a ConvertedType for legacy timestamp, the legacy timestamp **do not** having a `adjustedToUtc` flag. So, for forward compatible, when reading it we need to regard it as `adjustedToUtc` ( A UTC Timestamp). See [1] [2]. In our implementation, we use a `(LogicalType, PhysicalType)` pair, `LogicalType` it's a wrapper for parquet thrift `ConvertedType` and `LogicalType`, the code is listed in [3] : 1. During write, we always use `FieldToNode` and `TimestampLogicalTypeFromArrowTimestamp` [4] to generate a `TimestampLogicalType`, and generate `isAdjustedUtc` by `::arrow::TimestampType::timezone()`. Also, the arrow type will be a `ARROW:schema` extended key-value, and written in parquet file. During write, we'll both set logicaltype and convertType. `arrow_properties.coerce_timestamps_enabled()` and `arrow_properties.support_deprecated_int96_timestamps()` control the behavior. 2. During read, we'll first parse parquet thrift and generate a `LogicalType`[5]. And then, `FromParquetSchema`[6] would be called for arrow type. in 1-5, code works well, because LogicalType parsing works well. However, [6] might causing some problem: 1. If `ARROW:schema` extended is provided, nothing is wrong 2. **Otherwise, timestamp would be wrong because it ignore isAdjustedUtc!**. So when it read legacy Parquet file not written by arrow, it would not have right type. Further more, this could happening when dataset filtering happens. [1] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485 [2] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308 [3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L128-L283 [4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L140 [5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443 [6] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1038 ### What changes are included in this PR? ConvertedType TIME and TIMESTAMP should imply adjustedToUtc when casting to arrow ### Are these changes tested? Already ### Are there any user-facing changes? User might got some behavior change during read the legacy timestamp. But most of time, arrow would casting it. * Closes: apache#39489 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
Rationale for this change
TLDR: When reading from a Parquet file with legacy timestamp not written by arrow,
isAdjustedToUTC
would be ignored during read. And when filtering a file like this, filtering would not work.When casting from a "deprecated" parquet 1.0 ConvertedType, a timestamp should be force adjustedToUtc.
For the parquet standard part. Parquet has a ConvertedType for legacy timestamp, the legacy timestamp do not having a
adjustedToUtc
flag. So, for forward compatible, when reading it we need to regard it asadjustedToUtc
( A UTC Timestamp). See [1] [2].In our implementation, we use a
(LogicalType, PhysicalType)
pair,LogicalType
it's a wrapper for parquet thriftConvertedType
andLogicalType
, the code is listed in [3] :FieldToNode
andTimestampLogicalTypeFromArrowTimestamp
[4] to generate aTimestampLogicalType
, and generateisAdjustedUtc
by::arrow::TimestampType::timezone()
. Also, the arrow type will be aARROW:schema
extended key-value, and written in parquet file. During write, we'll both set logicaltype and convertType.arrow_properties.coerce_timestamps_enabled()
andarrow_properties.support_deprecated_int96_timestamps()
control the behavior.LogicalType
[5]. And then,FromParquetSchema
[6] would be called for arrow type.in 1-5, code works well, because LogicalType parsing works well. However, [6] might causing some problem:
ARROW:schema
extended is provided, nothing is wrongFurther more, this could happening when dataset filtering happens.
[1] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485
[2] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308
[3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L128-L283
[4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L140
[5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443
[6] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1038
What changes are included in this PR?
ConvertedType TIME and TIMESTAMP should imply adjustedToUtc when casting to arrow
Are these changes tested?
Already
Are there any user-facing changes?
User might got some behavior change during read the legacy timestamp. But most of time, arrow would casting it.