
GH-39489: [C++][Parquet] ConvertedType TIME and TIMESTAMP should imply adjustedToUtc #39491

Merged
mapleFU merged 4 commits into apache:main from mapleFU:fixing-parquet-schema-with-time-utc on Jan 10, 2024

Conversation

mapleFU (Member) commented Jan 7, 2024

Rationale for this change

TLDR: When reading a Parquet file with legacy timestamps that was not written by Arrow, isAdjustedToUTC is ignored during the read. And when filtering such a file, filtering does not work.

When casting from a "deprecated" Parquet 1.0 ConvertedType, a timestamp should be forced to adjustedToUtc.

On the Parquet standard side: Parquet has a ConvertedType for legacy timestamps, and this legacy timestamp does not carry an adjustedToUtc flag. So, for forward compatibility, when reading it we need to treat it as adjustedToUtc (a UTC timestamp). See [1] [2].

In our implementation, we use a (LogicalType, PhysicalType) pair; LogicalType is a wrapper for the Parquet Thrift ConvertedType and LogicalType. The code is listed in [3]:

  1. During write, we always use FieldToNode and TimestampLogicalTypeFromArrowTimestamp [4] to generate a TimestampLogicalType, deriving isAdjustedToUtc from ::arrow::TimestampType::timezone() (see the sketch after this list). The Arrow type is also stored as an ARROW:schema extended key-value and written into the Parquet file. During write, we set both the LogicalType and the ConvertedType; arrow_properties.coerce_timestamps_enabled() and arrow_properties.support_deprecated_int96_timestamps() control this behavior.
  2. During read, we first parse the Parquet Thrift and generate a LogicalType [5]. Then FromParquetSchema [6] is called to produce the Arrow type.
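
The gist of the write-side mapping in step 1, as a minimal sketch (the function name follows the linked sources, but the body is illustrative and omits unit coercion and error handling):

  // Sketch: derive isAdjustedToUTC from the Arrow timestamp's timezone.
  // A timezone-aware Arrow timestamp stores UTC-normalized instants;
  // a timezone-naive one stores wall-clock values.
  std::shared_ptr<const LogicalType> TimestampLogicalTypeFromArrowTimestamp(
      const ::arrow::TimestampType& timestamp_type,
      LogicalType::TimeUnit::unit time_unit) {
    const bool utc = !timestamp_type.timezone().empty();
    return LogicalType::Timestamp(/*is_adjusted_to_utc=*/utc, time_unit);
  }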

In [1]-[5], the code works well because LogicalType parsing works well. However, [6] can cause a problem:

  1. If the ARROW:schema extended key-value is provided, nothing is wrong.
  2. Otherwise, the timestamp type is wrong because it ignores isAdjustedToUtc! So when reading a legacy Parquet file not written by Arrow, we do not get the right type.

Furthermore, this can also happen during dataset filtering.

[1] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485
[2] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308
[3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L128-L283
[4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L140
[5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443
[6] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1038
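
To make the failure mode concrete, here is a hypothetical illustration (not a test from this PR; PrimitiveNode::Make and FromParquetSchema are the real entry points, the rest is commentary):

  // A column whose Parquet schema carries only the legacy ConvertedType,
  // with no LogicalType and no ARROW:schema key-value (e.g. old parquet-mr):
  auto node = PrimitiveNode::Make("ts", Repetition::REQUIRED, Type::INT64,
                                  ConvertedType::TIMESTAMP_MICROS);
  // Before this PR, FromParquetSchema mapped this to timestamp(us), silently
  // dropping the UTC semantics implied by the ConvertedType. With this PR it
  // maps to timestamp(us, "UTC"), per the forward-compatibility rule in [1].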

What changes are included in this PR?

ConvertedType TIME and TIMESTAMP should imply adjustedToUtc when casting to Arrow.

Are these changes tested?

Already

Are there any user-facing changes?

Users might see some behavior change when reading legacy timestamps, but most of the time Arrow would cast it.

@mapleFU mapleFU requested a review from wgtmac as a code owner January 7, 2024 08:59
github-actions bot commented Jan 7, 2024

⚠️ GitHub issue #39489 has been automatically assigned in GitHub to PR creator.

jorisvandenbossche (Member) commented:

> Are these changes tested? Already

Given that you change something, but not any tests, that seems to indicate that this isn't actually tested?

> Users might see some behavior change when reading legacy timestamps, but most of the time Arrow would cast it.

Can we try to be more explicit about the exact impact of this change? In which cases does it change behaviour for a user? (I suppose it will change in some cases, otherwise why are we changing it?)

mapleFU (Member, Author) commented Jan 7, 2024

@jorisvandenbossche Sorry for being so short. I forgot to upload the tests previously. I've updated the description in this patch. It only happens when:

  1. reading a file with a legacy type that was not written by Arrow, and
  2. filtering on it.

I'm going to enjoy my life on the weekend; maybe tomorrow I can add a file generated by parquet-mr with a legacy type and test against it.

  @@ -90,7 +90,7 @@ Result<std::shared_ptr<ArrowType>> MakeArrowTime64(const LogicalType& logical_ty
   Result<std::shared_ptr<ArrowType>> MakeArrowTimestamp(const LogicalType& logical_type) {
     const auto& timestamp = checked_cast<const TimestampLogicalType&>(logical_type);
     const bool utc_normalized =
  -      timestamp.is_from_converted_type() ? false : timestamp.is_adjusted_to_utc();
  +      timestamp.is_from_converted_type() ? true : timestamp.is_adjusted_to_utc();
wgtmac (Member) commented Jan 8, 2024

CMIW, the change here is to align with cpp/src/parquet/types.cc, where is_adjusted_to_utc of the logical type is set to true.

mapleFU (Member, Author) commented Jan 8, 2024

Here, every caller except this one sets it to true. See: https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443

By the way, I didn't find Int96 in the doc; should Int96 have is_adjusted_to_utc = true? (The Rust impl sets it to false.)

Member commented:

Why not just

  const bool utc_normalized = timestamp.is_adjusted_to_utc();

Member commented:

I think the following would be equivalent to the updated ternary expression:

  const bool utc_normalized = timestamp.is_from_converted_type() || timestamp.is_adjusted_to_utc();

mapleFU (Member, Author) commented:

I've changed to:

  const bool utc_normalized = timestamp.is_adjusted_to_utc();

I think this is enough.

Member commented:

> Here, every caller except this one sets it to true. See: https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443
>
> By the way, I didn't find Int96 in the doc; should Int96 have is_adjusted_to_utc = true? (The Rust impl sets it to false.)

IMO, INT96 is a physical type and orthogonal to logical/converted type.

mapleFU (Member, Author) commented:

> INT96 is a physical type and orthogonal to logical/converted type.

Yeah, currently when we meet INT96 we blindly cast it to a time-related, !adjustedToUtc value.

mapleFU (Member, Author) commented:

@joellubi:

The Thrift schema is converted at https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443 and then goes to LogicalType::FromConvertedType, which sets adjustedToUtc = true: https://github.com/apache/arrow/blob/main/cpp/src/parquet/types.cc#L347-L358

So I think just timestamp.is_adjusted_to_utc() is OK.
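
For reference, a paraphrase of the relevant cases in LogicalType::FromConvertedType (cpp/src/parquet/types.cc; exact signatures may differ slightly), showing why is_from_converted_type() already comes with is_adjusted_to_utc() == true:

  switch (converted_type) {
    case ConvertedType::TIMESTAMP_MILLIS:
      return TimestampLogicalType::Make(/*is_adjusted_to_utc=*/true,
                                        LogicalType::TimeUnit::MILLIS,
                                        /*is_from_converted_type=*/true);
    case ConvertedType::TIMESTAMP_MICROS:
      return TimestampLogicalType::Make(/*is_adjusted_to_utc=*/true,
                                        LogicalType::TimeUnit::MICROS,
                                        /*is_from_converted_type=*/true);
    // ... other converted types elided ...
  }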

@github-actions github-actions bot added the 'awaiting committer review' label and removed the 'awaiting review' label Jan 8, 2024
mapleFU (Member, Author) commented Jan 8, 2024

@jorisvandenbossche @pitrou @wgtmac Would you mind taking a look?

@mapleFU mapleFU force-pushed the fixing-parquet-schema-with-time-utc branch from 319592a to 4ab6336 Compare January 8, 2024 18:12
@mapleFU mapleFU force-pushed the fixing-parquet-schema-with-time-utc branch from 4ab6336 to dea3fe6 Compare January 9, 2024 05:34
@pitrou pitrou changed the title GH-39489: [C++][Parquet] change ConvertType Thrift Timestamp to adjustUtc GH-39489: [C++][Parquet] ConvertedType TIME and TIMESTAMP should imply adjustedToUtc Jan 9, 2024
@mapleFU mapleFU merged commit c752bdb into apache:main Jan 10, 2024
33 of 34 checks passed
@mapleFU mapleFU removed the 'awaiting committer review' label Jan 10, 2024
jorisvandenbossche (Member) commented:

The top post of the PR is not very explicit about it, so just mentioning it here as well (I posted an example on the issue): we need to be aware this changes the compatibility with legacy files written by Arrow.

The fact that we ignored adjustedToUtc when constructing the Arrow TimestampType from a column that only used Parquet's legacy ConvertedType, was explicitly added as a way to ensure older files written by Arrow would still read correctly (#22303).

Above it speaks about reading legacy files not written by Arrow:

> So when reading a legacy Parquet file not written by Arrow, we do not get the right type.

While this indeed "fixes" reading of non-Arrow legacy files (following the Parquet spec), it does change behaviour also for files written by Arrow. And for those files we knew that the original Arrow data used for writing could have timestamps with or without timezone, and at the time we decided that the safe option was to "refuse to guess", and thus default to not attaching the UTC timezone to the read data.

Nowadays it is maybe OK to make a different trade-off, and to prioritize the Parquet spec / non-Arrow legacy files above compatibility of reading Arrow legacy files, given that it is about files written by an Arrow version of several years old (Arrow < 0.15). But it should be a conscious decision (and labeled as a breaking change?).

mapleFU (Member, Author) commented Jan 10, 2024

> Nowadays it is maybe OK to make a different trade-off, and to prioritize the Parquet spec / non-Arrow legacy files above compatibility of reading Arrow legacy files, given that it is about files written by an Arrow version of several years old (Arrow < 0.15). But it should be a conscious decision (and labeled as a breaking change?).

Should we add a config in ArrowReaderProperties and switch to this implementation in a coming version?
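
Something like the following, where the property name is purely hypothetical (only ArrowReaderProperties itself exists today):

  parquet::ArrowReaderProperties props;
  // Hypothetical knob, not implemented in this PR: opt back into the
  // pre-GH-39491 behavior of reading legacy ConvertedType timestamps
  // as timezone-naive Arrow timestamps.
  // props.set_legacy_converted_type_timestamps_utc(false);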

pitrou (Member) commented Jan 10, 2024

> Nowadays it is maybe OK to make a different trade-off, and to prioritize the Parquet spec / non-Arrow legacy files above compatibility of reading Arrow legacy files, given that it is about files written by an Arrow version of several years old (Arrow < 0.15).

If we really care about this, we should perhaps detect the offending Arrow version, since the Parquet file records the producer name and version.

But we would also need an actual file to test with.
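
For context, the producer string is available from the footer metadata, so such a check would be feasible; a minimal sketch (the version substring below is illustrative, not an exhaustive match):

  // Open the file and inspect the created_by field recorded in the footer.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("legacy.parquet");
  const std::string& created_by = reader->metadata()->created_by();
  // Old Arrow C++ writers identified themselves via the parquet-cpp string,
  // e.g. "parquet-cpp version 1.5.1-SNAPSHOT"; a reader could special-case them.
  const bool maybe_legacy_arrow_writer =
      created_by.find("parquet-cpp version 1.") != std::string::npos;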

@mapleFU mapleFU deleted the fixing-parquet-schema-with-time-utc branch January 11, 2024 02:26

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit c752bdb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

mapleFU added a commit to mapleFU/arrow that referenced this pull request Jan 11, 2024
mapleFU added a commit to mapleFU/arrow that referenced this pull request Jan 11, 2024
mapleFU added a commit to mapleFU/arrow that referenced this pull request Jan 11, 2024
clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
@mapleFU mapleFU added the 'Breaking Change' label Apr 10, 2024
Labels: Breaking Change, Component: C++, Component: Parquet
Participants: 5