
GH-39489: [C++][Parquet] ConvertedType TIME and TIMESTAMP should imply adjustedToUtc #39491

Merged
mapleFU merged 4 commits into apache:main from mapleFU:fixing-parquet-schema-with-time-utc on Jan 10, 2024

Conversation

mapleFU (Member) commented Jan 7, 2024

Rationale for this change

TLDR: When reading a Parquet file with legacy timestamps that was not written by Arrow, isAdjustedToUTC is ignored during the read. And when filtering such a file, filtering does not work.

When casting from a "deprecated" Parquet 1.0 ConvertedType, a timestamp should be forced to adjustedToUtc.

On the Parquet standard side: Parquet has a ConvertedType for legacy timestamps, and this legacy timestamp does not carry an adjustedToUtc flag. So, for forward compatibility, when reading it we need to treat it as adjustedToUtc (a UTC timestamp). See [1] [2].

In our implementation, we use a (LogicalType, PhysicalType) pair; LogicalType is a wrapper for the Parquet Thrift ConvertedType and LogicalType. The code is listed in [3]:

  1. During write, we always use FieldToNode and TimestampLogicalTypeFromArrowTimestamp [4] to generate a TimestampLogicalType, deriving isAdjustedToUtc from ::arrow::TimestampType::timezone() (see the sketch after this list). The Arrow type is also stored as an ARROW:schema extended key-value and written into the Parquet file. During write, we set both the LogicalType and the ConvertedType; arrow_properties.coerce_timestamps_enabled() and arrow_properties.support_deprecated_int96_timestamps() control this behavior.
  2. During read, we first parse the Parquet Thrift and generate a LogicalType [5]. Then FromParquetSchema [6] is called to produce the Arrow type.
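
The gist of the write-side mapping in step 1, as a minimal sketch (the function name follows the linked sources, but the body is illustrative and omits unit coercion and error handling):

  // Sketch: derive isAdjustedToUTC from the Arrow timestamp's timezone.
  // A timezone-aware Arrow timestamp stores UTC-normalized instants;
  // a timezone-naive one stores wall-clock values.
  std::shared_ptr<const LogicalType> TimestampLogicalTypeFromArrowTimestamp(
      const ::arrow::TimestampType& timestamp_type,
      LogicalType::TimeUnit::unit time_unit) {
    const bool utc = !timestamp_type.timezone().empty();
    return LogicalType::Timestamp(/*is_adjusted_to_utc=*/utc, time_unit);
  }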

In [1]-[5], the code works well because LogicalType parsing works well. However, [6] can cause a problem:

  1. If the ARROW:schema extended key-value is provided, nothing is wrong.
  2. Otherwise, the timestamp type is wrong because it ignores isAdjustedToUtc! So when reading a legacy Parquet file not written by Arrow, we do not get the right type.

Furthermore, this can also happen during dataset filtering.

[1] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485
[2] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308
[3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L128-L283
[4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L140
[5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443
[6] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1038
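
To make the failure mode concrete, here is a hypothetical illustration (not a test from this PR; PrimitiveNode::Make and FromParquetSchema are the real entry points, the rest is commentary):

  // A column whose Parquet schema carries only the legacy ConvertedType,
  // with no LogicalType and no ARROW:schema key-value (e.g. old parquet-mr):
  auto node = PrimitiveNode::Make("ts", Repetition::REQUIRED, Type::INT64,
                                  ConvertedType::TIMESTAMP_MICROS);
  // Before this PR, FromParquetSchema mapped this to timestamp(us), silently
  // dropping the UTC semantics implied by the ConvertedType. With this PR it
  // maps to timestamp(us, "UTC"), per the forward-compatibility rule in [1].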

What changes are included in this PR?

ConvertedType TIME and TIMESTAMP should imply adjustedToUtc when casting to Arrow.

Are these changes tested?

Already

Are there any user-facing changes?

Users might see some behavior change when reading legacy timestamps, but most of the time Arrow would cast it.

@mapleFU mapleFU requested a review from wgtmac as a code owner January 7, 2024 08:59
github-actions bot commented Jan 7, 2024

⚠️ GitHub issue #39489 has been automatically assigned in GitHub to PR creator.

jorisvandenbossche (Member) commented:

> Are these changes tested? Already

Given that you change something, but not any tests, that seems to indicate that this isn't actually tested?

> Users might see some behavior change when reading legacy timestamps, but most of the time Arrow would cast it.

Can we try to be more explicit about the exact impact of this change? In which cases does it change behaviour for a user? (I suppose it will change in some cases, otherwise why are we changing it?)

mapleFU (Member, Author) commented Jan 7, 2024

@jorisvandenbossche Sorry for being so short. I forgot to upload the tests previously. I've updated the description in this patch. It only happens when:

  1. reading a file with a legacy type that was not written by Arrow, and
  2. filtering on it.

I'm going to enjoy my life on the weekend; maybe tomorrow I can add a file generated by parquet-mr with a legacy type and test against it.

  @@ -90,7 +90,7 @@ Result<std::shared_ptr<ArrowType>> MakeArrowTime64(const LogicalType& logical_ty
   Result<std::shared_ptr<ArrowType>> MakeArrowTimestamp(const LogicalType& logical_type) {
     const auto& timestamp = checked_cast<const TimestampLogicalType&>(logical_type);
     const bool utc_normalized =
  -      timestamp.is_from_converted_type() ? false : timestamp.is_adjusted_to_utc();
  +      timestamp.is_from_converted_type() ? true : timestamp.is_adjusted_to_utc();
wgtmac (Member) commented Jan 8, 2024

CMIW, the change here is to align with cpp/src/parquet/types.cc, where is_adjusted_to_utc of the logical type is set to true.

mapleFU (Member, Author) commented Jan 8, 2024

Here, every caller except this one sets it to true. See: https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443

By the way, I didn't find Int96 in the doc; should Int96 have is_adjusted_to_utc = true? (The Rust impl sets it to false.)

Member commented:

Why not just

  const bool utc_normalized = timestamp.is_adjusted_to_utc();

Member commented:

I think the following would be equivalent to the updated ternary expression:

  const bool utc_normalized = timestamp.is_from_converted_type() || timestamp.is_adjusted_to_utc();

mapleFU (Member, Author) commented:

I've changed to:

  const bool utc_normalized = timestamp.is_adjusted_to_utc();

I think this is enough.

Member commented:

> Here, every caller except this one sets it to true. See: https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443
>
> By the way, I didn't find Int96 in the doc; should Int96 have is_adjusted_to_utc = true? (The Rust impl sets it to false.)

IMO, INT96 is a physical type and orthogonal to logical/converted type.

mapleFU (Member, Author) commented:

> INT96 is a physical type and orthogonal to logical/converted type.

Yeah, currently when we meet INT96 we blindly cast it to a time-related, !adjustedToUtc value.

mapleFU (Member, Author) commented:

@joellubi:

The Thrift schema is converted at https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443 and then goes to LogicalType::FromConvertedType, which sets adjustedToUtc = true: https://github.com/apache/arrow/blob/main/cpp/src/parquet/types.cc#L347-L358

So I think just timestamp.is_adjusted_to_utc() is OK.
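
For reference, a paraphrase of the relevant cases in LogicalType::FromConvertedType (cpp/src/parquet/types.cc; exact signatures may differ slightly), showing why is_from_converted_type() already comes with is_adjusted_to_utc() == true:

  switch (converted_type) {
    case ConvertedType::TIMESTAMP_MILLIS:
      return TimestampLogicalType::Make(/*is_adjusted_to_utc=*/true,
                                        LogicalType::TimeUnit::MILLIS,
                                        /*is_from_converted_type=*/true);
    case ConvertedType::TIMESTAMP_MICROS:
      return TimestampLogicalType::Make(/*is_adjusted_to_utc=*/true,
                                        LogicalType::TimeUnit::MICROS,
                                        /*is_from_converted_type=*/true);
    // ... other converted types elided ...
  }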

@github-actions github-actions bot added the 'awaiting committer review' label and removed the 'awaiting review' label Jan 8, 2024
mapleFU (Member, Author) commented Jan 8, 2024

@jorisvandenbossche @pitrou @wgtmac Would you mind taking a look?

@mapleFU mapleFU force-pushed the fixing-parquet-schema-with-time-utc branch from 319592a to 4ab6336 Compare January 8, 2024 18:12
@mapleFU mapleFU force-pushed the fixing-parquet-schema-with-time-utc branch from 4ab6336 to dea3fe6 Compare January 9, 2024 05:34
@pitrou pitrou changed the title GH-39489: [C++][Parquet] change ConvertType Thrift Timestamp to adjustUtc GH-39489: [C++][Parquet] ConvertedType TIME and TIMESTAMP should imply adjustedToUtc Jan 9, 2024
@mapleFU mapleFU merged commit c752bdb into apache:main Jan 10, 2024
33 of 34 checks passed
@mapleFU mapleFU removed the 'awaiting committer review' label Jan 10, 2024
jorisvandenbossche (Member) commented:

The top post of the PR is not very explicit about it, so just mentioning it here as well (I posted an example on the issue): we need to be aware this changes the compatibility with legacy files written by Arrow.

The fact that we ignored adjustedToUtc when constructing the Arrow TimestampType from a column that only used Parquet's legacy ConvertedType, was explicitly added as a way to ensure older files written by Arrow would still read correctly (#22303).

Above it speaks about reading legacy files not written by Arrow:

> So when reading a legacy Parquet file not written by Arrow, we do not get the right type.

While this indeed "fixes" reading of non-Arrow legacy files (following the Parquet spec), it does change behaviour also for files written by Arrow. And for those files we knew that the original Arrow data used for writing could have timestamps with or without timezone, and at the time we decided that the safe option was to "refuse to guess", and thus default to not attaching the UTC timezone to the read data.

Nowadays it is maybe OK to make a different trade-off, and to prioritize the Parquet spec / non-Arrow legacy files above compatibility of reading Arrow legacy files, given that it is about files written by an Arrow version of several years old (Arrow < 0.15). But it should be a conscious decision (and labeled as a breaking change?).

mapleFU (Member, Author) commented Jan 10, 2024

> Nowadays it is maybe OK to make a different trade-off, and to prioritize the Parquet spec / non-Arrow legacy files above compatibility of reading Arrow legacy files, given that it is about files written by an Arrow version of several years old (Arrow < 0.15). But it should be a conscious decision (and labeled as a breaking change?).

Should we add a config in ArrowReaderProperties and switch to this implementation in a coming version?
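
Something like the following, where the property name is purely hypothetical (only ArrowReaderProperties itself exists today):

  parquet::ArrowReaderProperties props;
  // Hypothetical knob, not implemented in this PR: opt back into the
  // pre-GH-39491 behavior of reading legacy ConvertedType timestamps
  // as timezone-naive Arrow timestamps.
  // props.set_legacy_converted_type_timestamps_utc(false);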

pitrou (Member) commented Jan 10, 2024

> Nowadays it is maybe OK to make a different trade-off, and to prioritize the Parquet spec / non-Arrow legacy files above compatibility of reading Arrow legacy files, given that it is about files written by an Arrow version of several years old (Arrow < 0.15).

If we really care about this, we should perhaps detect the offending Arrow version, since the Parquet file records the producer name and version.

But we would also need an actual file to test with.
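
For context, the producer string is available from the footer metadata, so such a check would be feasible; a minimal sketch (the version substring below is illustrative, not an exhaustive match):

  // Open the file and inspect the created_by field recorded in the footer.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("legacy.parquet");
  const std::string& created_by = reader->metadata()->created_by();
  // Old Arrow C++ writers identified themselves via the parquet-cpp string,
  // e.g. "parquet-cpp version 1.5.1-SNAPSHOT"; a reader could special-case them.
  const bool maybe_legacy_arrow_writer =
      created_by.find("parquet-cpp version 1.") != std::string::npos;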

@mapleFU mapleFU deleted the fixing-parquet-schema-with-time-utc branch January 11, 2024 02:26

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit c752bdb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

mapleFU added a commit to mapleFU/arrow that referenced this pull request Jan 11, 2024
mapleFU added a commit to mapleFU/arrow that referenced this pull request Jan 11, 2024
mapleFU added a commit to mapleFU/arrow that referenced this pull request Jan 11, 2024
clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
@mapleFU mapleFU added the 'Breaking Change' label Apr 10, 2024
Labels: Breaking Change, Component: C++, Component: Parquet
Participants: 5