[Proposal-1] Support Partitioner Refactoring #845
@RobertIndie: Thanks for your contribution. For this PR, do we need to update docs?
## Motivation

- Incorrect implementation of time partitioner: The current implementation only adds time information to the file path, while both the Simple Partitioner and the Time Partitioner partition messages based on the topic partition. The Time Partitioner is merely a special case of the Simple Partitioner. The existing time partitioner does not actually partition messages based on time.
- Inflexible current partitioner: It is currently hard to implement a correct Time Partitioner on top of the current partitioner interface. The connector first splits messages by topic and only then lets the partitioner generate the file path. As a result, all current partitioner implementations are based on the topic.
- Non-intuitive Partitioner interface: The [existing Partitioner interface](https://github.com/streamnative/pulsar-io-cloud-storage/blob/master/src/main/java/org/apache/pulsar/io/jcloud/partitioner/Partitioner.java) is not user-friendly. It has four methods, but it does not need that many. For instance, the expected behaviors of `encodePartition` and `generatePartitionedPath` overlap. The implementation should be simple and clear.

## High Level Design

A new `Partitioner` interface will be added, with two partitioner implementations: `TopicPartitioner` and `TimePartitioner`. The behavior of these partitioners is as follows:

- **Topic Partitioner**: Messages are partitioned according to the pre-existing partitions in the Pulsar topics. For instance, a message for the topic `public/default/my-topic-partition-0` would be directed to the file `public/default/my-topic-partition-0/xxx.json`, where `xxx` signifies the earliest message offset in this file.
- **Time Partitioner**: Messages are partitioned based on the timestamp at the time of flushing. For the aforementioned message, it would be directed to the file `1703037311.json`, where `1703037311` represents the flush timestamp of the first message in this file.

To ensure backward compatibility, the existing partitioner implementation will be maintained, and current user usage will not be disrupted.

A Proof of Concept (PoC) implementation for this proposal can be found at: #845
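The two behaviors described in the High Level Design can be illustrated with a minimal, self-contained sketch. This is not the connector's actual code: the `Msg` class, the single-method `Partitioner` interface, and the injected clock are all hypothetical stand-ins chosen for illustration, and the real interface in the PR may look different. The returned keys are file paths without the format extension (for example `.json`).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

final class PartitionerSketch {

    /** Hypothetical stand-in for a Pulsar record: topic name plus message offset. */
    static final class Msg {
        final String topic;
        final long offset;

        Msg(String topic, long offset) {
            this.topic = topic;
            this.offset = offset;
        }
    }

    /** Hypothetical interface: map one flushed batch to file paths (extension omitted). */
    interface Partitioner {
        Map<String, List<Msg>> partition(List<Msg> batch);
    }

    /** Groups by topic partition; the file name is the earliest offset in each group. */
    static final class TopicPartitioner implements Partitioner {
        @Override
        public Map<String, List<Msg>> partition(List<Msg> batch) {
            Map<String, List<Msg>> byTopic = new LinkedHashMap<>();
            for (Msg m : batch) {
                byTopic.computeIfAbsent(m.topic, k -> new ArrayList<>()).add(m);
            }
            Map<String, List<Msg>> files = new LinkedHashMap<>();
            for (Map.Entry<String, List<Msg>> e : byTopic.entrySet()) {
                long earliest = Long.MAX_VALUE;
                for (Msg m : e.getValue()) {
                    earliest = Math.min(earliest, m.offset);
                }
                files.put(e.getKey() + "/" + earliest, e.getValue());
            }
            return files;
        }
    }

    /** Ignores the topic; the whole batch goes to one file named by the flush time. */
    static final class TimePartitioner implements Partitioner {
        // Assumption: the clock yields epoch seconds, matching the `1703037311` example.
        private final LongSupplier epochSeconds;

        TimePartitioner(LongSupplier epochSeconds) {
            this.epochSeconds = epochSeconds;
        }

        @Override
        public Map<String, List<Msg>> partition(List<Msg> batch) {
            Map<String, List<Msg>> files = new LinkedHashMap<>();
            if (!batch.isEmpty()) {
                files.put(Long.toString(epochSeconds.getAsLong()), new ArrayList<>(batch));
            }
            return files;
        }
    }
}
```

In this sketch the partitioner receives the whole batch rather than a pre-split, per-topic slice, which is what lets a time-based implementation ignore the topic entirely. Passing the clock in as a `LongSupplier` is only a convenience for testing the sketch.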
@RobertIndie: Thanks for providing doc info!
Left some small comments.
src/main/java/org/apache/pulsar/io/jcloud/BlobStoreAbstractConfig.java
src/main/java/org/apache/pulsar/io/jcloud/sink/BlobStoreAbstractSink.java
### Motivation

Add doc for #845.

After this PR gets approved, I will update the doc for other cloud storage connectors.
Same proposal text as above (#845). (cherry picked from commit 29d9fde)
Proposal: #847

- Incorrect implementation of time partitioner: The current implementation only adds time information to the file path, while both the Simple Partitioner and the Time Partitioner partition messages based on the topic partition. The Time Partitioner is merely a special case of the Simple Partitioner. The existing time partitioner does not actually partition messages based on time.
- Inflexible current partitioner: It is currently hard to implement a correct Time Partitioner on top of the current partitioner interface. The connector first splits messages by topic and only then lets the partitioner generate the file path. As a result, all current partitioner implementations are based on the topic.
- Non-intuitive Partitioner interface: The [existing Partitioner interface](https://github.com/streamnative/pulsar-io-cloud-storage/blob/master/src/main/java/org/apache/pulsar/io/jcloud/partitioner/Partitioner.java) is not user-friendly. It has four methods, but it does not need that many. For instance, the expected behaviors of `encodePartition` and `generatePartitionedPath` overlap. The implementation should be simple and clear.
- Refactor the existing partitioner implementation to improve its intuitiveness and ease of use.
- Introduce a new partitioner interface that includes two partitioners: Topic Partitioner and Time Partitioner.
- Ensure backward compatibility to avoid disrupting current partitioner usage.

(cherry picked from commit a8d9f1d)
### Motivation

In #845, we deprecated `partitionerType` and recommend that users use `partitioner` instead. This change sets the default value of `partitionerType` to `partition` so that users no longer have to configure it. Without a default, the connector raises an exception when `partitionerType` is not configured.

### Modifications

- Add a default value of `partition` for `partitionerType`.
(cherry picked from commit ba24f85)
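The defaulting behavior described for `partitionerType` can be sketched as follows. This is an illustrative sketch only: the `PartitionerTypeDefault` class and its `resolve` helper are hypothetical and do not come from the connector, which may apply the default differently (for example, as a field initializer on its config class).

```java
import java.util.Map;

final class PartitionerTypeDefault {

    // Default named in the change description above.
    static final String DEFAULT_PARTITIONER_TYPE = "partition";

    /**
     * Hypothetical helper: returns the configured `partitionerType`,
     * falling back to the default instead of failing when it is absent or blank.
     */
    static String resolve(Map<String, Object> config) {
        Object value = config.get("partitionerType");
        if (value == null || value.toString().trim().isEmpty()) {
            return DEFAULT_PARTITIONER_TYPE;
        }
        return value.toString();
    }
}
```

With a fallback like this, a sink configuration that only sets the newer `partitioner` option and omits `partitionerType` would still resolve to a usable value.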
This reverts commit a8d9f1d.
Proposal: #847

Motivation

- Incorrect implementation of time partitioner: The current implementation only adds time information to the file path, while both the Simple Partitioner and the Time Partitioner partition messages based on the topic partition. The Time Partitioner is merely a special case of the Simple Partitioner. The existing time partitioner does not actually partition messages based on time.
- Inflexible current partitioner: It is currently hard to implement a correct Time Partitioner on top of the current partitioner interface. The connector first splits messages by topic and only then lets the partitioner generate the file path. As a result, all current partitioner implementations are based on the topic.
- Non-intuitive Partitioner interface: The [existing Partitioner interface](https://github.com/streamnative/pulsar-io-cloud-storage/blob/master/src/main/java/org/apache/pulsar/io/jcloud/partitioner/Partitioner.java) is not user-friendly. It has four methods, but it does not need that many. For instance, the expected behaviors of `encodePartition` and `generatePartitionedPath` overlap. The implementation should be simple and clear.

Modifications
Verifying this change
This change added tests.
Documentation
Check the box below.

Need to update docs?

- `doc-required` (If you need help on updating docs, create a doc issue)
- `no-need-doc` (Please explain why)
- `doc` (If this PR contains doc changes)