generated from streamnative/pulsar-io-template
Proposal-1: Partitioner Refactoring #847
Merged
shibd
previously approved these changes
Dec 20, 2023
LGTM, A small question.
shibd
approved these changes
Dec 20, 2023
4 tasks
RobertIndie
added a commit
that referenced
this pull request
Jan 9, 2024
Proposal: #847

### Motivation

- Incorrect implementation of time partitioner: The current implementation only adds time information to the file path, while both the Simple Partitioner and Time Partitioner partition messages based on the topic partition. The Time Partitioner is merely a special case of the Simple Partitioner. The existing time partitioner does not actually partition messages based on time.
- Inflexible current partitioner: It is currently hard to implement a correct Time Partitioner on top of the current partitioner interface. The connector first splits messages by topic, then lets the partitioner generate the file path. As a result, all current partitioner implementations are based on the topic.
- Non-intuitive Partitioner interface: The [existing Partitioner interface](https://github.com/streamnative/pulsar-io-cloud-storage/blob/master/src/main/java/org/apache/pulsar/io/jcloud/partitioner/Partitioner.java) is not user-friendly. It has four methods but does not need that many. For instance, the expected behaviors of `encodePartition` and `generatePartitionedPath` overlap. The interface should be simple and clear.

### Modifications

- Refactor the existing partitioner implementation to improve its intuitiveness and ease of use.
- Introduce a new partitioner interface that includes two partitioners: Topic Partitioner and Time Partitioner.
- Ensure backward compatibility to avoid disrupting current partitioner usage.
RobertIndie
added a commit
that referenced
this pull request
Jan 10, 2024
## Motivation

- Incorrect implementation of time partitioner: The current implementation only adds time information to the file path, while both the Simple Partitioner and Time Partitioner partition messages based on the topic partition. The Time Partitioner is merely a special case of the Simple Partitioner. The existing time partitioner does not actually partition messages based on time.
- Inflexible current partitioner: It is currently hard to implement a correct Time Partitioner on top of the current partitioner interface. The connector first splits messages by topic, then lets the partitioner generate the file path. As a result, all current partitioner implementations are based on the topic.
- Non-intuitive Partitioner interface: The [existing Partitioner interface](https://github.com/streamnative/pulsar-io-cloud-storage/blob/master/src/main/java/org/apache/pulsar/io/jcloud/partitioner/Partitioner.java) is not user-friendly. It has four methods but does not need that many. For instance, the expected behaviors of `encodePartition` and `generatePartitionedPath` overlap. The interface should be simple and clear.

## High Level Design

A new `Partitioner` interface will be added, with two partitioner implementations: `TopicPartitioner` and `TimePartitioner`. The behavior of these partitioners is as follows:

- **Topic Partitioner**: Messages are partitioned according to the pre-existing partitions in the Pulsar topics. For instance, a message for the topic `public/default/my-topic-partition-0` would be directed to the file `public/default/my-topic-partition-0/xxx.json`, where `xxx` is the earliest message offset in this file.
- **Time Partitioner**: Messages are partitioned based on the timestamp at the time of flushing. The same message would be directed to the file `1703037311.json`, where `1703037311` is the flush timestamp of the first message in this file.

To ensure backward compatibility, the existing partitioner implementation will be maintained, and current usage will not be disrupted.

A Proof of Concept (PoC) implementation for this proposal can be found at: #845

(cherry picked from commit 29d9fde)
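The two path schemes described under High Level Design can be illustrated with a small sketch. This is not the connector's code or its new `Partitioner` interface: the class `PartitionerSketch`, both method names, and their parameters are hypothetical and exist only to show how the example file names in the proposal are formed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names come from the connector.
final class PartitionerSketch {

    // Topic partitioning: the file lives under the topic partition name and is
    // named after the earliest message offset it contains.
    static String topicPartitionPath(String topicPartition, List<Long> offsets, String extension) {
        long earliestOffset = Collections.min(offsets);
        return topicPartition + "/" + earliestOffset + "." + extension;
    }

    // Time partitioning: the file is named after the flush timestamp
    // (assumed here to be epoch seconds) of the first message in the file.
    static String timePartitionPath(long firstFlushEpochSeconds, String extension) {
        return firstFlushEpochSeconds + "." + extension;
    }

    public static void main(String[] args) {
        // Offsets 42..44 are made-up values standing in for "xxx" in the proposal.
        System.out.println(topicPartitionPath(
                "public/default/my-topic-partition-0", Arrays.asList(43L, 42L, 44L), "json"));
        // prints "public/default/my-topic-partition-0/42.json"

        System.out.println(timePartitionPath(1703037311L, "json"));
        // prints "1703037311.json"
    }
}
```

The point of the contrast is that the time-based path carries no topic component at all, which is why it cannot be produced by a partitioner that is only invoked after messages have already been split by topic.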
RobertIndie
added a commit
that referenced
this pull request
Jan 10, 2024
Proposal: #847

- Incorrect implementation of time partitioner: The current implementation only adds time information to the file path, while both the Simple Partitioner and Time Partitioner partition messages based on the topic partition. The Time Partitioner is merely a special case of the Simple Partitioner. The existing time partitioner does not actually partition messages based on time.
- Inflexible current partitioner: It is currently hard to implement a correct Time Partitioner on top of the current partitioner interface. The connector first splits messages by topic, then lets the partitioner generate the file path. As a result, all current partitioner implementations are based on the topic.
- Non-intuitive Partitioner interface: The [existing Partitioner interface](https://github.com/streamnative/pulsar-io-cloud-storage/blob/master/src/main/java/org/apache/pulsar/io/jcloud/partitioner/Partitioner.java) is not user-friendly. It has four methods but does not need that many. For instance, the expected behaviors of `encodePartition` and `generatePartitionedPath` overlap. The interface should be simple and clear.
- Refactor the existing partitioner implementation to improve its intuitiveness and ease of use.
- Introduce a new partitioner interface that includes two partitioners: Topic Partitioner and Time Partitioner.
- Ensure backward compatibility to avoid disrupting current partitioner usage.

(cherry picked from commit a8d9f1d)
Motivation
Please see the proposal in this PR
Documentation
Check the box below.
Need to update docs?
doc-required
(If you need help on updating docs, create a doc issue)
no-need-doc
(Please explain why)
doc
(If this PR contains doc changes)