Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal-1: Partitioner Refactoring #847

Merged
merged 4 commits into from
Dec 22, 2023
Merged

Proposal-1: Partitioner Refactoring #847

merged 4 commits into from
Dec 22, 2023

Conversation

RobertIndie
Copy link
Member

Motivation

Please see the proposal in this PR

Documentation

Check the box below.

Need to update docs?

  • doc-required

    (If you need help on updating docs, create a doc issue)

  • no-need-doc

    (Please explain why)

  • doc

    (If this PR contains doc changes)

@RobertIndie RobertIndie self-assigned this Dec 20, 2023
@github-actions github-actions bot added the doc This pr contains a document label Dec 20, 2023
@RobertIndie RobertIndie marked this pull request as ready for review December 20, 2023 02:59
@RobertIndie RobertIndie requested a review from a team as a code owner December 20, 2023 02:59
shibd
shibd previously approved these changes Dec 20, 2023
Copy link
Member

@shibd shibd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, A small question.

@RobertIndie RobertIndie merged commit 29d9fde into master Dec 22, 2023
@RobertIndie RobertIndie deleted the proposal-1 branch December 22, 2023 03:32
RobertIndie added a commit that referenced this pull request Jan 9, 2024
Proposal: #847

### Motivation

- Incorrect implementation of time partitioner: The current implementation only adds time information to the file path
  , while both the Simple Partitioner and Time Partitioner partition messages based on the topic partition.
  The Time Partitioner is merely a special case of the Simple Partitioner. The existing time partitioner does not
  actually partition messages based on time.

- Inflexible current partitioner: It's hard for now to implement a correct Time Partitioner
  based on the current partitioner interface. The current connector first splits messages based on the topic, then
  allows the partitioner to
  generate the file path. As a result, all current partitioner implementations are based on the topic.

- Non-intuitive Partitioner interface:
  The [existing Partitioner interface](https://github.com/streamnative/pulsar-io-cloud-storage/blob/master/src/main/java/org/apache/pulsar/io/jcloud/partitioner/Partitioner.java)
  is not user-friendly. It has four methods, but it actually don't need so many methods. For instance, the expected
  behavior of `encodePartition` and `generatePartitionedPath` is overlapped.
  We should make its implementation simple and clear enough.

### Modifications


- Refactor the existing partitioner implementation to improve its intuitiveness and ease of use.
- Introduce a new partitioner interface that includes two partitioners: Topic Partitioner and Time Partitioner.
- Ensure backward compatibility to avoid disrupting current partitioner usage.
RobertIndie added a commit that referenced this pull request Jan 10, 2024
## Motivation

- Incorrect implementation of time partitioner: The current implementation only adds time information to the file path
  , while both the Simple Partitioner and Time Partitioner partition messages based on the topic partition.
  The Time Partitioner is merely a special case of the Simple Partitioner. The existing time partitioner does not
  actually partition messages based on time.

- Inflexible current partitioner: It's hard for now to implement a correct Time Partitioner
  based on the current partitioner interface. The current connector first splits messages based on the topic, then
  allows the partitioner to
  generate the file path. As a result, all current partitioner implementations are based on the topic.

- Non-intuitive Partitioner interface:
  The [existing Partitioner interface](https://github.com/streamnative/pulsar-io-cloud-storage/blob/master/src/main/java/org/apache/pulsar/io/jcloud/partitioner/Partitioner.java)
  is not user-friendly. It has four methods, but it actually don't need so many methods. For instance, the expected
  behavior of `encodePartition` and `generatePartitionedPath` is overlapped.
  We should make its implementation simple and clear enough.

## High Level Design

A new `Partitioner` interface will be added, with two partitioner implementations: `TopicPartitioner`
and `TimePartitioner`.

The behavior of these partitioners is as follows:

- **Topic Partitioner**: Messages are partitioned according to the pre-existing partitions in the Pulsar topics. For
  instance, a message for the topic `public/default/my-topic-partition-0` would be directed to the
  file `public/default/my-topic-partition-0/xxx.json`, where `xxx` signifies the earliest message offset in this file.
- **Time Partitioner**: Messages are partitioned based on the timestamp at the time of flushing. For the aforementioned
  message, it would be directed to the file `1703037311.json`, where `1703037311` represents the flush timestamp of the
  first message in this file.

To ensure backward compatibility, the existing partitioner implementation will be maintained, and current user usage
will not be disrupted.

A Proof of Concept (PoC) implementation for this proposal can be found
at: #845

(cherry picked from commit 29d9fde)
RobertIndie added a commit that referenced this pull request Jan 10, 2024
Proposal: #847

- Incorrect implementation of time partitioner: The current implementation only adds time information to the file path
  , while both the Simple Partitioner and Time Partitioner partition messages based on the topic partition.
  The Time Partitioner is merely a special case of the Simple Partitioner. The existing time partitioner does not
  actually partition messages based on time.

- Inflexible current partitioner: It's hard for now to implement a correct Time Partitioner
  based on the current partitioner interface. The current connector first splits messages based on the topic, then
  allows the partitioner to
  generate the file path. As a result, all current partitioner implementations are based on the topic.

- Non-intuitive Partitioner interface:
  The [existing Partitioner interface](https://github.com/streamnative/pulsar-io-cloud-storage/blob/master/src/main/java/org/apache/pulsar/io/jcloud/partitioner/Partitioner.java)
  is not user-friendly. It has four methods, but it actually don't need so many methods. For instance, the expected
  behavior of `encodePartition` and `generatePartitionedPath` is overlapped.
  We should make its implementation simple and clear enough.

- Refactor the existing partitioner implementation to improve its intuitiveness and ease of use.
- Introduce a new partitioner interface that includes two partitioners: Topic Partitioner and Time Partitioner.
- Ensure backward compatibility to avoid disrupting current partitioner usage.

(cherry picked from commit a8d9f1d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants