[Proposal-1] Support Partitioner Refactoring #845

Merged (10 commits) on Jan 9, 2024

@RobertIndie RobertIndie commented Dec 18, 2023

Proposal: #847

Motivation

  • Incorrect implementation of the time partitioner: The current implementation only adds time information to the
    file path, while both the Simple Partitioner and the Time Partitioner partition messages based on the topic
    partition. The Time Partitioner is therefore merely a special case of the Simple Partitioner and does not
    actually partition messages based on time.

  • Inflexible current partitioner: It is hard to implement a correct Time Partitioner
    on top of the current partitioner interface. The connector first splits messages by topic and only then
    lets the partitioner generate the file path. As a result, all current partitioner implementations are
    topic-based.

  • Non-intuitive Partitioner interface:
    The existing Partitioner interface
    is not user-friendly. It has four methods, which is more than it needs. For instance, the expected
    behavior of encodePartition and generatePartitionedPath overlaps.
    The interface should be simple and clear.

Modifications

  • Refactor the existing partitioner implementation to improve its intuitiveness and ease of use.
  • Introduce a new partitioner interface with two implementations: Topic Partitioner and Time Partitioner.
  • Ensure backward compatibility to avoid disrupting current partitioner usage.

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests.

Documentation

Check the box below.

Need to update docs?

  • doc-required

    (If you need help on updating docs, create a doc issue)

  • no-need-doc

    (Please explain why)

  • doc

    (If this PR contains doc changes)

@RobertIndie RobertIndie self-assigned this Dec 18, 2023
@RobertIndie: Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains doc-related info, which helps others learn more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@github-actions github-actions bot added the doc-info-missing label (This pr needs to mark a document option in description) Dec 18, 2023
RobertIndie added a commit that referenced this pull request Dec 22, 2023
## Motivation

- Incorrect implementation of the time partitioner: The current implementation only adds time information to the
  file path, while both the Simple Partitioner and the Time Partitioner partition messages based on the topic
  partition. The Time Partitioner is therefore merely a special case of the Simple Partitioner and does not
  actually partition messages based on time.

- Inflexible current partitioner: It is hard to implement a correct Time Partitioner
  on top of the current partitioner interface. The connector first splits messages by topic and only then
  lets the partitioner generate the file path. As a result, all current partitioner implementations are
  topic-based.

- Non-intuitive Partitioner interface:
  The [existing Partitioner interface](https://github.com/streamnative/pulsar-io-cloud-storage/blob/master/src/main/java/org/apache/pulsar/io/jcloud/partitioner/Partitioner.java)
  is not user-friendly. It has four methods, which is more than it needs. For instance, the expected
  behavior of `encodePartition` and `generatePartitionedPath` overlaps.
  The interface should be simple and clear.

## High Level Design

A new `Partitioner` interface will be added, with two partitioner implementations: `TopicPartitioner`
and `TimePartitioner`.

The behavior of these partitioners is as follows:

- **Topic Partitioner**: Messages are partitioned according to the pre-existing partitions in the Pulsar topics. For
  instance, a message for the topic `public/default/my-topic-partition-0` would be directed to the
  file `public/default/my-topic-partition-0/xxx.json`, where `xxx` signifies the earliest message offset in this file.
- **Time Partitioner**: Messages are partitioned based on the timestamp at the time of flushing. For the aforementioned
  message, it would be directed to the file `1703037311.json`, where `1703037311` represents the flush timestamp of the
  first message in this file.

To ensure backward compatibility, the existing partitioner implementation will be kept, so current usage
will not be disrupted.
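
The two behaviors described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Msg` class, the single-method `Partitioner` signature, and the use of a plain `long` as the message offset are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionerSketch {

    /** Hypothetical stand-in for a Pulsar record: only a topic and a numeric offset. */
    static final class Msg {
        final String topic;
        final long offset;

        Msg(String topic, long offset) {
            this.topic = topic;
            this.offset = offset;
        }
    }

    /** Hypothetical single-method interface: maps a flushed batch to (file path -> messages). */
    interface Partitioner {
        Map<String, List<Msg>> partition(List<Msg> batch, long flushTimestampSeconds);
    }

    /** Groups by topic partition; each file is named after the earliest offset in its group. */
    static final class TopicPartitioner implements Partitioner {
        @Override
        public Map<String, List<Msg>> partition(List<Msg> batch, long flushTimestampSeconds) {
            Map<String, List<Msg>> byTopic = new LinkedHashMap<>();
            for (Msg m : batch) {
                byTopic.computeIfAbsent(m.topic, t -> new ArrayList<>()).add(m);
            }
            Map<String, List<Msg>> files = new LinkedHashMap<>();
            for (Map.Entry<String, List<Msg>> e : byTopic.entrySet()) {
                long earliest = Long.MAX_VALUE;
                for (Msg m : e.getValue()) {
                    earliest = Math.min(earliest, m.offset);
                }
                files.put(e.getKey() + "/" + earliest + ".json", e.getValue());
            }
            return files;
        }
    }

    /** Ignores the topic; the whole batch goes to one file named after the flush timestamp. */
    static final class TimePartitioner implements Partitioner {
        @Override
        public Map<String, List<Msg>> partition(List<Msg> batch, long flushTimestampSeconds) {
            Map<String, List<Msg>> files = new LinkedHashMap<>();
            if (!batch.isEmpty()) {
                files.put(flushTimestampSeconds + ".json", new ArrayList<>(batch));
            }
            return files;
        }
    }

    public static void main(String[] args) {
        List<Msg> batch = List.of(
                new Msg("public/default/my-topic-partition-0", 7),
                new Msg("public/default/my-topic-partition-0", 5),
                new Msg("public/default/my-topic-partition-1", 3));
        long flushTs = 1703037311L;
        System.out.println(new TopicPartitioner().partition(batch, flushTs).keySet());
        System.out.println(new TimePartitioner().partition(batch, flushTs).keySet());
    }
}
```

In this sketch the topic partitioner yields one file per topic partition, while the time partitioner puts messages from different topics into the same file. The old interface could not express this because messages were already split by topic before the partitioner was called.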

A Proof of Concept (PoC) implementation for this proposal can be found
at: #845
@github-actions github-actions bot removed the doc-info-missing label (This pr needs to mark a document option in description) Jan 4, 2024
github-actions bot commented Jan 4, 2024

@RobertIndie: Thanks for providing doc info!

@github-actions github-actions bot added the doc-required label (This pr needs a document) Jan 4, 2024
@RobertIndie RobertIndie marked this pull request as ready for review January 4, 2024 14:25
@RobertIndie RobertIndie requested a review from a team as a code owner January 4, 2024 14:25
@RobertIndie RobertIndie changed the title from "Proof of concept for Partitioner Refactoring" to "[Proposa-1] Support Partitioner Refactoring" Jan 4, 2024

@shibd shibd left a comment

Left some small comments.

@RobertIndie RobertIndie requested a review from shibd January 5, 2024 09:19
@RobertIndie RobertIndie changed the title from "[Proposa-1] Support Partitioner Refactoring" to "[Proposal-1] Support Partitioner Refactoring" Jan 5, 2024
@RobertIndie RobertIndie merged commit a8d9f1d into master Jan 9, 2024
3 checks passed
@RobertIndie RobertIndie deleted the partitioner branch January 9, 2024 02:00
RobertIndie added a commit that referenced this pull request Jan 9, 2024
### Motivation

Add docs for #845.
After this PR is approved, I will update the docs for the other cloud storage connectors.
@RobertIndie RobertIndie mentioned this pull request Jan 9, 2024
3 tasks
RobertIndie added a commit that referenced this pull request Jan 10, 2024
(cherry picked from commit 29d9fde)
RobertIndie added a commit that referenced this pull request Jan 10, 2024
Proposal: #847

(cherry picked from commit a8d9f1d)
RobertIndie added a commit that referenced this pull request Jan 10, 2024
(cherry picked from commit 6a6069a)
RobertIndie added a commit that referenced this pull request Jan 16, 2024
### Motivation

In #845, we deprecated `partitionerType` and now recommend that users use `partitioner` instead.

Set the default value of `partitionerType` to `partition` so that it no longer has to be configured explicitly. Without a default, the connector raises an exception when the option is not set.

### Modifications

- Add the default value `partition` for `partitionerType`
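
The fallback described above could look roughly like the sketch below. It only illustrates the idea of defaulting a missing option. The `resolvePartitionerType` helper and the plain `Map` config are assumptions for the example and are not taken from the connector's code.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionerTypeDefault {

    /** Default taken from the commit message above. */
    static final String DEFAULT_PARTITIONER_TYPE = "partition";

    /**
     * Hypothetical helper: returns the configured partitionerType,
     * or the default when the option is missing or blank.
     */
    static String resolvePartitionerType(Map<String, Object> config) {
        Object value = config.get("partitionerType");
        if (value == null || value.toString().trim().isEmpty()) {
            return DEFAULT_PARTITIONER_TYPE;
        }
        return value.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        // Option not set: falls back to the default instead of failing.
        System.out.println(resolvePartitionerType(config));

        // Option set explicitly ("time" is used here only as an illustrative value).
        config.put("partitionerType", "time");
        System.out.println(resolvePartitionerType(config));
    }
}
```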
RobertIndie added a commit that referenced this pull request Jan 16, 2024
(cherry picked from commit ba24f85)
shibd added a commit to shibd/pulsar-io-cloud-storage that referenced this pull request Oct 17, 2024
shibd added a commit that referenced this pull request Oct 22, 2024
* Revert "Add default value for the partitionType to `partition` (#863)"

This reverts commit ba24f85.

* Revert "[Proposal-1] Support Partitioner Refactoring (#845)"

This reverts commit a8d9f1d.

* feat: Add new batch model
shibd added a commit that referenced this pull request Oct 22, 2024
(cherry picked from commit 4b94112)
2 participants