-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
23bc567
commit 1474395
Showing
19 changed files
with
333 additions
and
87 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Add ``KeyValueIntHWM`` class, designed to manage HWM for partitioned data sources like Kafka topics. It extends the functionality of the base HWM classes to handle key-value pairs. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,3 +14,9 @@ HWM | |
:caption: File HWM | ||
|
||
file/index | ||
|
||
.. toctree:: | ||
:maxdepth: 2 | ||
:caption: KeyValue HWM | ||
|
||
key_value/index |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
.. _key_value_hwm_classes: | ||
|
||
KeyValue HWM | ||
======== | ||
|
||
.. toctree:: | ||
:maxdepth: 2 | ||
:caption: HWM classes | ||
:name: key_value_hwm_classes | ||
|
||
key_value_int_hwm | ||
|
||
What is KeyValue HWM? | ||
---------------------- | ||
|
||
The KeyValue High Water Mark (HWM) is a specialized class designed to manage and track incremental data changes in systems where data is stored or represented as key-value pairs, such as in message queues like Kafka. | ||
|
||
Use Case | ||
---------------------- | ||
|
||
The ``KeyValueHWM`` class is particularly beneficial in scenarios where there is a need to `incrementally <https://onetl.readthedocs.io/en/0.10.0/strategy/incremental_strategy.html>`_ upload data in an ETL process. | ||
|
||
For instance, in typical ETL processes using `Spark with Kafka <https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html>`_, data re-written entirely from all partitions in topics starting from **zero** offset. This approach can be inefficient, time-consuming and create duplicates in target. By leveraging the ``KeyValueIntHWM`` class, it becomes possible to track the last offset of data processed. This enables the ETL process to only write new data increments, significantly reducing the amount of data transferred during each run. | ||
|
||
Example Usage with Kafka Messages | ||
--------------------------------- | ||
|
||
Consider a Kafka topic with several partitions, each having its own offset indicating the latest message. | ||
|
||
Initial Kafka Topic State: | ||
|
||
.. code:: bash | ||
Partition 0: Offset 100 | ||
Partition 1: Offset 150 | ||
Partition 2: Offset 200 | ||
When a new batch of messages arrives, the offsets in the Kafka partitions are updated: | ||
|
||
.. code:: bash | ||
Partition 0: Offset 110 # 10 new messages | ||
Partition 1: Offset 155 # 5 new messages | ||
Partition 2: Offset 200 # No new messages | ||
Using the ``KeyValueIntHWM`` class, we can track these offsets: | ||
|
||
.. code:: python | ||
from etl_entities.hwm import KeyValueIntHWM | ||
initial_offsets = { | ||
0: 100, # Partition 0 offset 100 | ||
1: 150, # Partition 1 offset 150 | ||
2: 200, # Partition 2 offset 200 | ||
} | ||
# Creating an instance of KeyValueIntHWM with initial offsets | ||
hwm = KeyValueIntHWM(value=initial_offsets, ...) | ||
# Running some ETL process, which updates the HWM value after finish | ||
run_etl_process(hwm, new_batch_data) | ||
# HWM values after running the ETL process | ||
assert hwm.value == {0: 110, 1: 155, 2: 200} | ||
This approach ensures that only new messages (i.e., those after the last recorded offset in each partition) are considered in the next ETL process. For Partition 0 and Partition 1, the new offsets (110 and 155, respectively) are stored in the HWM, while Partition 2 remains unchanged as there were no new messages. | ||
|
||
|
||
Restrictions | ||
------------ | ||
|
||
- **Non-Decreasing Values**: The ``KeyValueHWM`` class is designed to handle only non-decreasing values. During the update process, if the new offset provided for a given partition is less than the current offset, the value will not be updated. | ||
|
||
- **Incomplete Key Updates**: If a key is not included in new hwm value, its value remains unchanged. This is essential because keys in systems like Kafka (partitions) cannot be deleted, and their last known |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
KeyValue Int HWM | ||
============= | ||
|
||
.. currentmodule:: etl_entities.hwm.key_value.key_value_int_hwm | ||
|
||
.. autoclass:: KeyValueIntHWM | ||
:members: name, set_value, dict, json, copy, update |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Empty file.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,138 @@ | ||
# Copyright 2023 MTS (Mobile Telesystems) | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
from typing import Generic, TypeVar | ||
|
||
from frozendict import frozendict | ||
from pydantic import Field | ||
|
||
from etl_entities.entity import GenericModel | ||
from etl_entities.hwm.hwm import HWM | ||
|
||
KeyValueHWMValueType = TypeVar("KeyValueHWMValueType") | ||
KeyValueHWMType = TypeVar("KeyValueHWMType", bound="KeyValueHWM") | ||
|
||
|
||
class KeyValueHWM(HWM[frozendict], Generic[KeyValueHWMValueType], GenericModel): | ||
"""Base key value HWM type | ||
Parameters | ||
---------- | ||
name : ``str`` | ||
HWM unique name | ||
value : ``frozendict[Any, KeyValueHWMValueType]]]`` , default: ``frozendict`` | ||
HWM value | ||
description : ``str``, default: ``""`` | ||
Description of HWM | ||
source : Any, default: ``None`` | ||
HWM source, e.g. ``topic`` name | ||
expression : Any, default: ``None`` | ||
Expression used to generate HWM value, e.g. ``offset`` | ||
modified_time : :obj:`datetime.datetime`, default: current datetime | ||
HWM value modification time | ||
""" | ||
|
||
entity: str = Field(alias="topic") | ||
# value: frozendict with Any type for keys and KeyValueHWMValueType type for values. | ||
# Direct type specification for frozendict contents (e.g., frozendict[KeyType, ValueType]) | ||
# is supported only from Python 3.9 onwards. | ||
value: frozendict = Field(default_factory=frozendict) | ||
|
||
def update(self, new_data: dict) -> "KeyValueHWM[KeyValueHWMValueType]": | ||
""" | ||
Updates the HWM value based on provided new key-value data. This method only updates | ||
the value if the new value is greater than the current valur for a given key | ||
or if the key does not exist in the current value. | ||
.. note:: | ||
Changes the HWM value in place and returns the modified instance. | ||
Parameters | ||
---------- | ||
new_data : dict | ||
A dictionary representing new key-value data. For example: keys are partitions and values are offsets. | ||
Returns | ||
------- | ||
self : KeyValueHWM | ||
The instance with updated HWM value. | ||
Examples | ||
-------- | ||
.. code:: python | ||
from frozendict import frozendict | ||
from etl_entities.hwm import KeyValueHWM | ||
hwm = KeyValueHWM(value={0: 100, 1: 120}, ...) | ||
hwm.update({1: 125, 2: 130}) | ||
assert hwm.value == frozendict({0: 100, 1: 125, 2: 130}) | ||
# The offset for partition 1 is not updated as 123 is less than 125 | ||
hwm.update({1: 123}) | ||
assert hwm.value == frozendict({0: 100, 1: 125, 2: 130}) | ||
""" | ||
|
||
modified = False | ||
temp_dict = dict(self.value) | ||
|
||
for partition, new_offset in new_data.items(): | ||
current_offset = temp_dict.get(partition) | ||
if current_offset is None or new_offset > current_offset: | ||
temp_dict[partition] = new_offset | ||
modified = True | ||
|
||
# update the frozendict only if modifications were made. | ||
# this avoids unnecessary reassignment and creation of a new frozendict object, | ||
if modified: | ||
self.set_value(frozendict(temp_dict)) | ||
|
||
return self | ||
|
||
def __eq__(self, other): | ||
"""Checks equality of two HWM instances | ||
Params | ||
------- | ||
other : :obj:`etl_entities.hwm.key_value.key_value_hwm.KeyValueHWM` | ||
You can compare two :obj:`etl_entities.hwm.key_value.key_value_hwm.KeyValueHWM` instances, | ||
obj:`etl_entities.hwm.key_value.key_value_hwm.KeyValueHWM` with an :obj:`object`, | ||
if its value is comparable with the ``value`` attribute of HWM | ||
Returns | ||
-------- | ||
result : bool | ||
``True`` if both inputs are the same, ``False`` otherwise. | ||
""" | ||
|
||
if not isinstance(other, type(self)): | ||
return NotImplemented | ||
|
||
self_fields = self.dict(exclude={"modified_time"}) | ||
other_fields = other.dict(exclude={"modified_time"}) | ||
return self_fields == other_fields | ||
Oops, something went wrong.