Parallelizable Specific Annotations #1272
base: main
Conversation
Documentation of ParallelizableList and is_parallelizable
❌ Changes requested. Reviewed everything up to 9946dc0 in 1 minute and 21 seconds.
More details:
- Looked at 158 lines of code in 4 files
- Skipped 0 files when reviewing
- Skipped posting 5 drafted comments based on config settings
1. hamilton/htypes.py:319
- Draft comment: Consider using `issubclass(type, Parallelizable)` instead of checking `type.__bases__` to ensure all subclasses are covered: `return issubclass(type, Parallelizable)`
- Reason this comment was not posted: Marked as duplicate.
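To illustrate the reviewer's point with stand-in classes (not Hamilton's actual type hierarchy): a `__bases__` check only sees direct parents, while `issubclass` walks the whole inheritance chain.

```python
class Parallelizable:  # stand-in for Hamilton's Parallelizable
    pass

class Direct(Parallelizable):
    pass

class Indirect(Direct):  # a subclass of a subclass
    pass

# A direct-bases check misses the indirect subclass...
print(Parallelizable in Indirect.__bases__)   # False
# ...while issubclass covers the full inheritance chain.
print(issubclass(Indirect, Parallelizable))   # True
```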
2. hamilton/htypes.py:323
- Draft comment: Consider using `is_parallelizable(_get_origin(type_))` to ensure all parallelizable types are covered consistently: `return is_parallelizable(_get_origin(type_))`
- Reason this comment was not posted: Comment was on unchanged code.
3. hamilton/htypes.py:312
- Draft comment: The function `is_parallelizable` is defined but not used. Consider using `is_parallelizable_type` consistently across the codebase to avoid redundancy and adhere to the DRY principle.
- Reason this comment was not posted: Marked as duplicate.
4. hamilton/node.py:288
- Draft comment: The logic for checking if a type is parallelizable is duplicated here and in `custom_subclass_check` in `htypes.py`. Consider refactoring to avoid redundancy and adhere to the DRY principle.
- Reason this comment was not posted: Marked as duplicate.
5. tests/execution/test_executors.py:353
- Draft comment: Consider adding documentation for the `test_parallel_list` function in `docs/` to inform users about the new `ParallelizableList` feature and its usage.
- Reason this comment was not posted: Confidence changes required: 50%. The `test_parallel_list` function in `test_executors.py` is a new test case added to test the functionality of `ParallelizableList`. This should be documented in `docs/` to inform users about the new feature and its usage.
It was throwing an exception when checking against None
Python 3.8 does not support `list[]`; changed to `typing.List[]`
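For context, a minimal illustration of the incompatibility behind this commit (runnable on any Python 3 version; the subscripted built-in form only works on 3.9+ per PEP 585):

```python
import sys
import typing

# typing.List[str] works on every supported Python 3 version.
annotation = typing.List[str]

if sys.version_info >= (3, 9):
    # PEP 585: built-in generics are subscriptable from 3.9 onward.
    annotation = list[str]
else:
    # On 3.8, `list[str]` raises TypeError: 'type' object is not
    # subscriptable, which is why the annotation was changed to
    # typing.List[...].
    pass

print(annotation)
```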
Since I worked on #1115, I figured I would chime in. Note that I have run into something similar, but I just ended up casting/converting.
@cswartzvi @EltonCN just spitballing, would this be an okay API alternative?

```python
from hamilton.function_modifiers import parallelizable

@parallelizable
def hello1() -> list[str]:  # allow lists and anything that could be a sequence, etc.
    return ["h", "a", "b"]

@parallelizable
def hello2() -> Generator[str, None, None]:  # allow generators
    for l in ["h", "a", "b"]:
        yield l

a = hello1()  # this would be a list
b = a[0]  # this would be a string
for a in hello2():
    a  # this would be a string
```

Then for the collect we'd have to figure out something equivalent...

```python
from hamilton.function_modifiers import collect, source

@collect("node_name")
def foo(node_name: list[TYPE]) -> ...
```
@skrawcz Yes, this seems to be a better alternative than what I proposed, as the type annotation would directly indicate the returned type.
@skrawcz Changes to If facing issues, the work by @cswartzvi on
So I see the value here, but the cynic in me wonders what the value is of making it polymorphic in terms of the type being parallelized. E.g. we have two possibilities here:

So it's a trade-off between adding something to the framework that could potentially mean more work down the road, or working around it on the client side 🤔

The reason that it's an iterable is that we don't necessarily want to support the "casted" case, in which we cast everything to a list. E.g. in a longer stream we may want to not store everything in memory -- although the framework currently does maintain an internal cache, we'd like the option of moving towards a streaming mode later.

Another option is to (1) keep the changes you have except for exposing the list -- what you've added is just better code/more robust, and it allows you, as the user, to subclass it! That way you could define the

Regarding a decorator -- it's the same thing, allowing more complexity/removing the contract. But there is something nice about it: it effectively puts the user in control of how it's implemented, giving more freedom, but restricting assumptions we can make later. So I think I like it; my only hesitancy is that I don't want to add another API for how to do this in this version of Hamilton. Having too many ways to do things is a code smell, even if we're moving towards the "better" one. And it's additional complexity.
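The streaming concern above can be illustrated with a small sketch (hypothetical helpers, not Hamilton's API): a generator yields items lazily, while a list materializes everything up front.

```python
from typing import Generator, List

def stream_items() -> Generator[int, None, None]:
    # Items are produced lazily; nothing beyond the current item
    # needs to be held in memory.
    for i in range(5):
        yield i * i

def list_items() -> List[int]:
    # The full result is materialized in memory before any consumer runs.
    return [i * i for i in range(5)]

# Both can be consumed identically as Iterables, which is why the
# framework treats parallelizable outputs as Iterable rather than list.
assert list(stream_items()) == list_items()
```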
Perhaps a compromise for the decorator would be to add it, but for now, internally, when Hamilton identifies it, it will change the function's annotation to use Parallelize/Collect and treat the function the same way it already does (and the function still needs to return an Iterable). This way, there is no need to change much of the code now, but this alternative can be tested, showing both in the documentation. Maybe check in time which one is used more using telemetry? There is still the problem of having two ways of defining the same thing...

I also don't know exactly what assumptions could no longer be made, but a decorator seems more expressive to me than the annotation, since it is possible to configure other aspects of how the return would be parallelized. In any case, when Hamilton gets the return to parallelize, it would always treat it as an Iterable (since both list and Generator are subclasses of it). The problem of keeping the elements in memory if it is a list is, I think, just a matter of letting the user know that this will happen.

I still haven't been able to understand exactly what the problem would be in providing the ParallelizableList directly, but adding just the use of `is_parallelizable` also seems sufficient to me, and the possibility of adding the decorator can be rethought/planned by you in a future version.
Yeah -- adding the decorator but having it compile down to the current way it works is the same -- narrows surface area. Regarding limitations about using it directly -- it's more about what we'd like to do in the future, e.g. allowing for a potential future in which we have more of a streaming pipeline. That said, we don't do that now... IMO we should just have the
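The "compile down" idea discussed above could look something like this sketch (hypothetical names and a simplified stand-in class, not the API that was merged):

```python
import typing

T = typing.TypeVar("T")

class Parallelizable(typing.Generic[T]):
    """Stand-in for hamilton.htypes.Parallelizable (simplified)."""

def parallelizable(fn):
    """Hypothetical decorator: rewrites the return annotation to
    Parallelizable[...] so the rest of the framework stays unchanged."""
    ret = typing.get_type_hints(fn).get("return")
    args = typing.get_args(ret)
    element_type = args[0] if args else typing.Any
    fn.__annotations__["return"] = Parallelizable[element_type]
    return fn

@parallelizable
def hello() -> typing.List[str]:
    return ["h", "a", "b"]

# The user wrote a list annotation, but internally the framework would
# now see Parallelizable[str]:
print(hello.__annotations__["return"])
```

The function itself is untouched, so it still returns an Iterable and is executed exactly as Parallelizable-annotated functions already are.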
Windows uses backslashes, and the generated path for the URLs was being generated with mixed "/" and "\"
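A common fix for this class of bug (a sketch, not necessarily the exact change made in this PR) is to normalize the path with `pathlib` before building the URL:

```python
from pathlib import PureWindowsPath

# A path assembled on Windows may mix separators (hypothetical example path).
raw = r"docs\examples/parallel_list"

# Interpreting it as a Windows path and converting to POSIX form yields
# forward slashes only, which is what URLs require.
url_part = PureWindowsPath(raw).as_posix()
print(url_part)  # docs/examples/parallel_list
```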
This reverts commit ec92486.
@elijahbenizzy I implemented the example and added a link to it in the docs. The `ParallelizableList` was removed from the main code. I also needed to change the `validade_example` because it didn't work on Windows. About the CI errors, I'm not sure if they came from some modification I made, since they are in the integrations...
When annotating a function with `Parallelizable`, it is not possible to specify in the annotation what type the function will actually return, as these are not identified by a linter.
So, in the example:
The type of `b` is not identified correctly, and no hints appear when writing a call to a method of `a`.
This issue is especially important for writing functions that can work with or without Hamilton with good usability.
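The problem being described can be reproduced with a sketch like the following (illustrative names; `Parallelizable` here is a simplified stand-in for `hamilton.htypes.Parallelizable`):

```python
from typing import Iterator, TypeVar

T = TypeVar("T")

class Parallelizable(Iterator[T]):
    """Stand-in for hamilton.htypes.Parallelizable."""

def numbers() -> "Parallelizable[int]":
    # At runtime this is just a generator; the annotation is what
    # static checkers see.
    for n in [1, 2, 3]:
        yield n

a = numbers()
# A linter resolves `a` as Parallelizable[int], not list[int], so
# list operations like `a[0]` get no hints and may be flagged,
# which is the usability problem described above.
first = next(a)
```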
Changes
How I tested this
New "test_parallel_list" in "tests/execution/test_executors.py".
Notes
Checklist