
Parallelizable Specific Annotations #1272

Open: wants to merge 13 commits into main

Conversation

@EltonCN commented Jan 16, 2025

When annotating a function with Parallelizable, the annotation cannot express what the function actually returns, so linters and type checkers cannot identify the concrete return type.

So, in the example:

from hamilton.htypes import Parallelizable

def hello() -> Parallelizable[str]:
    return ["h", "a", "b"]

a = hello()
b = a[0]

the type of b is not inferred correctly, and no completion hints appear when calling methods on a.

This issue is especially important for writing functions that can be used both with and without Hamilton.
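For comparison, a plain list annotation gives type checkers everything they need, which is what motivates wanting the Parallelizable annotation to carry the concrete element type as well (purely illustrative snippet):

from typing import List

def hello_plain() -> List[str]:
    return ["h", "a", "b"]

a = hello_plain()
b = a[0]       # inferred as str
a.append("c")  # completion hints for list methods work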

Changes

  • Changed the check for whether a type is Parallelizable so that subclasses are also recognized.
  • Added "ParallelizableList", which demonstrates the change and lets functions that return lists be annotated as parallelizable (see the sketch below).

How I tested this

New "test_parallel_list" in "tests/execution/test_executors.py".

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@ellipsis-dev (bot, Contributor) left a comment

❌ Changes requested. Reviewed everything up to 9946dc0 in 1 minute and 21 seconds

More details
  • Looked at 158 lines of code in 4 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 drafted comments based on config settings.
1. hamilton/htypes.py:319
  • Draft comment:
    Consider using issubclass(type, Parallelizable) instead of checking type.__bases__ to ensure all subclasses are covered.
    return issubclass(type, Parallelizable)
  • Reason this comment was not posted:
    Marked as duplicate.
2. hamilton/htypes.py:323
  • Draft comment:
    Consider using is_parallelizable(_get_origin(type_)) to ensure all parallelizable types are covered consistently.
    return is_parallelizable(_get_origin(type_))
  • Reason this comment was not posted:
    Comment was on unchanged code.
3. hamilton/htypes.py:312
  • Draft comment:
    The function is_parallelizable is defined but not used. Consider using is_parallelizable_type consistently across the codebase to avoid redundancy and adhere to the DRY principle.
  • Reason this comment was not posted:
    Marked as duplicate.
4. hamilton/node.py:288
  • Draft comment:
    The logic for checking if a type is parallelizable is duplicated here and in custom_subclass_check in htypes.py. Consider refactoring to avoid redundancy and adhere to the DRY principle.
  • Reason this comment was not posted:
    Marked as duplicate.
5. tests/execution/test_executors.py:353
  • Draft comment:
    Consider adding documentation for the test_parallel_list function in the docs/ to inform users about the new ParallelizableList feature and its usage.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The test_parallel_list function in test_executors.py is a new test case added to test the functionality of the ParallelizableList. This should be documented in the docs/ to inform users about the new feature and its usage.


It was throwing an exception when checking against None
Python 3.8 does not support list[], changed to typing.List[]
@cswartzvi (Contributor) commented Jan 17, 2025

Since I worked on #1115, I figured I would chime in. Originally, Parallelizable and Collect were abstract base classes derived from Generator before they were loosened to Iterable-based protocols. I believe the more concrete list was not used originally in order to signal intent: nodes that return Parallelizable in a dynamic DAG should not be mutable. Obviously, it's @elijahbenizzy's and @skrawcz's call whether the additional type is added, but perhaps using a collections.abc.Sequence instead of a list would preserve the original intent?

Note that I have run into something similar, but I just ended up casting/converting the Parallelizable return to the appropriate type 😅. What might be nice down the road (again, not my call) would be some sort of support for typing.Annotated (for example, Annotated[list[str], Parallelizable]) to signal to Hamilton that the node is parallelizable while still retaining the original type. However, there is still the issue that the return type needs to be iterable, so one might want to wait and see if something like PEP 746 is accepted. Anyway, just my random 2 cents 😄
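For concreteness, the typing.Annotated idea floated above would look roughly like this; Hamilton does not support it today, so the snippet only illustrates the proposed annotation shape:

from typing import Annotated

from hamilton.htypes import Parallelizable

def hello() -> Annotated[list[str], Parallelizable]:
    # The declared type stays list[str]; Parallelizable is just metadata that
    # would (hypothetically) tell Hamilton to fan this node out.
    return ["h", "a", "b"]

b = hello()[0]  # type checkers see list[str], so b is inferred as str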

@skrawcz (Collaborator) commented Jan 17, 2025

@cswartzvi @EltonCN just spitballing: would this be an okay API alternative?

from typing import Generator

from hamilton.function_modifiers import parallelizable  # hypothetical import for this proposal

@parallelizable
def hello1() -> list[str]:  # allow lists and anything that could be a sequence, etc.
    return ["h", "a", "b"]

@parallelizable
def hello2() -> Generator[str, None, None]:  # allow generators
    for letter in ["h", "a", "b"]:
        yield letter

a = hello1()  # this would be a list
b = a[0]  # this would be a string

for c in hello2():
    c  # this would be a string

Then for the collect we'd have to figure out something equivalent...

from hamilton.function_modifiers import collect, source

@collect("node_name") 
def foo(node_name: list[TYPE]) -> ...

@EltonCN (Author) commented Jan 17, 2025

@skrawcz Yes, this seems to be a better alternative than what I proposed, as the type annotation would directly indicate the returned type.

@zilto (Collaborator) commented Jan 18, 2025

@skrawcz Changes to Parallelizable/Collect could break caching, so it's worth adding a few tests to tests/caching/test_integration.py.

If you face issues, the work by @cswartzvi on TaskExecutionHook #1269 would help make caching more maintainable. It could be worth it to merge this PR first.

@elijahbenizzy (Collaborator) commented:

So I see the value here, but the cynic in me wonders about the value of making it polymorphic in terms of the type being parallelized. E.g. we have two possibilities here:

  1. Extending the framework to add ParallelizableList -- this enables the type-checking benefit.
  2. Casting it to a list after the fact -- Parallelizable is effectively an iterator, which is a smaller contract (and less internal surface area for the framework).

So it's a trade-off between adding something to the framework that could potentially mean more work down the road, or working around it on the client side 🤔 The reason that it's an iterable is that we don't necessarily want to support the "cast" case, in which we cast everything to a list... E.g. in a longer stream we may not want to store everything in memory -- although the framework currently does maintain an internal cache, we'd like the option of moving towards a streaming mode later.

Another option is to keep the changes you have except for exposing the list -- what you've added is just better code/more robust, and it allows you, as the user, to subclass it! That way you could define the ParallelizableList and it would work, but the fact that it's a list would not be known to the framework. Thoughts?
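As a rough sketch of that last option, a user could define something like the following on their side, assuming the framework's check is loosened to accept subclasses of Parallelizable (the class name and body here are illustrative, not part of the PR):

from typing import List, TypeVar

from hamilton.htypes import Parallelizable

T = TypeVar("T")

class ParallelizableList(List[T], Parallelizable[T]):
    """Behaves like a list for type checkers, while still being a Parallelizable subclass."""

def hello() -> ParallelizableList[str]:
    return ParallelizableList(["h", "a", "b"])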

Regarding a decorator -- it's the same thing, allowing more complexity/removing the contract. But there is something nice about it -- it effectively puts the user in control of how it's implemented, giving more freedom, but restricting the assumptions we can make later. So I think I like it; my only hesitancy is that I don't want to add another API for how to do this in this version of Hamilton. Having too many ways to do things is a code smell, even if we're moving towards the "better" one. And it's additional complexity.

@EltonCN (Author) commented Jan 20, 2025

Perhaps a compromise for the decorator would be to add it, but for now, when Hamilton identifies it, internally change the function's annotation to use Parallelizable/Collect and treat the function the same way it already does (the function still needs to return an Iterable). This way, there is no need to change much of the code now, but this alternative can be tested, showing both in the documentation. Maybe check over time which one is used more via telemetry?

There is still the problem of having 2 ways of defining the same thing...

I also don't know exactly which assumptions could no longer be made, but a decorator seems more expressive to me than the annotation, since it would make it possible to configure other aspects of how the return is parallelized.

In any case, when Hamilton gets the return value to parallelize, it would always treat it as an Iterable (since both list and Generator are subclasses of it). As for keeping the elements in memory when it is a list, I think it is just a matter of letting the user know that this will happen. I still haven't been able to understand exactly what the problem would be with providing ParallelizableList directly, but adding just the "is_parallelizable" check also seems sufficient to me, and adding the decorator can be rethought/planned by you for a future version.
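As a rough sketch of that compile-down compromise (purely hypothetical; no such decorator exists in Hamilton today, and all names here are illustrative):

import typing

from hamilton.htypes import Parallelizable

def parallelizable(fn):
    # Rewrite the declared return type so the rest of Hamilton sees the node
    # exactly as if it had been annotated with Parallelizable[...] directly.
    hints = typing.get_type_hints(fn)
    return_type = hints.get("return", typing.Any)
    element_type = (typing.get_args(return_type) or (typing.Any,))[0]
    fn.__annotations__["return"] = Parallelizable[element_type]
    return fn

@parallelizable
def hello() -> list[str]:  # IDEs and type checkers still see list[str]
    return ["h", "a", "b"]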

@elijahbenizzy (Collaborator) commented:

> Perhaps a compromise for the decorator would be to add it, but for now, when Hamilton identifies it, internally change the function's annotation to use Parallelizable/Collect and treat the function the same way it already does (the function still needs to return an Iterable). This way, there is no need to change much of the code now, but this alternative can be tested, showing both in the documentation. Maybe check over time which one is used more via telemetry?
>
> There is still the problem of having 2 ways of defining the same thing...
>
> I also don't know exactly which assumptions could no longer be made, but a decorator seems more expressive to me than the annotation, since it would make it possible to configure other aspects of how the return is parallelized.
>
> In any case, when Hamilton gets the return value to parallelize, it would always treat it as an Iterable (since both list and Generator are subclasses of it). As for keeping the elements in memory when it is a list, I think it is just a matter of letting the user know that this will happen. I still haven't been able to understand exactly what the problem would be with providing ParallelizableList directly, but adding just the "is_parallelizable" check also seems sufficient to me, and adding the decorator can be rethought/planned by you for a future version.

Yeah -- adding the decorator but having it compile down to the current way it works is the same idea -- it narrows the surface area.

Regarding limitations on using it directly -- it's more about what we'd like to do in the future, e.g. allowing for a potential future in which we have more of a streaming pipeline. That said, we don't do that now...

IMO we should just have the is_parallelizable function + any of the other changes -- this allows you to define a ParallelizableList. I also think that you should contribute this back as an example (e.g. using your own parallelizable implementation). Thoughts?

@EltonCN (Author) commented Jan 27, 2025

@elijahbenizzy I implemented the example and added a link to it in the docs. "ParallelizableList" was removed from the main code.

I also needed to change "validade_example" because it didn't work on Windows.

As for the CI errors, I'm not sure whether they come from a modification I made, since they are in the integrations...
