Attributes #95
Conversation
Just FYI: once we solve the AttributeError one way or another, I can make some headway on using the scikit-learn internals to parallelize the predict function and gain some speedup.
Hi @AndreasKarasenko. Thank you for the contribution. Please see the comments.
skllm/llm/gpt/mixin.py (Outdated)
```python
self.org = org

if self.key is None:
    self.key = self._get_openai_key()
```
To be honest, I would not store the key from the global config inside the estimator. First of all, the idea of the global config is that the key can be dynamically propagated into all estimators (unless the estimator was provided with its own key) even after instantiation.
In addition, when the key is stored globally, the object can be safely pickled, whereas otherwise it would be necessary to remove the key first.
In other words, there should be support for per-estimator keys for cases when this is absolutely needed, but this should not be the default behavior.
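For context, a minimal sketch of the two patterns being contrasted (the `SKLLMConfig` calls correspond to the library's global config; the per-estimator `key` argument and the import path are assumptions for illustration):

```python
from skllm.config import SKLLMConfig
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

# Global config: the key is propagated into all estimators, even after
# instantiation, and the estimator itself remains safely picklable.
SKLLMConfig.set_openai_key("sk-...")

clf = ZeroShotGPTClassifier(model="gpt-3.5-turbo")

# Per-estimator key (assumed parameter name): supported for cases where it
# is absolutely needed, but not the default behavior.
clf_own_key = ZeroShotGPTClassifier(model="gpt-3.5-turbo", key="sk-...")
```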
Agreed, I removed it.
skllm/llm/gpt/mixin.py (Outdated)
```diff
         raise RuntimeError("OpenAI key was not found")
-        return key
+        return openai_key

     def _get_openai_org(self) -> str:
```
Since we have already touched the refactoring of this mixin, it would be nice to also remove all the references to openai, as these names are very unfortunate remnants from the early versions.
P.S. If you don't want to spend time on this, it is also fine to leave it as is.
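For illustration, the kind of rename this refers to might look like the following (the provider-agnostic names are assumptions, not the final API):

```python
# Before: openai-specific names left over from early versions.
def _get_openai_key(self) -> str: ...
def _get_openai_org(self) -> str: ...

# After: provider-agnostic names that fit a multi-backend mixin.
def _get_key(self) -> str: ...
def _get_org(self) -> str: ...
```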
skllm/llm/gpt/mixin.py (Outdated)
```diff
@@ -212,7 +218,7 @@ def _get_embeddings(self, text: np.ndarray) -> List[List[float]]:

 # for now this works only with OpenAI
 class GPTTunableMixin(BaseTunableMixin):
-    _supported_tunable_models = ["gpt-3.5-turbo-0125", "gpt-3.5-turbo"]
+    _supported_tunable_models = ["gpt-3.5-turbo-0613", "gpt-3.5-turbo"]
```
Is there a reason for this change?
That was a mistake; I'm not sure how it got in.
@AndreasKarasenko, regarding the parallelization: implementing it would be a very nice improvement, as OpenAI is much more generous with the rate limits and parallel requests won't immediately consume all of the quota. However, I don't think there is a strong need to use any scikit-learn internals for that.
I have done the minor updates requested and will work on a separate PR for the parallelization; I'm not sure how quick I will be, though. I agree that threads are a better choice than spawning a whole process.
For now I have a rudimentary parallel predict that I'll use for a paper over the weekend. Throttling when too many calls are made would be nice; the issue remains that the requests-per-minute (RPM) and tokens-per-minute (TPM) limits depend on the model and are subject to change. A sketch of what such throttled parallelism could look like is below.
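A minimal sketch of a thread-based parallel predict with naive RPM throttling (everything here, including `_predict_single`, `rpm_limit`, and `max_workers`, is an assumption for illustration rather than the actual scikit-llm API):

```python
import time
import threading
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Spaces out calls so that at most `rpm` of them start per minute."""

    def __init__(self, rpm: int):
        self.interval = 60.0 / rpm  # minimum spacing between call starts
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def acquire(self) -> None:
        # Reserve the next start slot under the lock, then sleep outside it.
        with self.lock:
            now = time.monotonic()
            wait = max(0.0, self.next_slot - now)
            self.next_slot = max(now, self.next_slot) + self.interval
        if wait > 0:
            time.sleep(wait)

def parallel_predict(estimator, X, max_workers: int = 8, rpm_limit: int = 500):
    """Predict each sample in a thread pool, throttled to `rpm_limit` calls/min."""
    limiter = RateLimiter(rpm_limit)

    def predict_one(x):
        limiter.acquire()  # block until the next RPM slot is free
        return estimator._predict_single(x)  # assumed per-sample helper

    # Threads (not processes) are enough here: the work is I/O-bound API calls.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict_one, X))
```

Respecting TPM limits would additionally require estimating token counts per request, which this sketch deliberately omits.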
Thanks a lot for the changes. I think we can merge now, but I will have a closer look a bit later. Regarding the parallelism:
Fixes #93. Also propagates key, org to the vectorizer at an earlier stage if they are set through the model. However, key, org might be more easily exposed. At the same time, it is currently possible to set key, org as shown below, which still exposes them through `__dict__`.
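The snippet originally referenced here was not preserved; a hypothetical reconstruction of the pattern (parameter names assumed) is:

```python
# Passing a per-estimator key stores it as a plain instance attribute,
# so it remains visible via __dict__ (and would be serialized on pickling).
clf = ZeroShotGPTClassifier(model="gpt-3.5-turbo", key="sk-...", org="org-...")
print(clf.__dict__)  # the raw key and org appear among the attributes
```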