Attributes #95

Merged 4 commits into BeastByteAI:main on Jun 8, 2024
Conversation

AndreasKarasenko
Contributor

Fixes #93. Also propagates key and org to the vectorizer at an earlier stage if they are set through the model.
However, key and org might be exposed more easily. Even now it is possible to set key and org as below, which still exposes them through __dict__:

```python
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset
from skllm.config import SKLLMConfig

API_KEY = "..."
SKLLMConfig.set_openai_key(API_KEY)

# demo sentiment analysis dataset
# labels: positive, negative, neutral
X, y = get_classification_dataset()

clf = ZeroShotGPTClassifier(model="gpt-3.5-turbo", key=API_KEY)
clf.__dict__  # exposes API_KEY - old version also exposes API_KEY
# {'model': 'gpt-3.5-turbo', 'default_label': 'Random', 'prompt_template': None, 'openai_key': '...', 'openai_org': None}
clf  # exposes API_KEY - old version threw an error
```

@AndreasKarasenko
Contributor Author

Just FYI: if we solve the AttributeError one way or another I can make some headway on using the scikit-learn internals to parallelize the predict function to gain some speedup.

Collaborator

@OKUA1 OKUA1 left a comment


Hi @AndreasKarasenko. Thank you for the contribution. Please see the comments.

```python
        self.org = org

        if self.key is None:
            self.key = self._get_openai_key()
```
Collaborator


To be honest, I would not store the key from the global config inside the estimator. First of all, the idea of the global config is that the key can be dynamically propagated into all estimators (unless the estimator was provided with its own key) even after instantiation.

In addition, when the key is stored globally, the object can be safely pickled, whereas otherwise it would be necessary to remove the key first.

In other words, there should be support for per-estimator keys for cases when this is absolutely needed, but this should not be the default behavior.

Contributor Author


Agreed, I removed it.

```diff
         raise RuntimeError("OpenAI key was not found")
-        return key
+        return openai_key

     def _get_openai_org(self) -> str:
```
Collaborator


Since we already touched the refactoring of this mixin, it would be nice to also remove all the references of openai as these names are very unfortunate remnants from the early versions.

P.S. If you don't want to spend time on this, it is also fine to leave it as is.

```diff
@@ -212,7 +218,7 @@ def _get_embeddings(self, text: np.ndarray) -> List[List[float]]:

 # for now this works only with OpenAI
 class GPTTunableMixin(BaseTunableMixin):
-    _supported_tunable_models = ["gpt-3.5-turbo-0125", "gpt-3.5-turbo"]
+    _supported_tunable_models = ["gpt-3.5-turbo-0613", "gpt-3.5-turbo"]
```
Collaborator


Is there a reason for this change?

Contributor Author


That was a mistake; I'm not sure how it got in.

@OKUA1
Collaborator

OKUA1 commented May 31, 2024

@AndreasKarasenko, regarding the parallelization: implementing it would be a very nice improvement, as OpenAI is much more generous with rate limits now and parallel requests won't immediately consume the whole quota. However, I don't think there is a strong need to use any scikit-learn internals for that. predict() is basically just a for-loop, which can be parallelized with a ThreadPoolExecutor (the GIL is also released during the API call).
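A minimal sketch of that idea, with a placeholder `_predict_single` standing in for the real per-sample API call (names are illustrative, not skllm's actual internals):

```python
from concurrent.futures import ThreadPoolExecutor


def _predict_single(x):
    # Placeholder for the per-sample API call; in the real classifier
    # this would send a prompt to the model and parse the label.
    return f"label_for_{x}"


def predict(X, num_workers=4):
    # The per-sample calls are independent and I/O-bound, so threads
    # suffice: the GIL is released while waiting on the network, and
    # pool.map preserves the input order of the results.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(_predict_single, X))


print(predict(["a", "b", "c"]))
```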

@AndreasKarasenko
Contributor Author

I have done the minor updates requested and will work on a separate PR for the parallelization; I'm not sure how quick I will be, though. I agree that threads are a better choice than spawning whole processes.
I'm also not sure yet how I will handle the KNN part of the Dynamic FSL, but let's see.

@AndreasKarasenko
Contributor Author

For now I have a rudimentary parallel predict that I'll use for a paper over the weekend.
The kneighbors call on the memory index should be thread-safe, so calling retrieve from multiple threads should not be an issue. In that sense, maybe I will add a PR for this?

Throttling when too many calls are made would be nice. The issue remains that RPM and TPM depend on the model and are subject to change.
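One possible shape for such throttling is a simple spacing-based limiter shared across worker threads (a naive sketch, not part of skllm; the `RateLimiter` name and the RPM value are made up, and real limits vary per model and account, which is exactly the caveat above):

```python
import threading
import time


class RateLimiter:
    """Naive requests-per-minute throttle, safe to call from many threads."""

    def __init__(self, rpm):
        # Minimum spacing between requests to stay under the RPM cap.
        self.interval = 60.0 / rpm
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def acquire(self):
        # Reserve the next available time slot under the lock, then
        # sleep outside the lock so other threads can queue up.
        with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_slot - now)
            self._next_slot = max(now, self._next_slot) + self.interval
        if wait > 0:
            time.sleep(wait)


limiter = RateLimiter(rpm=600)  # e.g. allow ~10 requests per second
limiter.acquire()               # call before each API request, in any thread
```

Token-per-minute (TPM) limits would need a separate budget keyed on estimated token counts, which is harder to get right since usage is only known after the response.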

@OKUA1
Collaborator

OKUA1 commented Jun 7, 2024

Hi @AndreasKarasenko

Thanks a lot for the changes. I think we can merge now, but will have a closer look a bit later.

Regarding the parallelism:

  • It is better to open a separate issue first so we have a central place for a discussion.
  • You can create a PR for the implementation you already have. However, in the longer term it is probably better to have num_workers as an init argument. So maybe you can immediately add a deprecation warning or something similar (when num_workers > 1) to indicate that the API is temporary.
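A sketch of what that provisional-API warning could look like (`ParallelClassifier` and the message text are illustrative only, not the agreed design):

```python
import warnings


class ParallelClassifier:
    def __init__(self, num_workers=1):
        if num_workers > 1:
            # Flag the parallel API as provisional, as suggested above,
            # so callers are not surprised if the argument changes later.
            warnings.warn(
                "num_workers > 1 is experimental and its API may change "
                "in a future release.",
                FutureWarning,
                stacklevel=2,
            )
        self.num_workers = num_workers
```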

@OKUA1 OKUA1 merged commit bd8e13a into BeastByteAI:main Jun 8, 2024
1 check passed
Successfully merging this pull request may close these issues.

AttributeError: 'ZeroShotGPTClassifier' object has no attribute 'key'