[Bug] chatqna: xeon pipeline fails (serious performance drop) when CPU affinity of tei and teirerank containers is managed #763
Comments
Some potentially relevant options from the TEI readme, https://github.com/huggingface/text-embeddings-inference/blob/main/README.md, which could be used when the TEI containers are running on CPUs:
> Currently only …
This bug blocks a proper ChatQnA platform optimization demo on Xeon.
@eero-t, thanks for the pointers. Adjusting the tokenization workers …
When run with a limited number of tokenization workers, the relevant log lines look like this:
while the corresponding lines without the tokenization-workers limit are:
It looks like there are two pools of threads: tokenization workers and the model backend. By default both contain as many threads as there are physical CPU cores in the whole system. Each model backend thread tries to set its CPU affinity to both hyperthreads of one physical CPU core (in the output above, "mask: 1, 65" are hyperthreads of the same core). Obviously only a few succeed: the ones that happen to request affinity to CPUs included in the container's allowed CPU set. The threads in the tokenization worker pool get no CPU affinity at all. @yinghu5, @yongfengdu, I think this issue is not limited to Kubernetes. The same problem is expected when using Docker with a restricted CPU set.
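For reference, the per-thread affinities described above can be cross-checked from the host. Below is a minimal sketch (not from the original thread) that reads the `Cpus_allowed_list` field from `/proc` for every thread of a given PID, e.g. the TEI container's main process:

```python
#!/usr/bin/env python3
"""Dump the CPU affinity of every thread of a process (Linux only)."""
import os
import sys

def thread_affinities(pid: int) -> dict:
    """Return {tid: Cpus_allowed_list} for all threads of `pid`."""
    result = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/status") as f:
                for line in f:
                    if line.startswith("Cpus_allowed_list:"):
                        result[int(tid)] = line.split(":", 1)[1].strip()
                        break
        except FileNotFoundError:
            pass  # thread exited while we were iterating
    return result

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for tid, cpus in sorted(thread_affinities(pid).items()):
        print(f"tid {tid}: allowed CPUs {cpus}")
```

In a correctly behaving container, every thread's allowed CPU list should be a subset of the container's cpuset; threads pinned to CPUs outside it indicate the problem described above.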
Opened two more precisely targeted bug reports against text-embeddings-inference, because the issue linked above was written as a feature request. This is a bug that, for comparison, does not exist in text-generation-inference. Links to the issues:
Did you find any ENV/parameter settings that can work around this?
Yes. I did not see any effect; there were still too many threads and CPU affinity errors. (Even if there were a workaround that dropped the thread count down to 1, it would also drop inference performance to a fraction of what it could be when using all allowed CPUs...)
Upstream fixed the related issues with huggingface/text-embeddings-inference#410 (exactly 2 weeks ago). It is assumed this can be closed once there is a new TEI release and OPEA is updated to it.
Priority
P2-High
OS type
Ubuntu
Hardware type
Xeon-SPR
Installation method
Deploy method
Running nodes
Single Node
What's the version?
Observed with the latest chatqna.yaml (git 67394b8), where the tei and teirerank containers use the image:
ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
Description
When CPU affinity is managed on a node (with NRI resource policies or the Kubernetes cpu-manager) and ChatQnA/kubernetes/manifests/xeon/chatqna.yaml is deployed, the tei and teirerank containers do not handle their internal threading and thread-to-CPU affinities properly.
They seem to create a thread for every CPU in the system, whereas they should create a thread only for every CPU allowed for the container.
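For illustration (not part of the original report), the distinction between "every CPU in the system" and "every CPU allowed for the container" corresponds to the two values in the sketch below; a thread pool sized from the first value oversubscribes a CPU-limited container:

```python
import os

# All CPUs present in the system, regardless of cgroup/affinity limits.
total_cpus = os.cpu_count()

# CPUs this process is actually allowed to run on (Linux only; respects
# the container's cpuset and any pinning applied by the runtime).
allowed_cpus = len(os.sched_getaffinity(0))

print(f"system CPUs:  {total_cpus}")
print(f"allowed CPUs: {allowed_cpus}")
# Sizing a thread pool from `total_cpus` inside a CPU-limited container
# creates far more threads than there are allowed CPUs, which is the
# behaviour described in this report.
```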
In the logs it looks like this:
And at the system level, the process/thread CPU affinities look like this:
That is, only a few threads got correct CPU pinning; the rest (of which there are far too many) run on all CPUs allowed for the container. As a result, this destroys the performance of tei and teirerank on CPU.
The log suggests that the ort library is trying to create a thread and set affinity for every CPU in the system, while it should not try to use any CPUs other than the allowed ones (limited by cgroups cpuset.cpus). I cannot say whether the root cause is in the ort library itself or in how it is used here.
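As an aside, the cpuset limit referred to above can be checked from inside the container. A minimal sketch, assuming typical cgroup v1/v2 mount points (the paths are assumptions, not taken from the report):

```python
import os

# Candidate locations of the effective cpuset; cgroup v2 first, then v1.
CPUSET_FILES = [
    "/sys/fs/cgroup/cpuset.cpus.effective",         # cgroup v2
    "/sys/fs/cgroup/cpuset/cpuset.effective_cpus",  # cgroup v1
    "/sys/fs/cgroup/cpuset/cpuset.cpus",            # cgroup v1 fallback
]

def effective_cpuset():
    """Return the cpuset string (e.g. '0-15,32-47') or None if not found."""
    for path in CPUSET_FILES:
        if os.path.exists(path):
            with open(path) as f:
                value = f.read().strip()
            if value:
                return value
    return None

print("cpuset limit:", effective_cpuset() or "not found (no cpuset cgroup?)")
```

Comparing this value with the number of threads created by the service shows whether the service respects the container's CPU limit.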
Reproduce steps
Raw log
No response