ChatQnA: Adapt to latest changes (#727)
* ChatQnA: Adapt to latest changes

Adapt ChatQnA to the following recent changes:

- make vLLM the default inference engine
- adapt to the latest changes in the data-prep, retriever, and guardrails components
- sync with the Docker Compose files in GenAIExamples

Signed-off-by: Lianhao Lu <lianhao.lu@intel.com>

* Temporarily disable the UI test pod

Workaround for issue opea-project/GenAIExamples#1441.

Signed-off-by: Lianhao Lu <lianhao.lu@intel.com>

---------

Signed-off-by: Lianhao Lu <lianhao.lu@intel.com>
lianhao authored Jan 21, 2025
1 parent 3e35874 commit 2d2e68c
Showing 14 changed files with 195 additions and 188 deletions.
30 changes: 15 additions & 15 deletions helm-charts/chatqna/README.md
@@ -23,18 +23,17 @@ cd GenAIInfra/helm-charts/
helm dependency update chatqna
export HFTOKEN="insert-your-huggingface-token-here"
export MODELDIR="/mnt/opea-models"
export MODELNAME="Intel/neural-chat-7b-v3-3"
# If you would like to use the traditional UI, please change the image as well as the containerport within the values
# append these at the end of the command "--set chatqna-ui.image.repository=opea/chatqna-ui,chatqna-ui.image.tag=latest,chatqna-ui.containerPort=5173"
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME}
export MODELNAME="meta-llama/Meta-Llama-3-8B-Instruct"
# To use CPU with vLLM
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set vllm.LLM_MODEL_ID=${MODELNAME}
# To use Gaudi device with vLLM
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set vllm.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-vllm-values.yaml
# To use CPU with TGI
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/cpu-tgi-values.yaml
# To use Gaudi device with TGI
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-tgi-values.yaml
# To use Gaudi device with vLLM
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-vllm-values.yaml
# To use Nvidia GPU
# To use Nvidia GPU with TGI
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/nv-values.yaml
# To include guardrail component in chatqna on Xeon with TGI
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-values.yaml
# To include guardrail component in chatqna on Gaudi with TGI
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-gaudi-values.yaml
```
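After any of the install variants above, a quick sanity check confirms the release came up. A minimal sketch, assuming the release name `chatqna`, the default namespace, and the common `app.kubernetes.io/instance` label convention used by Helm charts (the exact label on this chart may differ):

```bash
# Sketch: check that all ChatQnA pods reach Ready state.
# Assumes release name "chatqna" and the standard Helm instance label.
kubectl get pods -l app.kubernetes.io/instance=chatqna
# Model download and warmup can take many minutes, especially on CPU:
kubectl wait --for=condition=ready pod -l app.kubernetes.io/instance=chatqna --timeout=30m
```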
@@ -74,12 +73,13 @@ Open a browser to access `http://<k8s-node-ip-address>:${port}` to play with the

## Values

| Key | Type | Default | Description |
| ----------------- | ------ | ----------------------------- | -------------------------------------------------------------------------------------- |
| image.repository | string | `"opea/chatqna"` | |
| service.port | string | `"8888"` | |
| tgi.LLM_MODEL_ID | string | `"Intel/neural-chat-7b-v3-3"` | Models id from https://huggingface.co/, or predownloaded model directory |
| global.monitoring | bool | `false` | Enable usage metrics for the service components. See ../monitoring.md before enabling! |
| Key | Type | Default | Description |
| ----------------- | ------ | --------------------------------------- | -------------------------------------------------------------------------------------- |
| image.repository | string | `"opea/chatqna"` | |
| service.port | string | `"8888"` | |
| tgi.LLM_MODEL_ID | string | `"meta-llama/Meta-Llama-3-8B-Instruct"` | Inference models for TGI |
| vllm.LLM_MODEL_ID | string | `"meta-llama/Meta-Llama-3-8B-Instruct"` | Inference models for vLLM |
| global.monitoring | bool | `false` | Enable usage metrics for the service components. See ../monitoring.md before enabling! |
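
Any entry in this table can be overridden at install time with `--set`. An illustrative sketch combining the keys documented above (the flag names come from this table; the command itself is hypothetical):

```bash
# Sketch: pick the vLLM model and enable monitoring in one install.
# See ../monitoring.md before enabling global.monitoring.
helm install chatqna chatqna \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --set vllm.LLM_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct \
  --set global.monitoring=true
```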

## Troubleshooting

112 changes: 112 additions & 0 deletions helm-charts/chatqna/cpu-tgi-values.yaml
@@ -0,0 +1,112 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Override CPU resource request and probe timing values in specific subcharts
#
# RESOURCES
#
# Resource request matching actual resource usage (with enough slack)
# is important when service is scaled up, so that right amount of pods
# get scheduled to right nodes.
#
# Because resource usage depends on the used devices, model, data type
# and SW versions, and this top-level chart has overrides for them,
# resource requests need to be specified here too.
#
# To test service without resource request, use "resources: {}".
#
# PROBES
#
# Inferencing pods startup / warmup takes *much* longer on CPUs than
# with acceleration devices, and their responses are also slower,
# especially when node is running several instances of these services.
#
# Kubernetes restarting pod before its startup finishes, or not
# sending it queries because it's not in ready state due to slow
# readiness responses, does really NOT help in getting faster responses.
#
# => probe timings need to be increased when running on CPU.

vllm:
  enabled: false
tgi:
  enabled: true
  # TODO: add Helm value also for TGI data type option:
  # https://github.com/opea-project/GenAIExamples/issues/330
  LLM_MODEL_ID: meta-llama/Meta-Llama-3-8B-Instruct

  # Potentially suitable values for scaling CPU TGI 2.2 with Intel/neural-chat-7b-v3-3 @ 32-bit:
  #resources:
  #  limits:
  #    cpu: 8
  #    memory: 70Gi
  #  requests:
  #    cpu: 6
  #    memory: 65Gi

  livenessProbe:
    initialDelaySeconds: 8
    periodSeconds: 8
    failureThreshold: 24
    timeoutSeconds: 4
  readinessProbe:
    initialDelaySeconds: 16
    periodSeconds: 8
    timeoutSeconds: 4
  startupProbe:
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 180
    timeoutSeconds: 2

teirerank:
  RERANK_MODEL_ID: "BAAI/bge-reranker-base"

  # Potentially suitable values for scaling CPU TEI v1.5 with BAAI/bge-reranker-base model:
  resources:
    limits:
      cpu: 4
      memory: 30Gi
    requests:
      cpu: 2
      memory: 25Gi

  livenessProbe:
    initialDelaySeconds: 8
    periodSeconds: 8
    failureThreshold: 24
    timeoutSeconds: 4
  readinessProbe:
    initialDelaySeconds: 8
    periodSeconds: 8
    timeoutSeconds: 4
  startupProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 120

tei:
  EMBEDDING_MODEL_ID: "BAAI/bge-base-en-v1.5"

  # Potentially suitable values for scaling CPU TEI 1.5 with BAAI/bge-base-en-v1.5 model:
  resources:
    limits:
      cpu: 4
      memory: 4Gi
    requests:
      cpu: 2
      memory: 3Gi

  livenessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 24
    timeoutSeconds: 2
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 2
  startupProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 120
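
With vLLM now the default engine, TGI on CPU becomes an opt-in variant selected through this new file. A sketch of applying it, mirroring the CPU-with-TGI command in the README above:

```bash
# Sketch: opt back into TGI on CPU via the new override file
# (same flags as the README's "To use CPU with TGI" example).
helm install chatqna chatqna \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --set global.modelUseHostPath=${MODELDIR} \
  --set tgi.LLM_MODEL_ID=${MODELNAME} \
  -f chatqna/cpu-tgi-values.yaml
```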
110 changes: 3 additions & 107 deletions helm-charts/chatqna/cpu-values.yaml
@@ -1,109 +1,5 @@
# Copyright (C) 2024 Intel Corporation
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Override CPU resource request and probe timing values in specific subcharts
#
# RESOURCES
#
# Resource request matching actual resource usage (with enough slack)
# is important when service is scaled up, so that right amount of pods
# get scheduled to right nodes.
#
# Because resource usage depends on the used devices, model, data type
# and SW versions, and this top-level chart has overrides for them,
# resource requests need to be specified here too.
#
# To test service without resource request, use "resources: {}".
#
# PROBES
#
# Inferencing pods startup / warmup takes *much* longer on CPUs than
# with acceleration devices, and their responses are also slower,
# especially when node is running several instances of these services.
#
# Kubernetes restarting pod before its startup finishes, or not
# sending it queries because it's not in ready state due to slow
# readiness responses, does really NOT help in getting faster responses.
#
# => probe timings need to be increased when running on CPU.

tgi:
  # TODO: add Helm value also for TGI data type option:
  # https://github.com/opea-project/GenAIExamples/issues/330
  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3

  # Potentially suitable values for scaling CPU TGI 2.2 with Intel/neural-chat-7b-v3-3 @ 32-bit:
  resources:
    limits:
      cpu: 8
      memory: 70Gi
    requests:
      cpu: 6
      memory: 65Gi

  livenessProbe:
    initialDelaySeconds: 8
    periodSeconds: 8
    failureThreshold: 24
    timeoutSeconds: 4
  readinessProbe:
    initialDelaySeconds: 16
    periodSeconds: 8
    timeoutSeconds: 4
  startupProbe:
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 180
    timeoutSeconds: 2

teirerank:
  RERANK_MODEL_ID: "BAAI/bge-reranker-base"

  # Potentially suitable values for scaling CPU TEI v1.5 with BAAI/bge-reranker-base model:
  resources:
    limits:
      cpu: 4
      memory: 30Gi
    requests:
      cpu: 2
      memory: 25Gi

  livenessProbe:
    initialDelaySeconds: 8
    periodSeconds: 8
    failureThreshold: 24
    timeoutSeconds: 4
  readinessProbe:
    initialDelaySeconds: 8
    periodSeconds: 8
    timeoutSeconds: 4
  startupProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 120

tei:
  EMBEDDING_MODEL_ID: "BAAI/bge-base-en-v1.5"

  # Potentially suitable values for scaling CPU TEI 1.5 with BAAI/bge-base-en-v1.5 model:
  resources:
    limits:
      cpu: 4
      memory: 4Gi
    requests:
      cpu: 2
      memory: 3Gi

  livenessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 24
    timeoutSeconds: 2
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 2
  startupProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 120
image:
  repository: opea/chatqna
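
With the TGI tuning moved out to cpu-tgi-values.yaml, the CPU default file shrinks to just the image selection, so a plain CPU install now runs vLLM. A sketch, mirroring the README's default command above:

```bash
# Sketch: default CPU install, now served by vLLM rather than TGI.
helm install chatqna chatqna \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --set global.modelUseHostPath=${MODELDIR} \
  --set vllm.LLM_MODEL_ID=${MODELNAME}
```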
helm-charts/chatqna/gaudi-tgi-values.yaml
@@ -4,12 +4,15 @@
# Accelerate inferencing in heaviest components to improve performance
# by overriding their subchart values

vllm:
  enabled: false
# TGI: largest bottleneck for ChatQnA
tgi:
  enabled: true
  accelDevice: "gaudi"
  image:
    repository: ghcr.io/huggingface/tgi-gaudi
    tag: "2.0.6"
    tag: "2.3.1"
  resources:
    limits:
      habana.ai/gaudi: 1
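Because vLLM is now the default, this TGI-on-Gaudi configuration must be selected explicitly by passing the values file, as in the README example above:

```bash
# Sketch: TGI on Gaudi via the override file (per the README above).
helm install chatqna chatqna \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --set global.modelUseHostPath=${MODELDIR} \
  --set tgi.LLM_MODEL_ID=${MODELNAME} \
  -f chatqna/gaudi-tgi-values.yaml
```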
5 changes: 2 additions & 3 deletions helm-charts/chatqna/gaudi-vllm-values.yaml
@@ -6,9 +6,9 @@

tgi:
  enabled: false

vllm:
  enabled: true
  shmSize: 1Gi
  accelDevice: "gaudi"
  image:
    repository: opea/vllm-gaudi
@@ -19,7 +19,7 @@ vllm:
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 1
    failureThreshold: 120
    failureThreshold: 180
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
@@ -39,7 +39,6 @@ vllm:
"--max-seq_len-to-capture", "2048"
]


# Reranking: second largest bottleneck when reranking is in use
# (i.e. query context docs have been uploaded with data-prep)
#
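The README's Gaudi-with-vLLM example applies this file; note the startup probe failureThreshold rises from 120 to 180, presumably to tolerate slower warmup. A sketch:

```bash
# Sketch: vLLM on Gaudi via the override file (per the README above).
helm install chatqna chatqna \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --set global.modelUseHostPath=${MODELDIR} \
  --set vllm.LLM_MODEL_ID=${MODELNAME} \
  -f chatqna/gaudi-vllm-values.yaml
```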
44 changes: 24 additions & 20 deletions helm-charts/chatqna/guardrails-gaudi-values.yaml
@@ -44,17 +44,18 @@ teirerank:
  readinessProbe:
    timeoutSeconds: 1

tgi:
tgi-guardrails:
  enabled: true
  accelDevice: "gaudi"
  LLM_MODEL_ID: "meta-llama/Meta-Llama-Guard-2-8B"
  image:
    repository: ghcr.io/huggingface/tgi-gaudi
    tag: "2.0.6"
    tag: "2.3.1"
  resources:
    limits:
      habana.ai/gaudi: 1
  # higher limits are needed with extra input tokens added by rerank
  MAX_INPUT_LENGTH: "2048"
  MAX_TOTAL_TOKENS: "4096"
  MAX_INPUT_LENGTH: "1024"
  MAX_TOTAL_TOKENS: "2048"
  CUDA_GRAPHS: ""
  OMPI_MCA_btl_vader_single_copy_mechanism: "none"
  ENABLE_HPU_GRAPH: "true"
@@ -75,34 +76,37 @@ tgi:
    timeoutSeconds: 1
    failureThreshold: 120

tgi-guardrails:
tgi:
  enabled: false
vllm:
  enabled: true
  shmSize: 1Gi
  accelDevice: "gaudi"
  LLM_MODEL_ID: "meta-llama/Meta-Llama-Guard-2-8B"
  image:
    repository: ghcr.io/huggingface/tgi-gaudi
    tag: "2.0.6"
    repository: opea/vllm-gaudi
  resources:
    limits:
      habana.ai/gaudi: 1
  MAX_INPUT_LENGTH: "1024"
  MAX_TOTAL_TOKENS: "2048"
  CUDA_GRAPHS: ""
  OMPI_MCA_btl_vader_single_copy_mechanism: "none"
  ENABLE_HPU_GRAPH: "true"
  LIMIT_HPU_GRAPH: "true"
  USE_FLASH_ATTENTION: "true"
  FLASH_ATTENTION_RECOMPUTE: "true"
  livenessProbe:
  startupProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 1
    failureThreshold: 180
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 1
  startupProbe:
  livenessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 1
    failureThreshold: 120

  PT_HPU_ENABLE_LAZY_COLLECTIVES: "true"
  OMPI_MCA_btl_vader_single_copy_mechanism: "none"

  extraCmdArgs: [
    "--tensor-parallel-size", "1",
    "--block-size", "128",
    "--max-num-seqs", "256",
    "--max-seq_len-to-capture", "2048"
  ]
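
In this variant the guard model now runs under the dedicated tgi-guardrails section while the main LLM moves to vLLM. As in the README example above, the variant is selected by passing the values file:

```bash
# Sketch: ChatQnA with guardrails on Gaudi (per the README above).
helm install chatqna chatqna \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --set global.modelUseHostPath=${MODELDIR} \
  -f chatqna/guardrails-gaudi-values.yaml
```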
