
Commit

Merge branch 'main' into jianyuzh_fix_template
NeoZhangJianyu authored Jan 16, 2025
2 parents 0f57169 + 3b76d39 commit 9d74467
Showing 25 changed files with 463 additions and 114 deletions.
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1,6 +1,6 @@
# Code owners will review PRs within their respective folders.
# Typically, ownership is organized at the second-level subdirectory under the homepage
/*/ kaokao.lv@intel.com
* kaokao.lv@intel.com
/evals/benchmark/ liang1.lv@intel.com
/evals/evaluation/ kaokao.lv@intel.com
/evals/metrics/ xinyu.ye@intel.com
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/1_bug_template.yml
@@ -138,3 +138,4 @@ body:
description: Attach any relevant files or screenshots.
validations:
required: false

2 changes: 1 addition & 1 deletion .github/license_template.txt
@@ -1,2 +1,2 @@
Copyright (C) 2024 Intel Corporation
Copyright (C) 2025 Intel Corporation
SPDX-License-Identifier: Apache-2.0
2 changes: 1 addition & 1 deletion .github/workflows/check-online-doc-build.yml
@@ -13,7 +13,7 @@ on:

jobs:
build:
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:

- name: Checkout
5 changes: 5 additions & 0 deletions .github/workflows/code_scan.yml
@@ -35,6 +35,11 @@ jobs:
- name: Checkout out Repo
uses: actions/checkout@v4

- name: Check Dangerous Command Injection
uses: opea-project/validation/actions/check-cmd@main
with:
work_dir: ${{ github.workspace }}

- name: Docker Build
run: |
docker build -f ${{ github.workspace }}/.github/workflows/docker/${{ env.DOCKER_FILE_NAME }}.dockerfile -t ${{ env.REPO_NAME }}:${{ env.REPO_TAG }} .
7 changes: 7 additions & 0 deletions .github/workflows/model_test_hpu.yml
@@ -50,6 +50,13 @@ jobs:
with:
submodules: "recursive"
fetch-tags: true

- name: Check Dangerous Command Injection
if: github.event_name == 'pull_request' || github.event_name == 'pull_request_target'
uses: opea-project/validation/actions/check-cmd@main
with:
work_dir: ${{ github.workspace }}

# We need this because GitHub needs to clone the branch to pipeline
- name: Docker Build
run: |
16 changes: 8 additions & 8 deletions evals/benchmark/stresscli/commands/config.ini
@@ -11,14 +11,14 @@ End_to_End_latency_P50 = End to End latency\(ms\),\s+P50:\s+([\d.]+)
End_to_End_latency_P90 = End to End latency\(ms\),\s+P50:[\s\d.,]+P90:\s+([\d.]+)
End_to_End_latency_P99 = End to End latency\(ms\),\s+P50:[\s\d.,]+P90:\s+[\s\d.,]+P99:\s+([\d.]+)
End_to_End_latency_Avg = End to End latency\(ms\),\s+P50:[\s\d.,]+P90:\s+[\s\d.,]+P99:\s+[\s\d.,]+Avg:\s+([\d.]+)
First_token_latency_P50 = First token latency\(ms\),\s+P50:\s+([\d.]+)
First_token_latency_P90 = First token latency\(ms\),\s+P50:[\s\d.,]+P90:\s+([\d.]+)
First_token_latency_P99 = First token latency\(ms\),\s+P50:[\s\d.,]+P90:\s+[\s\d.,]+P99:\s+([\d.]+)
First_token_latency_Avg = First token latency\(ms\),\s+P50:[\s\d.,]+P90:\s+[\s\d.,]+P99:\s+[\s\d.,]+Avg:\s+([\d.]+)
Next_token_latency_P50 = Next token latency\(ms\),\s+P50:\s+([\d.]+)
Next_token_latency_P90 = Next token latency\(ms\),\s+P50:[\s\d.,]+P90:\s+([\d.]+)
Next_token_latency_P99 = Next token latency\(ms\),\s+P50:[\s\d.,]+P90:\s+[\s\d.,]+P99:\s+([\d.]+)
Next_token_latency_Avg = Next token latency\(ms\),\s+P50:[\s\d.,]+P90:\s+[\s\d.,]+P99:\s+[\s\d.,]+Avg:\s+([\d.]+)
Time_to_First_Token-TTFT_P50 = Time to First Token-TTFT\(ms\),\s+P50:\s+([\d.]+)
Time_to_First_Token-TTFT_P90 = Time to First Token-TTFT\(ms\),\s+P50:[\s\d.,]+P90:\s+([\d.]+)
Time_to_First_Token-TTFT_P99 = Time to First Token-TTFT\(ms\),\s+P50:[\s\d.,]+P90:\s+[\s\d.,]+P99:\s+([\d.]+)
Time_to_First_Token-TTFT_Avg = Time to First Token-TTFT\(ms\),\s+P50:[\s\d.,]+P90:\s+[\s\d.,]+P99:\s+[\s\d.,]+Avg:\s+([\d.]+)
Time_Per_Output_Token-TPOT_P50 = Time Per Output Token-TPOT\(ms\),\s+P50:\s+([\d.]+)
Time_Per_Output_Token-TPOT_P90 = Time Per Output Token-TPOT\(ms\),\s+P50:[\s\d.,]+P90:\s+([\d.]+)
Time_Per_Output_Token-TPOT_P99 = Time Per Output Token-TPOT\(ms\),\s+P50:[\s\d.,]+P90:\s+[\s\d.,]+P99:\s+([\d.]+)
Time_Per_Output_Token-TPOT_Avg = Time Per Output Token-TPOT\(ms\),\s+P50:[\s\d.,]+P90:\s+[\s\d.,]+P99:\s+[\s\d.,]+Avg:\s+([\d.]+)
Average_token_latency = Average token latency\(ms\)\s+:\s+([\d.]+)
locust_num_requests = \"num_requests\":\s+(\d+)
locust_num_failures = \"num_failures\":\s+(\d+)
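
For reference, here is a minimal sketch of how the renamed `Time_to_First_Token-TTFT_*` patterns above are applied to a stresscli summary; the sample line below is an assumption about the output format, not actual stresscli output.

```python
import re

# Hypothetical stresscli summary line; real output may differ in spacing/ordering.
sample = "Time to First Token-TTFT(ms),  P50: 210.4,  P90: 480.2,  P99: 812.9,  Avg: 265.1"

# Patterns copied from config.ini above.
TTFT_P50 = r"Time to First Token-TTFT\(ms\),\s+P50:\s+([\d.]+)"
TTFT_P90 = r"Time to First Token-TTFT\(ms\),\s+P50:[\s\d.,]+P90:\s+([\d.]+)"

p50 = re.search(TTFT_P50, sample)
p90 = re.search(TTFT_P90, sample)
print(p50.group(1) if p50 else None)  # "210.4"
print(p90.group(1) if p90 else None)  # "480.2"
```
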
22 changes: 21 additions & 1 deletion evals/benchmark/stresscli/locust/aistress.py
@@ -120,12 +120,16 @@ def bench_main(self):
"faqgenfixed",
"faqgenbench",
]
if self.environment.parsed_options.bench_target in ["faqgenfixed", "faqgenbench"]:
req_params = {"data": reqData}
else:
req_params = {"json": reqData}
test_start_time = time.time()
try:
start_ts = time.perf_counter()
with self.client.post(
url,
json=reqData,
**req_params,
stream=True if self.environment.parsed_options.bench_target in streaming_bench_target else False,
catch_response=True,
timeout=self.environment.parsed_options.http_timeout,
@@ -169,6 +173,22 @@ def bench_main(self):
complete_response += content
except json.JSONDecodeError:
continue
elif self.environment.parsed_options.bench_target in ["faqgenfixed", "faqgenbench"]:
client = sseclient.SSEClient(resp)
for event in client.events():
if first_token_ts is None:
first_token_ts = time.perf_counter()
try:
data = json.loads(event.data)
for op in data["ops"]:
if op["path"] == "/logs/HuggingFaceEndpoint/final_output":
generations = op["value"].get("generations", [])
for generation in generations:
for item in generation:
text = item.get("text", "")
complete_response += text
except json.JSONDecodeError:
continue
else:
client = sseclient.SSEClient(resp)
for event in client.events():
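
The `req_params` switch added above matters because, with a requests-style HTTP client (locust's client follows the same keyword convention), `data=` form-encodes a dict while `json=` serializes it as a JSON body. A minimal sketch with plain `requests` and a hypothetical endpoint:

```python
import requests

# Hypothetical endpoint and payload, for illustration only.
url = "http://localhost:8888/v1/faqgen"
reqData = {"messages": "What is OPEA?", "max_tokens": 128}

# data=  -> Content-Type: application/x-www-form-urlencoded (used for faqgenfixed/faqgenbench)
resp_form = requests.post(url, data=reqData, stream=True, timeout=120)

# json=  -> Content-Type: application/json (used for all other bench targets)
resp_json = requests.post(url, json=reqData, stream=True, timeout=120)
```
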
11 changes: 5 additions & 6 deletions evals/benchmark/stresscli/locust/faqgenfixed.py
@@ -9,12 +9,11 @@ def getUrl():


def getReqData():
# return {
# "inputs": "What is the revenue of Nike in last 10 years before 2023? Give me detail",
# "parameters": {"max_new_tokens": 128, "do_sample": True},
# }
# return {"query": "What is the revenue of Nike in last 10 years before 2023? Give me detail", "max_tokens": 128}
return {"messages": "What is the revenue of Nike in last 10 years before 2023? Give me detail", "max_tokens": 128}
return {
"messages": "Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E6.",
"max_tokens": 128,
"top_k": 1,
}


def respStatics(environment, reqData, respData):
2 changes: 1 addition & 1 deletion evals/benchmark/stresscli/locust/tokenresponse.py
@@ -15,7 +15,7 @@ def testFunc():

def respStatics(environment, req, resp):
tokenizer = transformers.AutoTokenizer.from_pretrained(environment.parsed_options.llm_model)
if environment.parsed_options.bench_target in ["chatqnafixed", "chatqnabench"]:
if environment.parsed_options.bench_target in ["chatqnafixed", "chatqnabench", "faqgenfixed", "faqgenbench"]:
num_token_input_prompt = len(tokenizer.encode(req["messages"]))
elif environment.parsed_options.bench_target in ["llmfixed"]:
num_token_input_prompt = len(tokenizer.encode(req["query"]))
72 changes: 64 additions & 8 deletions evals/evaluation/agent_eval/crag_eval/README.md
@@ -46,29 +46,29 @@ cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/preprocess_data
bash run_data_preprocess.sh
```
**Note**: This is an example of data processing. You can develop and optimize your own data processing for this benchmark.
3. Sample queries for benchmark
3. (Optional) Sample queries for benchmark
The CRAG dataset has more than 4000 queries, and running all of them can be very expensive and time-consuming. You can sample a subset for benchmark. Here we provide a script to sample up to 5 queries per question_type per dynamism in each domain. For example, we were able to get 92 queries from the music domain using the script.
```
bash run_sample_data.sh
```
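
As a rough sketch of the sampling logic described in step 3 (up to 5 queries per question_type per dynamism per domain): the column names and file paths below are assumptions, and `run_sample_data.sh` remains the reference implementation.

```python
import pandas as pd

# Hypothetical input file and column names; the actual sampling is done by
# run_sample_data.sh.
df = pd.read_json("crag_queries.jsonl", lines=True)

# Sample up to 5 queries per (domain, question_type, dynamism) group.
sampled = (
    df.groupby(["domain", "question_type", "static_or_dynamic"], group_keys=False)
      .apply(lambda g: g.sample(n=min(len(g), 5), random_state=0))
)
sampled.to_json("crag_queries_sampled.jsonl", orient="records", lines=True)
```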

## Launch agent QnA system
Here we showcase a RAG agent in GenAIExample repo. Please refer to the README in the [AgentQnA example](https://github.com/opea-project/GenAIExamples/tree/main/AgentQnA/README.md) for more details.
Here we showcase an agent system from the OPEA GenAIExamples repo. Please refer to the README in the [AgentQnA example](https://github.com/opea-project/GenAIExamples/tree/main/AgentQnA/README.md) for more details.

> **Please note**: This is an example. You can build your own agent systems using OPEA components, then expose your own systems as an endpoint for this benchmark.
To launch the agent in our AgentQnA example, open another terminal and build images and launch agent system there.
To launch the agent in our AgentQnA example on Intel Gaudi accelerators, open another terminal and follow the instructions below.
1. Build images
```
export $WORKDIR=<your-work-directory>
cd $WORKDIR
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/AgentQnA/tests/
bash 1_build_images.sh
bash step1_build_images.sh
```
2. Start retrieval tool
```
bash 2_start_retrieval_tool.sh
bash step2_start_retrieval_tool.sh
```
3. Ingest data into vector database and validate retrieval tool
```
@@ -86,19 +86,21 @@ python3 index_data.py --host_ip $host_ip --filedir ${WORKDIR}/datasets/crag_docs
```
# Go to the terminal where you launched the AgentQnA example
cd $WORKDIR/GenAIExamples/AgentQnA/tests/
bash 4_launch_and_validate_agent.sh
bash step4_launch_and_validate_agent_gaudi.sh
```
Note: There are two agents in the agent system: a RAG agent (as the worker agent) and a ReAct agent (as the supervisor agent). We can evaluate both agents; just specify the agent endpoint URL in the scripts (see the instructions below).

## Run CRAG benchmark
Once you have your agent system up and running, the next step is to generate answers with agent. Change the variables in the script below and run the script. By default, it will run a sampled set of queries in music domain.
Once you have your agent system up and running, the next step is to generate answers with the agent. Change the variables in the script below and run it. By default, it runs the entire set of queries in the music domain (373 queries in total). You can choose to run other domains or just a sampled subset of the music domain.
```
# Come back to the interactive crag-eval docker container
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark
# Remember to specify the agent endpoint url in the script.
bash run_generate_answer.sh
```

## Use LLM-as-judge to grade the answers
1. Launch llm endpoint with HF TGI: in another terminal, run the command below. By default, `meta-llama/Meta-Llama-3-70B-Instruct` is used as the LLM judge.
1. Launch llm endpoint with HF TGI: in another terminal, run the command below. By default, `meta-llama/Meta-Llama-3.1-70B-Instruct` is used as the LLM judge.
```
cd llm_judge
bash launch_llm_judge_endpoint.sh
Expand All @@ -123,3 +125,57 @@ python3 test_llm_endpoint.py
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark/
bash run_grading.sh
```

### Validation of LLM-as-judge
We validated RAGAS answer correctness as the metric to evaluate agents. We sampled 92 queries from the full music domain dataset (up to 5 questions per sub-category for all 32 sub-categories), and conducted human evaluations on the answers to these 92 queries produced by conventional RAG, the single RAG agent, and the hierarchical ReAct agent.

We followed the criteria in the [CRAG paper](https://arxiv.org/pdf/2406.04744) to get human scores:
1. Score 1 if the answer matches the golden answer or is semantically similar.
2. Score 0 if the answer misses information, is "I don't know" or "I'm sorry I can't find ...", is a system error such as hitting the recursion limit, or is a request from the system to clarify the original question.
3. Score -1 if the answer contains incorrect information.

On the other hand, the RAGAS `answer_correctness` score is on a scale of 0-1 and is a weighted average of 1) an F1 score and 2) the semantic similarity between the answer and the golden answer. The F1 score is based on the number of statements in the answer that are or are not supported by the golden answer, and the number of statements in the golden answer that do or do not appear in the answer. Please refer to the [RAGAS source code](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py) for the implementation of its `answer_correctness` score. We ran RAGAS on Intel Gaudi2 accelerators, using `meta-llama/Meta-Llama-3.1-70B-Instruct` as the LLM judge.
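
A minimal sketch of scoring answers with RAGAS `answer_correctness` is shown below; it is not the project's grading script (`run_grading.sh` is the reference). The judge endpoint URL, embedding model, and column names are assumptions, and the exact wiring may vary across RAGAS versions.

```python
from datasets import Dataset
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI  # requirements.txt adds langchain-openai
from ragas import evaluate
from ragas.metrics import answer_correctness

# Hypothetical data; in the benchmark these come from the agent answers and
# the CRAG golden answers.
ds = Dataset.from_dict({
    "question": ["Who wrote the song 'Imagine'?"],
    "answer": ["John Lennon wrote 'Imagine'."],
    "ground_truth": ["John Lennon"],
})

# Assumed: the TGI judge exposes an OpenAI-compatible endpoint at this URL.
judge = ChatOpenAI(
    base_url="http://localhost:8085/v1",
    api_key="EMPTY",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    temperature=0,
)
emb = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")  # assumed embedding model

scores = evaluate(ds, metrics=[answer_correctness], llm=judge, embeddings=emb)
print(scores)
```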

|Setup |Mean Human score|Mean RAGAS `answer_correctness` score|
|----------------|-----------|------------------------------|
|Conventional RAG|0.05 |0.37|
|Single RAG agent|0.18 |0.43|
|Hierarchical ReAct agent|0.22|0.54|

We can see that the human scores and the RAGAS `answer_correctness` scores follow the same trend, although the two used different grading criteria and scales. Since LLM-as-judge is more scalable for larger datasets, we decided to use RAGAS `answer_correctness` scores (produced by `meta-llama/Meta-Llama-3-70B-Instruct` as the LLM judge) for the evaluation of OPEA agents on the full CRAG music domain dataset.

We have made available our scripts to calculate the mean RAGAS scores. Refer to the `run_compare_scores.sh` script in the `run_benchmark` folder.


## Benchmark results for OPEA RAG Agents
We evaluated the agents (`rag_agent_llama` and `react_llama` strategies) in the OPEA AgentQnA example on the CRAG music domain dataset (373 questions in total). We used `meta-llama/Meta-Llama-3.1-70B-Instruct` served with tgi-gaudi on 4 Intel Gaudi2 accelerator cards. Refer to the docker compose yaml files in the AgentQnA example for more details on the configurations.

For the conventional RAG tests, we used the `run_conv_rag.sh` script in the `run_benchmark` folder, with the same LLM, serving configs, and generation parameters as the RAG agent.

The Conventional RAG and Single RAG agent use the same retriever. The Hierarchical ReAct agent uses the Single RAG agent as its retrieval tool and also has access to CRAG APIs provided by Meta as part of the CRAG benchmark.


|Setup |Mean RAGAS `answer_correctness` score|
|----------------|------------------------------|
|Conventional RAG|0.42|
|Single RAG agent|0.43|
|Hierarchical ReAct agent|0.53|

From the results, we can see that the single RAG agent performs better than conventional RAG, while the hierarchical ReAct agent has the highest `answer_correctness` score. The reasons for these performance improvements are:
1. The RAG agent rewrites the query and checks the quality of the retrieved documents before feeding them to generation, so it obtains documents that are more relevant for answering. It can also decompose complex questions into modular tasks, retrieve related documents for each task, and then aggregate the information to produce an answer.
2. The hierarchical ReAct agent was supplied with APIs to get information from knowledge graphs, so it can supplement the knowledge in the retrieval vector database. It can therefore answer questions that conventional RAG or the single RAG agent cannot, due to the lack of relevant information in the vector database.

Note: The performance result for the hierarchical ReAct agent was obtained with tool selection, i.e., only a subset of tools is given to the agent based on the query, which we found can boost agent performance when the number of tools is large. However, OPEA agents do not support tool selection yet; we are in the process of enabling it.
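
As a purely illustrative sketch of the tool selection idea described above (give the agent only the top-k tools whose descriptions best match the query), using `sentence_transformers` from the crag-eval requirements; the tool names and embedding model are assumptions.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical tool registry; OPEA agents do not ship tool selection yet.
tools = {
    "search_artist": "Look up information about a music artist.",
    "get_song_release_date": "Return the release date of a song.",
    "get_grammy_awards": "Return Grammy awards won by an artist or album.",
}

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # assumed embedding model
query = "How many Grammy awards did Taylor Swift win before 2020?"

# Rank tool descriptions by cosine similarity to the query and keep the top-k.
q_emb = model.encode(query, convert_to_tensor=True)
t_emb = model.encode(list(tools.values()), convert_to_tensor=True)
scores = util.cos_sim(q_emb, t_emb)[0]

top_k = 2
selected = [list(tools)[int(i)] for i in scores.argsort(descending=True)[:top_k]]
print(selected)  # e.g., ['get_grammy_awards', 'search_artist']
```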

### Comparison with GPT-4o-mini
Open-source LLM serving libraries (TGI and vLLM) have limited capabilities for producing tool-call objects. Although vLLM recently improved its tool-calling capabilities, parallel tool calling is still not well supported. Therefore, we wrote our own prompts and output parsers for the `rag_agent_llama` and `react_llama` strategies, so that open-source LLMs served with open-source serving frameworks can be used for OPEA agent microservices.

Below we compare `meta-llama/Meta-Llama-3.1-70B-Instruct` with OpenAI's `gpt-4o-mini-2024-07-18` on 20 sampled queries from the CRAG music domain dataset, using the human evaluation criteria outlined above. The numbers are the average scores graded by humans. The parentheses denote the OPEA agent strategy used.

|Setup|Llama3.1-70B-Instruct|gpt-4o-mini|
|-----|---------------------|-----------|
|Conventional RAG|0.15|0.05|
|Single RAG agent|0.45 (`rag_agent_llama`)|0.65 (`rag_agent`)|
|Hierarchical ReAct agent|0.55 (`react_llama`)|0.75 (`react_langgraph`)|

From the comparisons on this small subset, we can see that OPEA agents using `meta-llama/Meta-Llama-3.1-70B-Instruct` with calibrated prompt templates and output parsers are only slightly behind agents using `gpt-4o-mini-2024-07-18` with its proprietary tool-calling capabilities.
3 changes: 2 additions & 1 deletion evals/evaluation/agent_eval/crag_eval/docker/Dockerfile
@@ -10,7 +10,8 @@ RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
git \
poppler-utils \
libmkl-dev \
curl
curl \
nano

COPY requirements.txt /home/user/requirements.txt

3 changes: 2 additions & 1 deletion evals/evaluation/agent_eval/crag_eval/docker/build_image.sh
@@ -4,8 +4,9 @@
dockerfile=Dockerfile

docker build \
--no-cache \
-f ${dockerfile} . \
-t crag-eval:latest \
-t crag-eval:v1.1 \
--network=host \
--build-arg http_proxy=${http_proxy} \
--build-arg https_proxy=${https_proxy} \
@@ -4,4 +4,4 @@
volume=$WORKDIR
host_ip=$(hostname -I | awk '{print $1}')

docker run -it -v $volume:/home/user/ -e WORKDIR=/home/user -e HF_HOME=/home/user/hf_cache -e host_ip=$host_ip -e http_proxy=$http_proxy -e https_proxy=$https_proxy crag-eval:latest
docker run -it --name crag_eval -v $volume:/home/user/ -e WORKDIR=/home/user -e HF_HOME=/home/user/hf_cache -e host_ip=$host_ip -e http_proxy=$http_proxy -e https_proxy=$https_proxy crag-eval:v1.1
2 changes: 2 additions & 0 deletions evals/evaluation/agent_eval/crag_eval/docker/requirements.txt
@@ -3,6 +3,8 @@ evaluate
jieba
langchain-community
langchain-huggingface
langchain-openai
nltk
pandas
ragas
sentence_transformers