
Commit

[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
pre-commit-ci[bot] committed Dec 2, 2024
1 parent 8586803 commit a1fe751
Showing 8 changed files with 77 additions and 52 deletions.
14 changes: 5 additions & 9 deletions evals/evaluation/agent_eval/crag_eval/README.md
@@ -127,11 +127,11 @@ bash run_grading.sh
```

### Validation of LLM-as-judge
We validated RAGAS answer correctness as the metric to evaluate agents. We sampled 92 queries from the full music domain dataset (up to 5 questions per sub-category for all 32 sub-categories), and conducted human evaluations on the conventional RAG answers, the single RAG agent answers and the hierachical ReAct agent answers of the 92 queries.
We validated RAGAS answer correctness as the metric to evaluate agents. We sampled 92 queries from the full music domain dataset (up to 5 questions per sub-category for all 32 sub-categories), and conducted human evaluations on the conventional RAG answers, the single RAG agent answers and the hierarchical ReAct agent answers of the 92 queries.

We followed the criteria in the [CRAG paper](https://arxiv.org/pdf/2406.04744) to get human scores:
1. score 1 if the answer matches the golden answer or is semantically similar.
2. score 0 if the asnwer misses information, or is "I don't know", “I’m sorry I can’t find ...”, a system error such as recursion limit is hit, or a request from the system to clarify the original question.
2. score 0 if the answer misses information, or is "I don't know", “I’m sorry I can’t find ...”, a system error such as recursion limit is hit, or a request from the system to clarify the original question.
3. score -1 if the answer contains incorrect information.

On the other hand, the RAGAS `answer_correctness` score is on a scale of 0-1 and is a weighted average of 1) an F1 score and 2) the semantic similarity between the answer and the golden answer. The F1 score is based on the number of statements in the answer that are or are not supported by the golden answer, and the number of statements in the golden answer that do or do not appear in the answer. Please refer to the [RAGAS source code](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py) for the implementation of its `answer_correctness` score. We ran RAGAS on Intel Gaudi2 accelerators and used `meta-llama/Meta-Llama-3.1-70B-Instruct` as the LLM judge.
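For reference, below is a minimal sketch of how an `answer_correctness` run with RAGAS might look. It assumes a recent RAGAS release, the `question`/`answer`/`ground_truth` column names, an OpenAI-compatible endpoint serving the judge model, and an arbitrarily chosen embedding model; exact argument and column names vary between RAGAS versions, and the `evals/metrics/ragas/ragas.py` wrapper touched later in this commit wraps a similar call for the OPEA evaluation scripts.

```python
from datasets import Dataset
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import answer_correctness

# Judge LLM behind a hypothetical OpenAI-compatible endpoint (e.g., served by vLLM or TGI).
judge_llm = ChatOpenAI(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    openai_api_base="http://localhost:8085/v1",
    openai_api_key="EMPTY",
)
# Embedding model used for the similarity part of answer_correctness.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

# One toy record: the agent's answer is compared against the golden answer.
dataset = Dataset.from_dict(
    {
        "question": ["Who composed the soundtrack of Interstellar?"],
        "answer": ["The Interstellar soundtrack was composed by Hans Zimmer."],
        "ground_truth": ["Hans Zimmer"],
    }
)

# answer_correctness combines a statement-level F1 score with the embedding
# similarity between answer and golden answer (the weights are configurable).
result = evaluate(dataset, metrics=[answer_correctness], llm=judge_llm, embeddings=embeddings)
print(result.to_pandas()[["question", "answer_correctness"]])
```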
@@ -140,7 +140,7 @@ On the other hand, RAGAS `answer_correctness` score is on a scale of 0-1 and is
|----------------|-----------|------------------------------|
|Conventional RAG|0.05 |0.37|
|Single RAG agent|0.18 |0.43|
|Hierachical ReAct agent|0.22|0.54|
|Hierarchical ReAct agent|0.22|0.54|

We can see that the human scores and the RAGAS `answer_correctness` scores follow the same trend, even though the two scoring methods use different grading criteria and scales. Since LLM-as-judge is more scalable for larger datasets, we decided to use RAGAS `answer_correctness` scores (produced by `meta-llama/Meta-Llama-3-70B-Instruct` as the LLM judge) to evaluate OPEA agents on the full CRAG music domain dataset.

@@ -159,7 +159,7 @@ The Conventional RAG and Single RAG agent use the same retriever. The Hierarchic
|----------------|------------------------------|
|Conventional RAG|0.42|
|Single RAG agent|0.43|
|Hierachical ReAct agent|0.53|
|Hierarchical ReAct agent|0.53|

From the results, we can see that the single RAG agent performs better than conventional RAG, while the hierarchical ReAct agent has the highest `answer_correctness` score. The reasons for these performance improvements:
1. The RAG agent rewrites the query and checks the quality of the retrieved documents before feeding them to generation, so the documents used for answer generation are more relevant. It can also decompose complex questions into modular tasks, retrieve related documents for each task, and then aggregate the information to come up with an answer.
@@ -170,7 +170,7 @@ Note: The performance result for the hierarchical ReAct agent is with tool selec
### Comparison with GPT-4o-mini
Open-source LLM serving libraries (TGI and vLLM) have limited capabilities for producing tool-call objects. Although vLLM has recently improved its tool-calling support, parallel tool calling is still not well supported. Therefore, we had to write our own prompts and output parsers for the `rag_agent_llama` and `react_llama` strategies, so that OPEA agent microservices can use open-source LLMs served with open-source serving frameworks.
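As a rough illustration of what such a custom output parser has to do (this is a hedged sketch, not the actual `rag_agent_llama`/`react_llama` code; the `{"tool": ..., "args": ...}` format and the `parse_tool_call` helper are conventions made up for this example), the parser looks for a JSON tool-call object in the raw completion and falls back to treating the text as a final answer:

```python
import json
import re
from typing import Optional


def parse_tool_call(llm_output: str) -> Optional[dict]:
    """Extract a {"tool": ..., "args": {...}} object from raw LLM text, if present."""
    # Grab the span from the first "{" to the last "}"; assumes the prompt asked the
    # model to emit a single JSON object whenever it wants to call a tool.
    match = re.search(r"\{.*\}", llm_output, re.DOTALL)
    if not match:
        return None
    try:
        candidate = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if isinstance(candidate, dict) and "tool" in candidate and "args" in candidate:
        return candidate
    return None  # no well-formed tool call -> treat the output as a final answer


# A Llama-style completion that requests a knowledge-base lookup:
raw = 'Thought: I need more context.\n{"tool": "search_knowledge_base", "args": {"query": "Grammy 2023 winners"}}'
print(parse_tool_call(raw))                            # parsed tool-call dict
print(parse_tool_call("The answer is Taylor Swift."))  # None -> final answer
```

In a ReAct-style loop, the parsed tool name and arguments would then be dispatched to the corresponding tool, and the observation fed back to the model on the next turn.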

Below we show the comparisons of `meta-llama/Meta-Llama-3.1-70B-Instruct` versus OpenAI's `gpt-4o-mini-2024-07-18` on 20 sampled queries from the CRAG music domain dataset. We used human evaluation criteria outlined above. The numbers are the average scores graged by human. The parathesis denotes the OPEA agent strategy used.
Below we show the comparisons of `meta-llama/Meta-Llama-3.1-70B-Instruct` versus OpenAI's `gpt-4o-mini-2024-07-18` on 20 sampled queries from the CRAG music domain dataset. We used the human evaluation criteria outlined above. The numbers are the average scores graded by humans. The parentheses denote the OPEA agent strategy used.

|Setup|Llama3.1-70B-Instruct|gpt-4o-mini|
|-----|---------------------|-----------|
@@ -179,7 +179,3 @@ Below we show the comparisons of `meta-llama/Meta-Llama-3.1-70B-Instruct` versus
|Hierarchical ReAct agent|0.55 (`react_llama`)|0.75 (`react_langgraph`)|

From the comparisons on this small subset, we can see that OPEA agents using `meta-llama/Meta-Llama-3.1-70B-Instruct` with calibrated prompt templates and output parsers are only slightly behind `gpt-4o-mini-2024-07-18` with proprietary tool-calling capabilities.




@@ -4,7 +4,7 @@ jieba
langchain-community
langchain-huggingface
langchain-openai
nltk
pandas
ragas
sentence_transformers
@@ -1,7 +1,11 @@
import pandas as pd
from scipy.stats import spearmanr, pearsonr
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse

import pandas as pd
from scipy.stats import pearsonr, spearmanr


def get_args():
parser = argparse.ArgumentParser()
@@ -12,10 +16,11 @@ def get_args():
parser.add_argument("--human_scores_file", type=str, help="file with human scores for 3 setups")
return parser.parse_args()


def merge_and_get_stats(filedir, conv_rag, ragagent, reactagent, prefix=""):
conv_rag_df = pd.read_csv(filedir+conv_rag)
ragagent_df = pd.read_csv(filedir+ragagent)
reactagent_df = pd.read_csv(filedir+reactagent)
conv_rag_df = pd.read_csv(filedir + conv_rag)
ragagent_df = pd.read_csv(filedir + ragagent)
reactagent_df = pd.read_csv(filedir + reactagent)

conv_rag_df = conv_rag_df.rename(columns={"answer_correctness": "conv_rag_score"})
ragagent_df = ragagent_df.rename(columns={"answer_correctness": "ragagent_score"})
@@ -24,7 +29,7 @@ def merge_and_get_stats(filedir, conv_rag, ragagent, reactagent, prefix=""):
merged_df = pd.merge(merged_df, reactagent_df, on="query")
print(merged_df.shape)
print(merged_df.describe())
merged_df.to_csv(filedir+prefix+"merged_scores.csv", index=False)
merged_df.to_csv(filedir + prefix + "merged_scores.csv", index=False)

# drop rows with nan
merged_df_dropped = merged_df.dropna()
@@ -33,7 +38,7 @@ def merge_and_get_stats(filedir, conv_rag, ragagent, reactagent, prefix=""):

# compare scores
print(merged_df_dropped.describe())
merged_df_dropped.to_csv(filedir+prefix+"merged_scores_nadropped.csv", index=False)
merged_df_dropped.to_csv(filedir + prefix + "merged_scores_nadropped.csv", index=False)
return merged_df, merged_df_dropped


@@ -45,22 +50,36 @@ def merge_and_get_stats(filedir, conv_rag, ragagent, reactagent, prefix=""):
reactagent = args.reactagent
human_scores_file = args.human_scores_file

#RAGAS scores
# RAGAS scores
print("===============RAGAS scores==================")
merged_df, merged_df_dropped = merge_and_get_stats(filedir, conv_rag, ragagent, reactagent)

# human scores
print("===============Human scores==================")
human_scores_df = pd.read_csv(filedir+human_scores_file)
human_scores_df = pd.read_csv(filedir + human_scores_file)
print(human_scores_df.describe())

human_scores_df_dropped = human_scores_df.loc[human_scores_df["query"].isin(merged_df_dropped["query"])]
print(human_scores_df_dropped.describe())
human_scores_df_dropped.to_csv(filedir+"human_scores_dropped.csv", index=False)
human_scores_df_dropped.to_csv(filedir + "human_scores_dropped.csv", index=False)

# concat conv_rag, ragagent, reactagent scores in merged_df_dropped
ragas_scores = pd.concat([merged_df_dropped["conv_rag_score"], merged_df_dropped["ragagent_score"], merged_df_dropped["reactagent_score"]], axis=0)
human_scores = pd.concat([human_scores_df_dropped["conv_rag"], human_scores_df_dropped["ragagent"], human_scores_df_dropped["reactagent"]], axis=0)
ragas_scores = pd.concat(
[
merged_df_dropped["conv_rag_score"],
merged_df_dropped["ragagent_score"],
merged_df_dropped["reactagent_score"],
],
axis=0,
)
human_scores = pd.concat(
[
human_scores_df_dropped["conv_rag"],
human_scores_df_dropped["ragagent"],
human_scores_df_dropped["reactagent"],
],
axis=0,
)

# calculate spearman correlation
print("===============Spearman correlation==================")
@@ -69,6 +88,3 @@ def merge_and_get_stats(filedir, conv_rag, ragagent, reactagent, prefix=""):
# pearson correlation
print("===============Pearson correlation==================")
print(pearsonr(ragas_scores, human_scores))
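# Note: scipy.stats.spearmanr and scipy.stats.pearsonr each return the correlation
# statistic together with a p-value, so the two prints above show (statistic, p-value)
# pairs; a statistic close to 1 with a small p-value means the RAGAS answer_correctness
# scores track the human scores closely.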



@@ -1,26 +1,32 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import json
import os

import pandas as pd
import requests


def get_test_dataset(args):
filepath = os.path.join(args.filedir, args.filename)
if filepath.endswith('.jsonl'):
if filepath.endswith(".jsonl"):
df = pd.read_json(filepath, lines=True, convert_dates=False)
elif filepath.endswith('.csv'):
elif filepath.endswith(".csv"):
df = pd.read_csv(filepath)
else:
raise ValueError("Invalid file format")
return df


def save_results(output_file, output_list):
with open(output_file, "w") as f:
for output in output_list:
f.write(json.dumps(output))
f.write("\n")


def save_as_csv(output):
df = pd.read_json(output, lines=True, convert_dates=False)
df.to_csv(output.replace(".jsonl", ".csv"), index=False)
@@ -62,6 +68,7 @@ def search_knowledge_base(query: str) -> str:
else:
return "Error parsing response from the knowledge base."


PROMPT = """\
### You are a helpful, respectful and honest assistant.
You are given a Question and the time when it was asked in the Pacific Time Zone (PT), referred to as "Query
@@ -78,8 +85,10 @@ def search_knowledge_base(query: str) -> str:
### Answer:
"""


def setup_chat_model(args):
from langchain_openai import ChatOpenAI

params = {
"temperature": args.temperature,
"max_tokens": args.max_new_tokens,
@@ -95,10 +104,11 @@ def setup_chat_model(args):
)
return llm


def generate_answer(llm, query, context, time):
prompt = PROMPT.format(context=context, question=query, time=time)
response = llm.invoke(prompt)
return response.content


if __name__ == "__main__":
@@ -130,7 +140,7 @@ def generate_answer(llm, query, context, time):
print("========== Query: ", q)
context = search_knowledge_base(q)
print("========== Context:\n", context)
answer = generate_answer(llm, q, context, t)
print("========== Answer:\n", answer)
contexts.append(context)
output_list.append(
@@ -146,6 +156,3 @@ def generate_answer(llm, query, context, time):
save_results(args.output, output_list)

save_as_csv(args.output)



@@ -79,14 +79,14 @@ def grade_answers(args, test_case):
# print(test_case)

scores = grade_answers(args, test_case)
#print(scores)
# print(scores)

# save the scores
if args.batch_grade:
print("Aggregated answer correctness score: ", scores)
else:
data["answer_correctness"] = scores
output_file = args.filename.replace(".csv", "_graded.csv")
data.to_csv(os.path.join(args.filedir, output_file), index=False)
print("Scores saved to ", os.path.join(args.filedir, output_file))

@@ -1,3 +1,6 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

filedir=$WORKDIR/datasets/crag_results/
conv_rag="conv_rag_graded.csv" # replace with your file name
ragagent="ragagent_graded.csv" # replace with your file name
@@ -9,4 +12,4 @@ python3 compare_scores.py \
--conv_rag $conv_rag \
--ragagent $ragagent \
--reactagent $reactagent \
--human_scores_file $human_scores_file
@@ -1,3 +1,6 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

MODEL="meta-llama/Meta-Llama-3.1-70B-Instruct"
LLMENDPOINT=http://${host_ip}:8085

@@ -12,4 +15,4 @@ python3 conventional_rag.py \
--llm_endpoint_url ${LLMENDPOINT} \
--filedir ${FILEDIR} \
--filename ${FILENAME} \
--output ${OUTPUT}
32 changes: 16 additions & 16 deletions evals/metrics/ragas/ragas.py
@@ -16,13 +16,13 @@
# from ragas.metrics import *
from ragas import evaluate
from ragas.metrics import (
answer_correctness,
answer_relevancy,
answer_similarity,
context_precision,
context_recall,
faithfulness,
)
except ModuleNotFoundError:
raise ModuleNotFoundError("Please install ragas to use this metric. `pip install ragas`.")

@@ -41,13 +41,13 @@
]

metrics_mapping = {
"answer_correctness": answer_correctness,
"answer_relevancy": answer_relevancy,
"answer_similarity": answer_similarity,
"context_precision": context_precision,
"context_recall": context_recall,
"faithfulness": faithfulness,
}
"answer_correctness": answer_correctness,
"answer_relevancy": answer_relevancy,
"answer_similarity": answer_similarity,
"context_precision": context_precision,
"context_recall": context_recall,
"faithfulness": faithfulness,
}


def format_ragas_metric_name(name: str):
@@ -82,7 +82,7 @@ def __init__(
else:
print("Accepting user-initialized model as we could not detect OpenAI key or HuggingFace Endpoint URL.")
self.chat_model = self.model

if self.metrics is not None:
tmp_metrics = []
# check supported list
@@ -131,7 +131,7 @@ def __init__(
async def a_measure(self, test_case: Dict):
return self.measure(test_case)

def measure(self, test_case: Dict):
# get only necessary columns from test case
data = {column: test_case[column] for column in self._required_columns}
dataset = Dataset.from_dict(data)
