-
Notifications
You must be signed in to change notification settings - Fork 420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(llmobs): implement answer relevancy ragas metric #11738
Conversation
…dd-trace-py into evan.li/remaining-ragas-metrics
…dd-trace-py into evan.li/answer-relevancy
|
Datadog ReportBranch report: ✅ 0 Failed, 558 Passed, 910 Skipped, 7m 20.85s Total duration (28m 50.51s time saved) |
BenchmarksBenchmark execution time: 2024-12-16 16:02:25 Comparing candidate commit 812a162 in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 394 metrics, 2 unstable metrics. |
Raises: NotImplementedError if the ragas library is not found or if ragas version is not supported. | ||
""" | ||
super().__init__(llmobs_service) | ||
self.ragas_answer_relevancy_instance = self._get_answer_relevancy_instance() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dumb question - what is the purpose of having a different LLM instance per eval metric runner? Are they not all references to the same base OpenAI() LLM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each eval metric runner has access to one instance of a ragas metric (answer_relevancy, context_precision, faithfulness) each of these ragas metrics have a seperate llm
attribute. We maintain a reference to the ragas metric, not the llm attribute
but yea, the default is openai for all of them
question = cp_inputs["question"] | ||
contexts = cp_inputs["contexts"] | ||
answer = cp_inputs["answer"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before, doesn't seem necessary to explicitly separate into new variables since they're only called once immediately at most
@@ -52,7 +54,7 @@ def __init__(self, interval: float, llmobs_service=None, evaluators=None): | |||
if evaluator in SUPPORTED_EVALUATORS: | |||
evaluator_init_state = "ok" | |||
try: | |||
self.evaluators.append(SUPPORTED_EVALUATORS[evaluator](llmobs_service=llmobs_service)) | |||
self.evaluators.append(SUPPORTED_EVALUATORS[evaluator](llmobs_service=llmobs_service)) # noqa: E501 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this fmt comment required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there was a error being raised by mypy due to the use of ABC
. I think it was a bug since it only appeared when we have 3 or more evaluators (see: python/mypy#13044). But we are not using an abstract class for BaseRagasEvaluator
anyway, so no need for this anymore
Implements answer relevancy metric for ragas integration.
About Answer Relevancy
Answer relevancy metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy. This metric is computed using the question, the retrieved contexts and the answer.
The Answer Relevancy is defined as the mean cosine similarity of the original question to a number of artificial questions, which where generated (reverse engineered) based on the response.
Example trace
Checklist
Reviewer Checklist