Showing 5 changed files with 327 additions and 23 deletions.
@@ -1,24 +1,34 @@
 module.exports = {
   tutorials: [
     "tutorial-introduction",
     "tutorial-setup",
+    {
+      type: "category",
+      label: "Legal Document Summarizer",
+      items: [
+        "legal-doc-summarizer-introduction",
+        "legal-doc-summarizer-defining-a-summarization-criteria",
+        "legal-doc-summarizer-selecting-your-metrics",
+        "legal-doc-summarizer-running-an-evaluation",
+      ],
+      collapsed: false,
+    },
     {
       type: "category",
       label: "Medical Chatbot",
       items: [
         "tutorial-llm-application-example",
         "tutorial-metrics-defining-an-evaluation-criteria",
         "tutorial-metrics-selection",
         "tutorial-evaluations-running-an-evaluation",
         "tutorial-evaluations-hyperparameters",
         "tutorial-evaluations-catching-regressions",
         "tutorial-dataset-synthesis",
         "tutorial-dataset-confident",
         "tutorial-production-monitoring",
-        "tutorial-production-evaluation"
+        "tutorial-production-evaluation",
       ],
       collapsed: false,
     },
   ],
 };
80 changes: 80 additions & 0 deletions
docs/tutorials/legal-doc-summarizer-defining-a-summarization-criteria.mdx
@@ -0,0 +1,80 @@

---
id: legal-doc-summarizer-defining-a-summarization-criteria
title: Defining an Evaluation Criteria for Summarization
sidebar_label: Define an Evaluation Criteria
---

Before selecting your metrics, you'll first need to **define your evaluation criteria**. In other words, identify what aspects of the summaries generated by your LLM matter to you—what makes a summary bad and what makes it good. This will shape the criteria you use to evaluate your LLM.

:::tip
A well-defined evaluation criterion makes it _easier_ to choose the **right metrics** for assessing your LLM summarizer.
:::

For example, if clarity is a priority when summarizing lengthy and complex legal documents, you should choose a metric like conciseness, which assesses how easily the summaries can be understood.

## Generating Dummy Summaries

If you don't already have evaluation criteria in mind, generating summaries from a few randomly selected documents can help you identify which aspects matter most to you. For example, consider this service agreement contract (approximately three pages), which has been shortened for the sake of this example:

```python
document_content = """
CONTRACT FOR SERVICES
This Service Agreement ("Agreement") is entered into on January 28, 2025, by and between Acme Solutions, Inc. ("Provider"), a corporation registered in Delaware, and BetaCorp LLC ("Client"), a limited liability company registered in California.
1. SERVICES: Provider agrees to perform software development and consulting services for Client as outlined in Exhibit A. Services will commence on February 1, 2025, and are expected to conclude by August 1, 2025, unless extended in writing.
2. COMPENSATION: Client shall pay Provider a fixed fee of $50,000, payable in five equal installments of $10,000 due on the first of each month starting February 1, 2025.
...
...
...
Signed,
Acme Solutions, Inc.
BetaCorp LLC
"""
```

Let's run the following code to generate the summary:

```python
# replace llm with your LLM summarizer
summary = llm.summarize(document_content)
print(summary)
```

This yields the following results:

```
This agreement establishes a business relationship between Acme Solutions, Inc.,
a Delaware corporation, and BetaCorp LLC, a company based in California.
It specifies that Acme Solutions will provide software development and consulting
services to BetaCorp for a defined period, beginning February 1, 2025, and
potentially ending on August 1, 2025. The document includes details about the
responsibilities of each party, confidentiality obligations, and the termination
process, which requires 30 days' written notice. Additionally, it states that
California law will govern the agreement. However, no details are included regarding
the payment structure.
```

Immediately, you can see that there are **two issues** with the generated summary: first, it's too lengthy. The legal document summarizer we're building is designed to help lawyers work efficiently, so keeping summaries concise is essential. Second, the summary omits the compensation details, which is a significant problem. A complete summary is crucial to ensure that lawyers don't miss any vital information in their fast-paced line of work.

:::info
**Generating even more summaries** can help reveal additional issues in your LLM summarizer, such as fluency problems (especially in non-English languages) or hallucinations that only surface in some generations.
:::

## Defining Your Evaluation Criteria

From generating a single summary, we've already identified **two key points** that matter for our LLM summarizer:

- The summary must be concise.
- The summary must be complete.

These points define our **evaluation criteria**. In practice, you'll want to test your summarizer with as many documents as possible. The more examples you run, the more patterns and gaps you'll uncover, helping you refine and build a comprehensive set of evaluation criteria.
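
To surface these patterns quickly, you can loop over a handful of sample documents and review each generated summary by hand. Below is a minimal sketch of that workflow; it assumes the same `llm.summarize()` helper used above and a `documents` list that you have assembled yourself (both are placeholders for your own code):

```python
# Assumes `llm` is your summarizer (as in the snippet above) and `documents`
# is a list of raw document strings you've collected; both are placeholders.
for i, document in enumerate(documents, start=1):
    summary = llm.summarize(document)
    print(f"--- Summary {i} ---")
    print(summary)
    # Review each summary by hand: is it concise, complete, fluent, and faithful?
```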

:::note
Your evaluation criteria are **not set in stone**. As your LLM application moves into production, ongoing **user feedback** will be essential for refining your evaluation criteria—which ultimately matters more than your initial priorities.
:::

Next, let's explore how to go from evaluation criteria to choosing metrics.
@@ -0,0 +1,48 @@

---
id: legal-doc-summarizer-introduction
title: Introduction
sidebar_label: Introduction
---

In this tutorial, we'll go through the entire process of evaluating a legal document summarizer, from choosing your metrics to running evaluations and monitoring performance in production.

:::tip
If you're working with LLMs for summarization, this tutorial is for you. While we focus on evaluating a legal document summarizer, the concepts apply to **any LLM application that generates summaries**.
:::

We'll cover:

- How to define summarization criteria
- Selecting the right summarization metrics
- Running evaluations on your summarizer
- Iterating on your summarizer's hyperparameters
- Monitoring and evaluating LLM summarizers in production

:::note
Before we begin, make sure you're logged into Confident AI. If you haven't set up your account yet, visit the [setting up section](tutorial-setup).

```
deepeval login
```
:::

## Legal Document Summarizer

The LLM summarizer application we'll be evaluating is designed to **extract key points** from legal texts while maintaining their _original intent_. This ensures that important clauses, obligations, and legal nuances are preserved without unnecessary details or misinterpretation.

:::info
We'll be using GPT-3.5 to generate these summaries. Below is the prompt template that will guide the model:

```
You are an AI assistant tasked with summarizing legal documents
concisely and accurately. Given the following legal text, generate
a summary that captures the key points while avoiding unnecessary
details. Ensure neutrality and refrain from interpreting beyond the
provided text.
```
:::
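
The tutorial doesn't prescribe a specific implementation, but to make the later code samples concrete, here is a minimal sketch of what such a summarizer might look like. It assumes the OpenAI Python SDK and wraps the prompt template above; the `LegalDocSummarizer` class and the `llm` handle are illustrative names only (later sections simply call `llm.summarize()`):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

PROMPT_TEMPLATE = (
    "You are an AI assistant tasked with summarizing legal documents "
    "concisely and accurately. Given the following legal text, generate "
    "a summary that captures the key points while avoiding unnecessary "
    "details. Ensure neutrality and refrain from interpreting beyond the "
    "provided text."
)

class LegalDocSummarizer:
    """Illustrative wrapper around GPT-3.5 for summarizing legal documents."""

    def summarize(self, document: str) -> str:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": PROMPT_TEMPLATE},
                {"role": "user", "content": document},
            ],
        )
        return response.choices[0].message.content

llm = LegalDocSummarizer()
```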

Now that we've established the context for our summarizer, let's move on to [defining our evaluation criteria](legal-doc-summarizer-defining-a-summarization-criteria) in the next section.
98 changes: 98 additions & 0 deletions
docs/tutorials/legal-doc-summarizer-running-an-evaluation.mdx
@@ -0,0 +1,98 @@

---
id: legal-doc-summarizer-running-an-evaluation
title: Running an Evaluation
sidebar_label: Running an Evaluation
---

Before running evaluations, we need to **construct a dataset** with the documents we want to summarize and generate summaries for them using our LLM summarizer. This will allow us to apply our metrics directly to the dataset when running our evaluations.

:::caution important
You'll want to log in to Confident AI before running an evaluation to enable data persistence.

```
deepeval login
```
:::

## Constructing a Dataset

If you're building a document summarizer, you likely have a folder of PDFs ready to be processed. First, you'll want to parse these PDFs into strings that can be passed into your LLM summarizer.

```python
import os
import PyPDF2

def extract_text(pdf_path):
    """Extract text from a PDF file."""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        text = "\n".join(page.extract_text() for page in reader.pages if page.extract_text())
    return text

# Replace with your folder containing PDFs
pdf_folder = "path/to/pdf/folder"
documents = []  # List to store extracted document strings

# Iterate over PDF files in the folder
for pdf_file in os.listdir(pdf_folder):
    if pdf_file.endswith(".pdf"):
        pdf_path = os.path.join(pdf_folder, pdf_file)
        document_text = extract_text(pdf_path)
        documents.append(document_text)  # Store extracted text
```

Next, we'll call our legal document summarizer `llm.summarize()` on the extracted document texts to generate the summaries for our evaluation dataset. You should replace this function with your actual summarizer.

```python
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from some_llm_library import llm  # Replace with the actual LLM library

# Convert document strings to test cases with LLM summaries
test_cases = [LLMTestCase(input=doc, actual_output=llm.summarize(doc)) for doc in documents]

# Create the evaluation dataset
dataset = EvaluationDataset(test_cases=test_cases)
```

:::info
An `EvaluationDataset` consists of a series of test cases. Each test case contains an `input`, which represents the document we feed into the summarizer, and the `actual_output`, which is the summary generated by the LLM. [More on test cases here](/docs/evaluation-test-cases).
:::

Keep in mind that, for the sake of this tutorial, our `EvaluationDataset` consists of 5 test cases, and our first `test_case` corresponds to the service agreement we inspected when we first [defined our evaluation criteria](legal-doc-summarizer-defining-a-summarization-criteria).

```python
print(dataset.test_cases[0].input)
# CONTRACT FOR SERVICES...
print(dataset.test_cases[0].actual_output)
# This agreement establishes...
print(len(dataset.test_cases))
# 5
```

Now that the dataset is ready, we can finally begin running our first evaluation.

## Running an Evaluation

To run an evaluation, first log in to Confident AI.

```
deepeval login
```

Then, pass the metrics we defined in the [previous section](legal-doc-summarizer-selecting-your-metrics) along with the dataset we created into the `evaluate` function.

```python
from deepeval import evaluate

evaluate(dataset, metrics=[concision_metric, completeness_metric])
```

:::tip
The `evaluate` function offers flexible customization for how you want to run evaluations, allowing you to control concurrency for asynchronous operations and manage error handling. Learn more about these options [here](#).
:::
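
For instance, a customized run might look like the sketch below. Treat the keyword arguments as assumptions rather than a definitive API: flag names vary between DeepEval versions, so check the options your installed version exposes.

```python
# Illustrative only: `run_async` and `ignore_errors` are assumed flag names
# and may differ in your version of DeepEval.
evaluate(
    dataset,
    metrics=[concision_metric, completeness_metric],
    run_async=True,      # evaluate test cases concurrently
    ignore_errors=True,  # skip test cases whose metric computation fails
)
```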

## Analyzing Your Test Report

Once your evaluation is complete, you'll be redirected to a Confident AI page displaying the testing report for the five document summaries we defined and evaluated earlier. Each test case includes its status (pass or fail), input (document), and actual output (summary).
68 changes: 68 additions & 0 deletions
docs/tutorials/legal-doc-summarizer-selecting-your-metrics.mdx
@@ -0,0 +1,68 @@

---
id: legal-doc-summarizer-selecting-your-metrics
title: Selecting the Right Summarization Metrics
sidebar_label: Selecting Summarization Metrics
---

Having clear **evaluation criteria makes selecting the right summarization metrics easy**. This is because DeepEval's `GEval` metric allows you to define custom summarization metrics simply by providing your evaluation criteria. If you've followed the steps in the previous section, this is as simple as pasting in or rewording your criteria.

For example, if your goal is to generate concise summaries, your evaluation criteria might assess whether the summary remains concise while preserving all essential information from the document. Here's how you can create a custom Concision `GEval` metric:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

concision_metric = GEval(
    name="Concision",
    criteria="Assess if the actual output remains concise while preserving all essential information.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
```

:::info
When defining `GEval`, you must provide `evaluation_params`, which specify the test case parameters to evaluate. [More information here.](/docs/metrics-llm-evals)
:::

## Selecting Metrics from Evaluation Criteria

Let's now begin **selecting the metrics** for our legal document summarizer. First, we'll need to revisit the evaluation criteria we defined in the previous section:

1. The summary must be concise.
2. The summary must be complete.

The first criterion specifies that the _summary must be concise_. Fortunately, we've already defined a **Concision metric** in the section above, which we can put to use. The second criterion states that the summary must be complete, ensuring no important information is lost from the original document.

:::tip
While brief evaluation criteria (e.g., "the actual output must be concise") are acceptable, it's generally better to be more **specific** to ensure consistent metric scoring, especially given the non-deterministic nature of LLMs.
:::

### Defining a Completeness Metric

This means, instead of saying "the summary must be complete," we might specify, "the summary must retain all important information from the document." Here's how we can define a `GEval` metric for **Completeness**:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

completeness_metric = GEval(
    name="Completeness",
    criteria="Assess whether the actual output retains all key information from the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
```

:::note
`LLMTestCaseParams.INPUT` refers to your document (input to your LLM summarizer), while `LLMTestCaseParams.ACTUAL_OUTPUT` is the summary (generated output). [More information available here.](/docs/metrics-llm-evals)
:::

### Conclusion

With our two metrics for concision and completeness defined, let's begin running evaluations in the next section.

:::info Additional Tips
DeepEval offers a built-in `Summarization` metric. Other useful summarization metrics you can define with `GEval` include the following (see the sketch after this list for an example):

- **Fluency** – Ensures the summary is grammatically correct and natural.
- **Coherence** – Checks if the summary flows logically and maintains readability.
- **Hallucination** – Identifies incorrect or fabricated information.
- **Factual Consistency** – Ensures the summary accurately reflects the source content.
:::
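
For instance, a **Fluency** metric could follow the same `GEval` pattern we used for concision and completeness. This is only a sketch: the criteria wording below is illustrative, so adapt it to whatever matters most for your documents.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative example: the criteria text is an assumption, not part of the tutorial.
fluency_metric = GEval(
    name="Fluency",
    criteria="Assess whether the actual output is grammatically correct and reads naturally.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
```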