legal doc summarizer
kritinv committed Jan 29, 2025
1 parent f676815 commit 98cbe4d
Showing 5 changed files with 327 additions and 23 deletions.
56 changes: 33 additions & 23 deletions docs/sidebarTutorials.js
@@ -1,24 +1,34 @@
module.exports = {
  tutorials: [
    "tutorial-introduction",
    "tutorial-setup",
    {
      type: "category",
      label: "Medical Chatbot",
      items: [
        "tutorial-llm-application-example",
        "tutorial-metrics-defining-an-evaluation-criteria",
        "tutorial-metrics-selection",
        "tutorial-evaluations-running-an-evaluation",
        "tutorial-evaluations-hyperparameters",
        "tutorial-evaluations-catching-regressions",
        "tutorial-dataset-synthesis",
        "tutorial-dataset-confident",
        "tutorial-production-monitoring",
        "tutorial-production-evaluation"
      ],
      collapsed: false,
    },
  ],
};

  tutorials: [
    "tutorial-introduction",
    "tutorial-setup",
    {
      type: "category",
      label: "Legal Document Summarizer",
      items: [
        "legal-doc-summarizer-introduction",
        "legal-doc-summarizer-defining-a-summarization-criteria",
        "legal-doc-summarizer-selecting-your-metrics",
        "legal-doc-summarizer-running-an-evaluation",
      ],
      collapsed: false,
    },
    {
      type: "category",
      label: "Medical Chatbot",
      items: [
        "tutorial-llm-application-example",
        "tutorial-metrics-defining-an-evaluation-criteria",
        "tutorial-metrics-selection",
        "tutorial-evaluations-running-an-evaluation",
        "tutorial-evaluations-hyperparameters",
        "tutorial-evaluations-catching-regressions",
        "tutorial-dataset-synthesis",
        "tutorial-dataset-confident",
        "tutorial-production-monitoring",
        "tutorial-production-evaluation",
      ],
      collapsed: false,
    },
  ],
};
80 changes: 80 additions & 0 deletions docs/tutorials/legal-doc-summarizer-defining-a-summarization-criteria.mdx
@@ -0,0 +1,80 @@
---
id: legal-doc-summarizer-defining-a-summarization-criteria
title: Defining an Evaluation Criteria for Summarization
sidebar_label: Define an Evaluation Criteria
---

Before selecting your metrics, you'll first need to **define your evaluation criteria**. In other words, identify what aspects of the summaries generated by your LLM matter to you—what makes a summary bad and what makes it good. This will shape the criteria you use to evaluate your LLM.

:::tip
A well-defined evaluation criterion makes it _easier_ to choose the **right metrics** for assessing your LLM summarizer.
:::

For example, if clarity is a priority when summarizing lengthy and complex legal documents, you should choose metrics like conciseness, which assess how easily the summaries can be understood.

## Generating Dummy Summaries

If you don't already have evaluation criteria in mind, generating summaries from a few randomly selected documents can help you identify which aspects matter most to you. For example, consider this service agreement contract (approximately three pages in full), which has been shortened for the sake of this example:

```python
document_content = """
CONTRACT FOR SERVICES
This Service Agreement ("Agreement") is entered into on January 28, 2025, by and between Acme Solutions, Inc. ("Provider"), a corporation registered in Delaware, and BetaCorp LLC ("Client"), a limited liability company registered in California.
1. SERVICES: Provider agrees to perform software development and consulting services for Client as outlined in Exhibit A. Services will commence on February 1, 2025, and are expected to conclude by August 1, 2025, unless extended in writing.
2. COMPENSATION: Client shall pay Provider a fixed fee of $50,000, payable in five equal installments of $10,000 due on the first of each month starting February 1, 2025.
...
...
...
Signed,
Acme Solutions, Inc.
BetaCorp LLC
"""
```

Let's run the following code to generate the summary:

```python
# replace llm with your LLM summarizer
summary = llm.summarize(document_content)
print(summary)
```

This yields the following summary:

```
This agreement establishes a business relationship between Acme Solutions, Inc.,
a Delaware corporation, and BetaCorp LLC, a company based in California.
It specifies that Acme Solutions will provide software development and consulting
services to BetaCorp for a defined period, beginning February 1, 2025, and
potentially ending on August 1, 2025. The document includes details about the
responsibilities of each party, confidentiality obligations, and the termination
process, which requires 30 days' written notice. Additionally, it states that
California law will govern the agreement. However, no details are included regarding
the payment structure.
```

Immediately, you can see that there are **two issues** with the generated summary: first, it’s too lengthy. The legal document summarizer we’re building is designed to help lawyers work efficiently, so keeping summaries concise is essential. Second, the summary omits the compensation details, which is a significant problem. A complete summary is crucial to ensure that lawyers don’t miss any vital information in their fast-paced line of work.

:::info
**Generating even more summaries** can help reveal additional issues in your LLM summarizer, such as fluency problems (especially in non-English languages) or hallucinations that only surface in some generations.
:::
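
A minimal sketch of such a spot check, assuming a hypothetical `sample_documents` list and the same `llm.summarize()` helper used above:

```python
# Spot-check a handful of documents to surface recurring issues.
# `sample_documents` is a hypothetical list of document strings.
for doc in sample_documents:
    print(llm.summarize(doc))
    print("-" * 40)  # separator between summaries
```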

## Defining Your Evaluation Criteria

From generating a single summary, we've already identified **two key points** that matter for our LLM summarizer:

- The summary must be concise.
- The summary must be complete.

These points define our **evaluation criteria**. In practice, you'll want to test your summarizer with as many documents as possible. The more examples you run, the more patterns and gaps you'll uncover, helping you refine and build a comprehensive set of evaluation criteria.

:::note
Your evaluation criteria are **not set in stone**. As your LLM application moves into production, ongoing **user feedback** will be essential for refining your evaluation criteria—which ultimately matters more than your initial priorities.

:::

Next, let’s explore how to go from evaluation criteria to choosing metrics.
48 changes: 48 additions & 0 deletions docs/tutorials/legal-doc-summarizer-introduction.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
---
id: legal-doc-summarizer-introduction
title: Introduction
sidebar_label: Introduction
---

In this tutorial, we'll go through the entire process of evaluating a legal document summarizer, from choosing your metrics to running evaluations and monitoring performance in production.

:::tip
If you're working with LLMs for summarization, this tutorial is for you. While we focus on evaluating a legal document summarizer, the concepts apply to **any LLM application that generates summaries**.

:::

We'll cover:

- How to define summarization criteria
- Selecting the right summarization metrics
- Running evaluations on your summarizer
- Iterating on your summarizer’s hyperparameters
- Monitoring and evaluating LLM summarizers in production

:::note
Before we begin, make sure you're logged into Confident AI. If you haven’t set up your account yet, visit the [setting up section](tutorial-setup).

```
deepeval login
```

:::

## Legal Document Summarizer

The LLM summarizer application we'll be evaluating is designed to **extract key points** from legal texts while maintaining their _original intent_. This ensures that important clauses, obligations, and legal nuances are preserved without unnecessary details or misinterpretation.

:::info
We'll be using GPT-3.5 to generate these summaries. Below is the prompt template that will guide the model:

```
You are an AI assistant tasked with summarizing legal documents
concisely and accurately. Given the following legal text, generate
a summary that captures the key points while avoiding unnecessary
details. Ensure neutrality and refrain from interpreting beyond the
provided text.
```

:::
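
To make this setup concrete, here's a minimal sketch of how such a summarizer might be wired up, assuming the OpenAI Python SDK; the `LegalDocSummarizer` class and its `summarize()` method are illustrative stand-ins for whatever summarizer you're evaluating:

```python
from openai import OpenAI

PROMPT_TEMPLATE = (
    "You are an AI assistant tasked with summarizing legal documents "
    "concisely and accurately. Given the following legal text, generate "
    "a summary that captures the key points while avoiding unnecessary "
    "details. Ensure neutrality and refrain from interpreting beyond the "
    "provided text."
)

class LegalDocSummarizer:
    """Hypothetical wrapper exposing the `summarize()` method used in this tutorial."""

    def __init__(self, model: str = "gpt-3.5-turbo"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def summarize(self, document: str) -> str:
        # Send the prompt template as the system message and the document as the user message
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": PROMPT_TEMPLATE},
                {"role": "user", "content": document},
            ],
        )
        return response.choices[0].message.content

llm = LegalDocSummarizer()
```

The following sections simply call `llm.summarize(...)`, so any implementation exposing that method will work with the rest of this tutorial.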

Now that we've established the context for our summarizer, let's move on to [defining our evaluation criteria](legal-doc-summarizer-defining-a-summarization-criteria) in the next section.
98 changes: 98 additions & 0 deletions docs/tutorials/legal-doc-summarizer-running-an-evaluation.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
---
id: legal-doc-summarizer-running-an-evaluation
title: Running an Evaluation
sidebar_label: Running an Evaluation
---

Before running evaluations, we need to **construct a dataset** containing the documents we want to summarize, along with the summaries our LLM summarizer generates for them. This will allow us to apply our metrics directly to the dataset when running our evaluations.

:::caution important
You'll want to log in to Confident AI before running an evaluation to enable data persistence.

```
deepeval login
```

:::

## Constructing a Dataset

If you're building a document summarizer, you likely have a folder of PDFs ready to be processed. First, you'll want to parse these PDFs into strings that can be passed into your LLM summarizer.

```python
import os
import PyPDF2

def extract_text(pdf_path):
    """Extract text from a PDF file."""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        text = "\n".join(page.extract_text() for page in reader.pages if page.extract_text())
    return text

# Replace with your folder containing PDFs
pdf_folder = "path/to/pdf/folder"
documents = []  # List to store extracted document strings

# Iterate over PDF files in the folder
for pdf_file in os.listdir(pdf_folder):
    if pdf_file.endswith(".pdf"):
        pdf_path = os.path.join(pdf_folder, pdf_file)
        document_text = extract_text(pdf_path)
        documents.append(document_text)  # Store extracted text
```

Next, we'll call our legal document summarizer `llm.summarize()` on the extracted document texts to generate the summaries for our evaluation dataset. You should replace this function with your actual summarizer.

```python
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from some_llm_library import llm # Replace with the actual LLM library

# Convert document strings to test cases with LLM summaries
test_cases = [LLMTestCase(input=doc, actual_output=llm.summarize(doc)) for doc in documents]

# Create the evaluation dataset
dataset = EvaluationDataset(test_cases=test_cases)
```

:::info
An `EvaluationDataset` consists of a series of test cases. Each test case contains an `input`, which represents the document we feed into the summarizer, and the `actual_output`, which is the summary generated by the LLM. [More on test cases here](/docs/evaluation-test-cases).
:::

Keep in mind that, for the sake of this tutorial, our `EvaluationDataset` consists of 5 test cases, and our first `test_case` corresponds to the service agreement we inspected when we first [defined our evaluation criteria](legal-doc-summarizer-defining-a-summarization-criteria).

```python
print(dataset.test_cases[0].input)
#CONTRACT FOR SERVICES...
print(dataset.test_cases[0].actual_output)
#This agreement establishes...
print(len(dataset.test_cases))
# 5
```
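
Since you're already logged in to Confident AI, you can optionally keep this dataset on the platform so it can be reused across evaluation runs. A quick sketch, where the alias is just an arbitrary name chosen for this example:

```python
# Optional: store the dataset on Confident AI for reuse across evaluation runs.
# "Legal Document Summaries" is an example alias; pick any name you like.
dataset.push(alias="Legal Document Summaries")
```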

Now that the dataset is ready, we can finally begin running our first evaluation.

## Running an Evaluation

To run an evaluation, first log in to Confident AI.

```
deepeval login
```

Then, pass the metrics we defined in the [previous section](legal-doc-summarizer-selecting-your-metrics) along with the dataset we created into the `evaluate` function.

```python
from deepeval import evaluate

evaluate(dataset, metrics=[concision_metric, completeness_metric])
```

:::tip
The `evaluate` function offers flexible customization for how you want to run evaluations, allowing you to control concurrency for asynchronous operations and manage error handling. Learn more about these options [here](#).
:::

## Analyzing your Test Report

Once your evaluation is complete, you'll be redirected to a Confident AI page displaying the testing report for the five document summaries we generated and evaluated earlier. Each test case includes its status (pass or fail), input (the document), and actual output (the summary).
68 changes: 68 additions & 0 deletions docs/tutorials/legal-doc-summarizer-selecting-your-metrics.mdx
@@ -0,0 +1,68 @@
---
id: legal-doc-summarizer-selecting-your-metrics
title: Selecting the Right Summarization Metrics
sidebar_label: Selecting Summarization Metrics
---

Having clear **evaluation criteria makes selecting the right summarization metrics easy**. This is because DeepEval's `GEval` metric allows you to define custom summarization metrics simply by providing your evaluation criteria. If you've followed the steps in the previous section, this is as simple as pasting in or rewording your criteria.

For example, if your goal is to generate concise summaries, your evaluation criteria might assess whether the summary remains concise while preserving all essential information from the document. Here's how you can create a custom Concision `GEval` metric:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

concision_metric = GEval(
    name="Concision",
    criteria="Assess if the actual output remains concise while preserving all essential information.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
```

:::info
When defining `GEval`, you must provide `evaluation_params`, which specify the test case parameters to evaluate. [More information here.](/docs/metrics-llm-evals)
:::
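
As a quick sanity check, you can run this metric on a single test case before setting up a full evaluation. Here's a sketch that reuses the `document_content` and `summary` variables from the earlier tutorial section:

```python
from deepeval.test_case import LLMTestCase

# Build a single test case from the service agreement and its generated summary
test_case = LLMTestCase(input=document_content, actual_output=summary)

concision_metric.measure(test_case)
print(concision_metric.score)   # normalized score between 0 and 1
print(concision_metric.reason)  # LLM-generated explanation for the score
```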

## Selecting Metrics from Evaluation Criteria

Let's now begin **selecting the metrics** for our legal document summarizer. First, we'll need to revisit the evaluation criteria we defined in the previous section:

1. The summary must be concise.
2. The summary must be complete.

The first criterion specifies that the _summary must be concise_. Fortunately, we've already defined a **Concision metric** in the section above, which we can put to use. The second criterion states that the summary must be complete, ensuring no important information is lost from the original document.

:::tip
While brief evaluation criteria (e.g., "the actual output must be concise") are acceptable, it's generally better to be more **specific** to ensure consistent metric scoring, especially given the non-deterministic nature of LLMs.
:::

### Defining a Completeness Metric

This means, instead of saying "the summary must be complete," we might specify, "the summary must retain all important information from the document." Here's how we can define a `GEval` metric for **Completeness**:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

completeness_metric = GEval(
    name="Completeness",
    criteria="Assess whether the actual output retains all key information from the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
```

:::note
`LLMTestCaseParams.INPUT` refers to your document (input to your LLM summarizer), while `LLMTestCaseParams.ACTUAL_OUTPUT` is the summary (generated output). [More information available here.](/docs/metrics-llm-evals)
:::

### Conclusion

With our two metrics for concision and completeness defined, let's begin running evaluations in the next section.

:::info Additional Tips
DeepEval offers a built-in `Summarization` metric. Other useful summarization metrics you can define with `GEval` include:

- **Fluency** – Ensures the summary is grammatically correct and natural.
- **Coherence** – Checks if the summary flows logically and maintains readability.
- **Hallucination** – Identifies incorrect or fabricated information.
- **Factual Consistency** – Ensures the summary accurately reflects the source content.

:::
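
For instance, a **Fluency** metric could be sketched with the same `GEval` pattern used above; the criteria wording here is just an illustration, not a prescribed definition:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

fluency_metric = GEval(
    name="Fluency",
    criteria="Assess whether the actual output is grammatically correct, natural, and easy to read.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
```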
