---
id: confident-ai-evaluation-dataset-management
title: Curating Datasets
sidebar_label: Curating Datasets
---

## Quick Summary

Confident AI provides your team a centralized place to **create, generate, upload, and edit** evaluation datasets online. You can manage evaluation datasets either directly on Confident AI or via `deepeval`. To begin, create a fresh dataset on Confident AI on the "Datasets" page.

:::info
An evaluation dataset on Confident AI is a collection of goldens, which are extremely similar to test cases. You can learn more about goldens [here.](#what-is-a-golden)
:::

## Create Your First Dataset

To create your first dataset, simply navigate to the **Datasets** page in your project space. There, you'll see a button that says _Create Dataset_, and you will be required to name your first dataset by providing it with an alias. This alias will later be used to identify which dataset to use for evaluation.

## Populate Your Dataset With Golden(s)

Now that you've created a dataset, you can create "goldens" within your dataset that will later be converted to `LLMTestCase`s at evaluation time (we'll talk more about this later). There are a few ways you can populate your dataset with goldens:

1. Creating goldens individually using Confident AI's goldens editor.
2. Importing a CSV of goldens to Confident AI.
3. Uploading a list of `Golden`s to Confident AI via `deepeval`.

:::note
If you're not sure what to include in your goldens, simply enter the inputs you're currently prompting your LLM application with when eyeballing outputs. You'll be able to automate this process by creating a list of goldens out of those `input`s, as sketched below.
:::
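
For example, here's a minimal sketch of turning a handful of existing prompts into goldens with `deepeval`. The prompts are placeholders for whatever `input`s you're already using:

```python
from deepeval.dataset import EvaluationDataset, Golden

# Placeholder inputs you might already be prompting your LLM application with
prompts = [
    "What are your operating hours?",
    "Do you offer free shipping?",
    "What is your return policy?",
]

# One golden per input; the remaining fields can be filled in later on Confident AI
goldens = [Golden(input=prompt) for prompt in prompts]
dataset = EvaluationDataset(goldens=goldens)
```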

A [golden is basically a test case](#what-is-a-golden) that isn't ready for evaluation yet. It holds additional information needed for a better dataset annotation experience, such as the ability to mark it as "ready" for evaluation, the ability to contain empty `actual_output`s that will later be populated at evaluation time, and the inclusion of additional columns and metadata that might be useful to you at evaluation time.

:::caution
We highly recommend **AGAINST** running evaluations on pre-computed evaluation datasets, since you'll want to test your LLM application on the latest `actual_output`s generated as a consequence of your iteration. If you find yourself filling in the `actual_output` field ahead of time, think again.
:::
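
To make this concrete, here's a rough sketch (not a definitive workflow) of generating `actual_output`s at evaluation time instead. It assumes `your_llm_app` is a hypothetical stand-in for your own LLM application, that the dataset was pushed under the alias "QA Dataset", and that your `deepeval` version pulls goldens into `dataset.goldens`:

```python
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

def your_llm_app(input: str) -> str:
    # Hypothetical stand-in for your actual LLM application
    ...

# Pull the goldens you curated on Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")

# Generate fresh actual_outputs at evaluation time instead of storing them in the dataset
test_cases = [
    LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input),
        expected_output=golden.expected_output,
    )
    for golden in dataset.goldens
]
```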

### Create Individual Golden(s)

You can create goldens manually by clicking on the _Create Golden_ button in the **Datasets** > **Dataset Editor** page, which will open an editor for you to fill in your golden information.

### Import Golden(s) From CSV

Alternatively, you can also choose to upload a list of goldens from CSV files. Simply click on the _Upload Goldens_ button, and you'll have the opportunity to map CSV columns to golden fields when importing.

The golden fields include:

- `input`: a string representing the `input` to prompt your LLM application with during evaluation.
- [Optional] `actual_output`: a string representing the generated `actual_output` of your LLM application for the corresponding `input`.
- [Optional] `expected_output`: a string representing the ideal output for the corresponding `input`.
- [Optional] `retrieval_context`: a list of strings representing the retrieved text chunks of your LLM application for the corresponding `input`. This is only for RAG pipelines.
- [Optional] `context`: a list of strings representing the ground truth as supporting context.
- [Optional] `comments`: a string representing whatever comments your data annotators have for this particular golden (e.g. "Watch out for this expected output! It needs more work.").
- [Optional] `additional_metadata`: a freeform JSON which you can use to include any additional data that you can later make use of in code at evaluation time.
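
For reference, here's a sketch of a fully populated golden expressed as the `Golden` object you'd construct in `deepeval` (its data structure is shown in the next section); the values below are made up purely for illustration:

```python
from deepeval.dataset import Golden

# Only `input` is required; every other field is optional and illustrative here
golden = Golden(
    input="What is your return policy?",
    expected_output="You can return any item within 30 days for a full refund.",
    retrieval_context=["Returns are accepted within 30 days of purchase."],
    context=["Our return policy allows full refunds within 30 days."],
    comments="Double-check the expected output against the latest policy.",
    additional_metadata={"source": "faq-page"},
)
```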

### Upload Golden(s) Via DeepEval

Pushing an `EvaluationDataset` to Confident using `deepeval` simply involves creating an `EvaluationDataset` with a list of `Golden`s and pushing it to Confident AI by supplying the dataset alias.

Here's the data structure of a `Golden`:

```python
from typing import Optional, List, Dict

class Golden:
    input: str
    actual_output: Optional[str]
    expected_output: Optional[str]
    retrieval_context: Optional[List[str]]
    context: Optional[List[str]]
    comments: Optional[str]
    additional_metadata: Optional[Dict]
```

<details><summary>Click to see fake data we'll be uploading</summary>
<p>

```python
fake_data = [
    {
        "input": "I have a persistent cough and fever. Should I be worried?",
        "expected_output": (
            "If your cough and fever persist or worsen, it could indicate a serious condition. "
            "Persistent fevers lasting more than three days or difficulty breathing should prompt immediate medical attention. "
            "Stay hydrated and consider over-the-counter fever reducers, but consult a healthcare provider for proper diagnosis."
        ),
    },
    {
        "input": "What should I do if I accidentally cut my finger deeply?",
        "expected_output": (
            "Rinse the cut with soap and water, apply pressure to stop bleeding, and elevate the finger. "
            "Seek medical care if the cut is deep, bleeding persists, or your tetanus shot isn't up to date."
        ),
    },
]
```

</p>
</details>

And here's a quick example of how to push `Golden`s within an `EvaluationDataset` to Confident AI:

```python
from deepeval.dataset import EvaluationDataset, Golden

# See above for contents of fake_data
fake_data = [...]

goldens = []
for fake_datum in fake_data:
    golden = Golden(
        input=fake_datum["input"],
        expected_output=fake_datum["expected_output"],
    )
    goldens.append(golden)

dataset = EvaluationDataset(goldens=goldens)
```

### Push Goldens to Confident AI

After creating your `EvaluationDataset`, all you have to do is push it to Confident by providing an `alias` as a unique identifier.

```python
...

dataset.push(alias="QA Dataset")
```

You can also choose to overwrite or append to an existing dataset if a dataset with the same alias already exists.

```python
...

# Overwrite existing datasets
dataset.push(alias="QA Dataset", overwrite=True)
```

:::note
`deepeval` will prompt you in the terminal if no value for `overwrite` is provided.
:::

## What is a Golden?

A "golden" is what makes up an evaluation dataset and is very similar to a test case in `deepeval`, but it:

- Does not require an `actual_output`, so while test cases are always ready for evaluation, a golden isn't.
- Only exists within an `EvaluationDataset()`, while test cases can be defined anywhere.
- Contains an extra `additional_metadata` field, which is a dictionary you can define on Confident. This allows you to do some extra preprocessing on your dataset (e.g., generating a custom LLM `actual_output` based on some variables in `additional_metadata`) before evaluation, as sketched at the end of this section.

We introduced the concept of goldens because it allows you to create evaluation datasets on Confident without needing pre-computed `actual_output`s. This is especially helpful if you are looking to generate responses from your LLM application at evaluation time.
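
To illustrate the `additional_metadata` point above, here's a rough sketch (not an official recipe) of using that field to influence how `actual_output`s are generated at evaluation time. The `tone` key and the `your_llm_app` function are hypothetical, and `dataset` is assumed to be an `EvaluationDataset` whose goldens were pulled from Confident AI:

```python
from deepeval.test_case import LLMTestCase

def your_llm_app(input: str, tone: str) -> str:
    # Hypothetical stand-in for your LLM application, customized by metadata
    ...

test_cases = []
for golden in dataset.goldens:
    # `additional_metadata` is the freeform dictionary you defined on Confident
    metadata = golden.additional_metadata or {}
    actual_output = your_llm_app(golden.input, tone=metadata.get("tone", "neutral"))
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            expected_output=golden.expected_output,
        )
    )
```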