Commit

updated docs

penguine-ip committed Jan 23, 2025
1 parent 86d1b11 commit 1c82f98
Showing 4 changed files with 90 additions and 110 deletions.
@@ -1,7 +1,7 @@
---
id: confident-ai-evaluation-dataset-evaluation
title: Using Datasets For Evaluation
sidebar_label: Using Datasets For Evaluation
---

## Quick Summary
154 changes: 87 additions & 67 deletions docs/confident-ai/confident-ai-evaluation-dataset-management.mdx
@@ -1,128 +1,148 @@
---
id: confident-ai-evaluation-dataset-management
title: Curating Datasets
sidebar_label: Curating Datasets
---

## Quick Summary
Confident AI provides your team a centralized place to **create, generate, upload, and edit** evaluation datasets online. You can manage evaluation datasets directly on Confident AI or using `deepeval`. To begin, create a fresh dataset on the "Datasets" page in Confident AI.

:::info
An evaluation dataset on Confident AI is a collection of goldens, which are extremely similar to test cases. You can learn more about goldens [here](#what-is-a-golden).
:::

## Create Your First Dataset

To create your first dataset, simply navigate to the **Datasets** page in your project space. There, you'll see a button that says _Create Dataset_, and you will be required to name your first dataset by providing it with an alias. This alias will be used to identify which dataset to use for evaluation later on.
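For example, if you named your dataset "QA Dataset" (a placeholder alias), this is roughly how you'd reference it later from `deepeval`, assuming you're already logged in to Confident AI:

```python
from deepeval.dataset import EvaluationDataset

# Fetch the goldens stored under this alias on Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")

print(dataset.goldens)
```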

## Populate Your Dataset With Golden(s)

Now that you've created a dataset, you can create "goldens" within your dataset that will later be converted to `LLMTestCase`s at evaluation time (we'll talk more about this later). There are a few ways you can populate your dataset with goldens, which include:

1. Creating a golden individually using Confident AI's goldens editor.
2. Importing a CSV of goldens to Confident AI.
3. Uploading a list of `Golden`s to Confident AI via `deepeval`.

:::note
If you're not sure what to include in your goldens, simply enter the inputs you're currently prompting your LLM application with when eyeballing outputs. You'll be able to automate this process by creating a list of goldens out of `input`s.
:::
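As a rough sketch of what that automation could look like in `deepeval` (the inputs and alias below are placeholders), you can wrap each `input` in a `Golden` and push them up in one go. The push workflow itself is covered in detail further down:

```python
from deepeval.dataset import EvaluationDataset, Golden

# Inputs you've been manually prompting your LLM application with
inputs = [
    "What are your operating hours?",
    "Do you offer free shipping?",
]

dataset = EvaluationDataset(goldens=[Golden(input=prompt) for prompt in inputs])
dataset.push(alias="QA Dataset")  # "QA Dataset" is a placeholder alias
```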

A [golden is basically a test case](#what-is-a-golden) that isn't ready for evaluation yet. It holds additional information needed for a better dataset annotation experience, such as the ability to mark it as "ready" for evaluation, the ability to contain empty `actual_output`s that will later be populated at evaluation time, and the inclusion of additional columns and metadata that might be useful for you at evaluation time.

:::caution
We highly recommend **AGAINST** running evaluations on datasets with pre-computed `actual_output`s. You'll want to test your LLM application on the latest `actual_output`s generated as you iterate, so if you find yourself filling in the `actual_output` field ahead of time, think again.
:::

### Create Individual Golden(s)

You can create goldens manually by clicking on the _Create Golden_ button in the **Datasets** > **Dataset Editor** page, which will open an editor for you to fill in your golden information.

### Import Golden(s) From CSV

Alternatively, you can also choose to upload a list of goldens from CSV files. Simply click on the _Upload Goldens_ button, and you'll have the opportunity to map CSV columns to golden fields when importing, as shown in the sketch after the field list below.

The golden fields include:

- `input`: a string representing the `input` to prompt your LLM application with during evaluation.
- [Optional] `actual_output`: a string representing the generated `actual_output` of your LLM application for the corresponding `input`.
- [Optional] `expected_output`: a string representing the ideal output for the corresponding `input`.
- [Optional] `retrieval_context`: a list of strings representing the retrieved text chunks of your LLM application for the corresponding `input`. This is only for RAG pipelines.
- [Optional] `context`: a list of strings representing the ground truth as supporting context.
- [Optional] `comments`: a string representing whatever comments your data annotators have for this particular golden (e.g., "Watch out for this expected output! It needs more work.").
- [Optional] `additional_metadata`: a freeform JSON object for any additional data you might want to make use of in code at evaluation time.
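For reference, here's a rough sketch of the same column-to-field mapping done in code, assuming a hypothetical `goldens.csv` whose headers match the field names above (pushing the resulting dataset is covered in the next section):

```python
import csv

from deepeval.dataset import EvaluationDataset, Golden

# "goldens.csv" is a hypothetical file; map your own column names to golden fields
with open("goldens.csv", newline="") as f:
    rows = csv.DictReader(f)
    goldens = [
        Golden(
            input=row["input"],
            expected_output=row.get("expected_output") or None,
            comments=row.get("comments") or None,
        )
        for row in rows
    ]

dataset = EvaluationDataset(goldens=goldens)
```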

### Upload Golden(s) Via DeepEval

Pushing an `EvaluationDataset` to Confident AI using `deepeval` simply involves creating an `EvaluationDataset` with a list of `Golden`s and pushing it to Confident AI by supplying the dataset alias.

Here's the data structure of a `Golden`:

```python
from typing import Optional, List, Dict

class Golden:
    input: str
    actual_output: Optional[str]
    expected_output: Optional[str]
    retrieval_context: Optional[List[str]]
    context: Optional[List[str]]
    comments: Optional[str]
    additional_metadata: Optional[Dict]
```

<details><summary>Click to see fake data we'll be uploading</summary>
<p>

```python
fake_data = [
    {
        "input": "I have a persistent cough and fever. Should I be worried?",
        "expected_output": (
            "If your cough and fever persist or worsen, it could indicate a serious condition. "
            "Persistent fevers lasting more than three days or difficulty breathing should prompt immediate medical attention. "
            "Stay hydrated and consider over-the-counter fever reducers, but consult a healthcare provider for proper diagnosis."
        ),
    },
    {
        "input": "What should I do if I accidentally cut my finger deeply?",
        "expected_output": (
            "Rinse the cut with soap and water, apply pressure to stop bleeding, and elevate the finger. "
            "Seek medical care if the cut is deep, bleeding persists, or your tetanus shot isn't up to date."
        ),
    },
]
```

</p>
</details>

And here's a quick example of how to push `Golden`s within an `EvaluationDataset` to Confident AI:

```python
from deepeval.dataset import EvaluationDataset, Golden

# See above for contents of fake_data
fake_data = [...]

goldens = []
for fake_datum in fake_data:
    golden = Golden(
        input=fake_datum["input"],
        expected_output=fake_datum["expected_output"],
    )
    goldens.append(golden)

dataset = EvaluationDataset(goldens=goldens)
```

### Push Goldens to Confident AI

After creating your `EvaluationDataset`, all you have to do is push it to Confident AI by providing an `alias` as a unique identifier.

```python
...

# Provide an alias when pushing a dataset
dataset.push(alias="QA Dataset")
```

You can also choose to overwrite or append to an existing dataset if a dataset with the same alias already exists.

```python
...

dataset.push(alias="My Confident Dataset", overwrite=False)
# Overwrite existing datasets
dataset.push(alias="QA Dataset", overwrite=True)
```

:::note
`deepeval` will prompt you in the terminal if no value for `overwrite` is provided.
:::

## What is a Golden?

A "Golden" is what makes up an evaluation dataset and is very similar to a test case in `deepeval`, but they:

- Does not require an `actual_output`, so whilst test cases are always ready for evaluation, a golden isn't.
- Only exists within an `EvaluationDataset()`, while test cases can be defined anywhere.
- Contains an extra `additional_metadata` field, which is a dictionary you can define on Confident AI. This allows you to do some extra preprocessing on your dataset (e.g., generating a custom LLM `actual_output` based on some variables in `additional_metadata`) before evaluation.

We introduced the concept of goldens because they allow you to create evaluation datasets on Confident AI without needing pre-computed `actual_output`s. This is especially helpful if you are looking to generate responses from your LLM application at evaluation time.
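To make the golden-to-test-case conversion concrete, here's a minimal sketch of what evaluation time might look like. The `generate()` function is a hypothetical stand-in for your LLM application, the "QA Dataset" alias is assumed from earlier, and `AnswerRelevancyMetric` is just one example metric:

```python
from typing import Optional

from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def generate(input: str, metadata: Optional[dict] = None) -> str:
    # Stand-in for your LLM application; swap in your real generation call
    return "..."


dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")  # hypothetical alias from earlier

test_cases = []
for golden in dataset.goldens:
    # additional_metadata can carry per-golden variables (e.g. a user persona)
    # that you use to condition generation at evaluation time
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=generate(golden.input, metadata=golden.additional_metadata),
            expected_output=golden.expected_output,
        )
    )

evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])
```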
40 changes: 0 additions & 40 deletions docs/confident-ai/confident-ai-introduction.mdx
@@ -187,46 +187,6 @@ You can also run evaluations on Confident AI using our models, but that's a more

Now that you're logged in, create a python file, for example `experiment_llm.py`. We're going to be evaluating a medical chatbot in this quickstart guide, but it can be any other LLM system that you are building.

<details><summary>Click to see fake data we'll be using</summary>
<p>

```python
fake_data = [
    {
        "input": "I have a persistent cough and fever. Should I be worried?",
        "actual_output": (
            "Based on your symptoms, it could be a sign of a viral or bacterial infection. "
            "However, if the fever persists for more than three days or you experience difficulty breathing, "
            "please consult a doctor immediately."
        ),
        "retrieval_context": [
            "Coughing that lasts more than three weeks is typically classified as a chronic cough and could indicate conditions such as asthma, chronic bronchitis, or gastroesophageal reflux disease (GERD).",
            "A fever is the body's natural response to infections, often caused by viruses or bacteria. Persistent fevers lasting more than three days should be evaluated by a healthcare professional as they may indicate conditions like pneumonia, tuberculosis, or sepsis.",
            "Shortness of breath associated with fever and cough can be a sign of serious respiratory issues such as pneumonia, bronchitis, or COVID-19.",
            "Self-care tips for mild symptoms include staying hydrated, taking over-the-counter fever reducers (e.g., acetaminophen or ibuprofen), and resting. Avoid suppressing a productive cough without consulting a healthcare provider."
        ]
    },
    {
        "input": "What should I do if I accidentally cut my finger deeply?",
        "actual_output": (
            "If you cut your finger deeply, just rinse it with water and avoid applying any pressure. "
            "Tetanus shots aren't necessary unless you see redness immediately."
        ),
        "retrieval_context": [
            "Deep cuts that are more than 0.25 inches deep or expose fat, muscle, or bone require immediate medical attention. Such wounds may need stitches to heal properly.",
            "To minimize the risk of infection, wash the wound thoroughly with soap and water. Avoid using alcohol or hydrogen peroxide, as these can irritate the tissue and delay healing.",
            "If the bleeding persists for more than 10 minutes or soaks through multiple layers of cloth or bandages, seek emergency care. Continuous bleeding might indicate damage to an artery or vein.",
            "Watch for signs of infection, including redness, swelling, warmth, pain, or pus. Infections can develop even in small cuts if not properly cleaned or if the individual is at risk (e.g., diabetic or immunocompromised).",
            "Tetanus, a bacterial infection caused by Clostridium tetani, can enter the body through open wounds. Ensure that your tetanus vaccination is up to date, especially if the wound was caused by a rusty or dirty object."
        ]
    }
]

```

</p>
</details>

```python title="experiment_llm.py"
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
```
2 changes: 1 addition & 1 deletion docs/sidebarConfidentAI.js
@@ -8,7 +8,7 @@ module.exports = {
"confident-ai-evaluation-dataset-management",
"confident-ai-evaluation-dataset-evaluation",
],
collapsed: false,
},
{
type: "category",
