started evals

souzatharsis committed Nov 24, 2024
1 parent 7c5fd51 commit c5d5cff
Showing 48 changed files with 432 additions and 3,218 deletions.
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file modified tamingllms/_build/.doctrees/markdown/intro.doctree
Binary file modified tamingllms/_build/.doctrees/markdown/toc.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/nondeterminism.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree
31 changes: 31 additions & 0 deletions tamingllms/_build/html/_sources/markdown/evals.md
@@ -0,0 +1,31 @@
# The Challenge of Evaluating LLMs

## Introduction

Evaluating Large Language Models (LLMs) is a critical process for understanding their capabilities, limitations, and potential impact. As LLMs become increasingly integrated into various applications, it's essential to have robust evaluation methods to ensure their responsible and effective use.

LLM evaluation presents unique challenges compared to traditional software evaluation methods:
- **Focus on Capabilities, Not Just Functionality**: Traditional software evaluation verifies if the software performs its intended functions, while LLM evaluation assesses a broader range of capabilities, such as creative content generation and language translation, making it difficult to define success criteria.
- **Subjectivity and Difficulty in Measurement**: Traditional software success is often binary and easy to measure with metrics like speed and efficiency. In contrast, LLM evaluation involves subjective assessments of outputs like text quality and creativity, often requiring human judgment.
- **The Problem of Overfitting and Contamination**: Traditional software is less susceptible to overfitting, whereas LLMs risk benchmark contamination: their massive training datasets may already contain the data used for evaluation, leading to inflated performance scores.
- **Evolving Benchmarks and Evaluation Methods**: Traditional software testing methodologies remain stable, but LLM evaluation methods and benchmarks are constantly evolving, complicating model comparisons over time.
- **Human Evaluation Plays a Crucial Role**: In traditional software, human involvement is limited, whereas LLM evaluation often relies on human judgment to assess complex, subjective qualities, using methods ranging from informal "vibe checks" to systematic annotation.
| Aspect | Traditional Software | LLMs |
|---------------------------------------------|---------------------------------------------------|------------------------------------------------------------------------------------------|
| **Capabilities vs. Functionality** | Focus on function verification. | Assess broader capabilities beyond basic functions. |
| **Measurement** | Binary success, easy metrics. | Subjective, often requires human judgment. |
| **Overfitting**                              | Low risk; evaluated on distinct, controlled test data. | High contamination risk due to massive training datasets.                                 |
| **Benchmarks** | Stable over time. | Constantly evolving, hard to standardize. |
| **Human Evaluation** | Limited role. | Crucial for subjective assessment. |

In conclusion, evaluating LLMs demands a different approach than traditional software due to the focus on capabilities, the subjective nature of output, the risk of contamination, and the evolving nature of benchmarks. Traditional software development focuses on clear-cut functionality and measurable metrics, while LLM evaluation requires a combination of automated, human-based, and model-based approaches to capture their full range of capabilities and limitations.
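
To make the model-based part of that combination concrete, the sketch below uses one model as a judge of another model's answer. It assumes the OpenAI Python client and API key configured as in the setup chapter; the judge model name, the one-line rubric, and the `judge_response` helper are illustrative placeholders rather than a recommended evaluation protocol.

```python
from openai import OpenAI  # assumes the `openai` package is installed and OPENAI_API_KEY is set

client = OpenAI()

def judge_response(question: str, answer: str, judge_model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to rate an answer from 1 (poor) to 5 (excellent).

    The judge model name and the rubric are placeholders for illustration.
    """
    prompt = (
        "Rate the following answer for accuracy and clarity on a scale of 1 to 5. "
        "Reply with a single digit only.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    completion = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    # A single-digit reply is expected but not guaranteed; real code should validate this.
    return int(completion.choices[0].message.content.strip()[0])

print(judge_response("What is the capital of France?", "Paris is the capital of France."))
```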
LLM evaluation encompasses various approaches to assess how well these models perform on different tasks and exhibit desired qualities. This involves measuring their performance on specific tasks, such as question answering or text summarisation, understanding their ability to perform more general tasks like reasoning or code generation, and analysing their potential for bias and susceptibility to adversarial attacks.
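
As a minimal illustration of the automated, task-specific side of this, the sketch below scores a few question-answering outputs with an exact-match metric. The example pairs are hypothetical; real evaluations would use a held-out dataset and usually richer metrics (e.g. token-level F1 for QA or ROUGE for summarisation).

```python
def normalize(text: str) -> str:
    """Lowercase, trim whitespace, and drop a trailing period so formatting differences don't count as errors."""
    return " ".join(text.lower().strip().rstrip(".").split())

def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)

# Hypothetical (model output, reference answer) pairs.
examples = [
    ("Paris", "paris"),
    ("The Eiffel Tower is in Paris.", "Paris"),
    ("1969", "1969"),
]

accuracy = sum(exact_match(pred, ref) for pred, ref in examples) / len(examples)
print(f"Exact-match accuracy: {accuracy:.2f}")  # 0.67 on this toy set
```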

LLM evaluation serves several crucial purposes. Firstly, non-regression testing ensures that updates and modifications to LLMs don't negatively affect their performance or introduce new issues. Tracking evaluation scores helps developers maintain and improve model reliability. Secondly, evaluation results contribute to establishing benchmarks and ranking different LLMs based on their capabilities. These rankings inform users about the relative strengths and weaknesses of various models. Lastly, through evaluation, researchers can gain a deeper understanding of the specific abilities and limitations of LLMs. This helps identify areas for improvement and guide the development of new models with enhanced capabilities.
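
A hedged sketch of the non-regression idea: the script below compares freshly computed scores against a stored baseline and fails if any task drops by more than a small tolerance. The file name, threshold, task names, and the `run_eval_suite` stub are hypothetical placeholders; in practice the stub would run the model on a fixed evaluation set and compute real metrics.

```python
import json
from pathlib import Path

BASELINE_PATH = Path("eval_baseline.json")  # hypothetical file holding the last accepted scores
TOLERANCE = 0.02  # allow small fluctuations, e.g. from sampling noise

def run_eval_suite(model_name: str) -> dict:
    """Hypothetical stub: run the benchmark tasks for `model_name` and return one score per task."""
    return {"qa_exact_match": 0.81, "summarisation_rouge_l": 0.44}

def check_no_regression(model_name: str) -> None:
    """Fail if any task score drops below its stored baseline by more than TOLERANCE."""
    new_scores = run_eval_suite(model_name)
    baseline = json.loads(BASELINE_PATH.read_text()) if BASELINE_PATH.exists() else {}
    for task, old_score in baseline.items():
        new_score = new_scores.get(task, 0.0)
        assert new_score >= old_score - TOLERANCE, (
            f"Regression on {task}: {new_score:.3f} vs baseline {old_score:.3f}"
        )
    # No regression detected: record the new scores as the baseline for the next run.
    BASELINE_PATH.write_text(json.dumps(new_scores, indent=2))

check_no_regression("my-updated-model")
```
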
2 changes: 1 addition & 1 deletion tamingllms/_build/html/_sources/markdown/intro.md
@@ -88,7 +88,7 @@ pip install -r requirements.txt
OPENAI_API_KEY=your_openai_api_key_here
```

-```{warning}
+```{note}
Never share your `.env` file or commit it to version control. It contains sensitive information that should be kept private.
```

53 changes: 0 additions & 53 deletions tamingllms/_build/html/_sources/markdown/markdown-notebooks.md

This file was deleted.

55 changes: 0 additions & 55 deletions tamingllms/_build/html/_sources/markdown/markdown.md

This file was deleted.

