Commit c5d5cff (parent 7c5fd51), committed by souzatharsis on Nov 24, 2024. Showing 48 changed files with 432 additions and 3,218 deletions.
Binary files changed (most not shown). Modified:

- tamingllms/_build/.doctrees/notebooks/nondeterminism.doctree (-1.12 KB, 94%)
- tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree (-933 Bytes, 99%)
@@ -0,0 +1,31 @@
# The Challenge of Evaluating LLMs

## Introduction

Evaluating Large Language Models (LLMs) is a critical process for understanding their capabilities, limitations, and potential impact. As LLMs become increasingly integrated into various applications, robust evaluation methods are essential to ensure their responsible and effective use.
LLM evaluation presents unique challenges compared to traditional software evaluation:

- **Focus on Capabilities, Not Just Functionality**: Traditional software evaluation verifies that the software performs its intended functions, while LLM evaluation assesses a broader range of capabilities, such as creative content generation and language translation, making success criteria harder to define.
- **Subjectivity and Difficulty in Measurement**: Traditional software success is often binary and easy to measure with metrics like speed and efficiency. In contrast, LLM evaluation involves subjective assessments of outputs, such as text quality and creativity, that often require human judgment.
- **The Problem of Overfitting and Contamination**: Traditional software is less susceptible to overfitting, whereas LLMs risk contamination because their massive training datasets may already contain benchmark data, potentially inflating performance scores (a minimal detection sketch follows the comparison table below).
- **Evolving Benchmarks and Evaluation Methods**: Traditional software testing methodologies remain stable, but LLM evaluation methods and benchmarks are constantly evolving, complicating model comparisons over time.
- **Human Evaluation Plays a Crucial Role**: In traditional software, human involvement is limited, whereas LLM evaluation often relies on human judgment to assess complex, subjective qualities, using methods ranging from informal "vibes checks" to systematic annotation.
| Aspect | Traditional Software | LLMs |
|--------|----------------------|------|
| **Capabilities vs. Functionality** | Focus on function verification. | Assess broader capabilities beyond basic functions. |
| **Measurement** | Binary success, easy metrics. | Subjective, often requires human judgment. |
| **Overfitting** | Less risk due to distinct, controlled datasets. | High risk due to massive training datasets. |
| **Benchmarks** | Stable over time. | Constantly evolving, hard to standardize. |
| **Human Evaluation** | Limited role. | Crucial for subjective assessment. |
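To make the contamination risk concrete, here is a minimal sketch of one common detection heuristic: flagging benchmark items whose word-level n-grams also appear in a sample of the training corpus. The function names, the n-gram length of 8, and the plain whitespace tokenization are illustrative assumptions, not taken from any particular evaluation suite.

```python
# Hypothetical contamination check: flag test items that share at least one
# word-level n-gram with a training-corpus sample. All names and the n=8
# default are illustrative assumptions.
from typing import Iterable, Set


def ngrams(text: str, n: int = 8) -> Set[str]:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items: Iterable[str],
                       training_corpus: str,
                       n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with the corpus."""
    corpus_grams = ngrams(training_corpus, n)
    items = list(test_items)
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_grams)
    return flagged / max(len(items), 1)
```

Real contamination audits operate at far larger scale and with fuzzier matching, but the underlying principle is the same: measure overlap between evaluation data and training data before trusting a score.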
In conclusion, evaluating LLMs demands a different approach than traditional software due to the focus on capabilities, the subjective nature of outputs, the risk of contamination, and the evolving nature of benchmarks. Traditional software development focuses on clear-cut functionality and measurable metrics, while LLM evaluation requires a combination of automated, human-based, and model-based approaches to capture the full range of capabilities and limitations.
LLM evaluation encompasses various approaches to assess how well these models perform on different tasks and exhibit desired qualities. This involves measuring their performance on specific tasks, such as question answering or text summarisation, understanding their ability to perform more general tasks like reasoning or code generation, and analysing their potential for bias and susceptibility to adversarial attacks.
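As one concrete illustration, the following sketch scores a model on a question-answering set using exact-match accuracy, one of the simplest automated metrics. The `model` callable, the `qa_pairs` format, and the normalization rules are assumptions made for the example, not a prescribed interface.

```python
# Minimal sketch of automated task-specific evaluation: exact-match accuracy
# on a question-answering set. `model` stands in for any callable mapping a
# question string to an answer string; normalization rules are illustrative.
import string


def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(answer.split())


def exact_match_accuracy(model, qa_pairs) -> float:
    """Fraction of questions whose normalized answer matches the reference."""
    pairs = list(qa_pairs)
    hits = sum(
        normalize(model(question)) == normalize(reference)
        for question, reference in pairs
    )
    return hits / max(len(pairs), 1)
```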
LLM evaluation serves several crucial purposes. Firstly, non-regression testing ensures that updates and modifications to LLMs don't negatively affect their performance or introduce new issues. Tracking evaluation scores helps developers maintain and improve model reliability. Secondly, evaluation results contribute to establishing benchmarks and ranking different LLMs based on their capabilities. These rankings inform users about the relative strengths and weaknesses of various models. Lastly, through evaluation, researchers can gain a deeper understanding of the specific abilities and limitations of LLMs. This helps identify areas for improvement and guide the development of new models with enhanced capabilities. |
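A minimal sketch of what such a non-regression gate might look like in practice follows; the metric names, baseline values, and tolerance are hypothetical placeholders, not values from any real evaluation pipeline.

```python
# Hedged sketch of a non-regression gate: re-run a fixed evaluation suite on
# a candidate model and fail if any tracked score drops more than an allowed
# margin below the recorded baseline. All names and numbers are illustrative.

BASELINE = {"qa_exact_match": 0.82, "summarization_rougeL": 0.41}
MAX_REGRESSION = 0.01  # tolerated absolute drop per metric


def check_non_regression(candidate_scores: dict) -> list:
    """Return the metrics that regressed beyond the tolerated margin."""
    return [
        metric
        for metric, baseline in BASELINE.items()
        if candidate_scores.get(metric, 0.0) < baseline - MAX_REGRESSION
    ]


# Example: the QA score dropped 0.03, so the gate fails on that metric.
regressions = check_non_regression({"qa_exact_match": 0.79,
                                    "summarization_rougeL": 0.42})
if regressions:
    raise SystemExit(f"Regression detected in: {', '.join(regressions)}")
```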
Deleted: tamingllms/_build/html/_sources/markdown/markdown-notebooks.md (53 changes: 0 additions & 53 deletions). Additional deleted files could not be displayed.