
Commit

update evals
souzatharsis committed Nov 25, 2024
1 parent 8660039 commit 22195fc
Showing 42 changed files with 4,095 additions and 286 deletions.
6 changes: 5 additions & 1 deletion Makefile
@@ -4,4 +4,8 @@ build:
poetry run jupyter-book build tamingllms/

clean:
poetry run jupyter-book clean tamingllms/
poetry run jupyter-book clean tamingllms/


convert:
poetry run jupyter nbconvert --to markdown $(file)
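
For reference, the new convert target would presumably be invoked as `make convert file=path/to/notebook.ipynb`, with the `file` variable naming the notebook to convert to Markdown.
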
2 changes: 1 addition & 1 deletion poetry.lock


1 change: 1 addition & 0 deletions pyproject.toml
@@ -30,6 +30,7 @@ sphinx-press-theme = "^0.9.1"
langchain-openai = "^0.2.9"
outlines = "^0.1.5"
transformers = "^4.46.3"
nbconvert = "^7.16.4"


[build-system]
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file modified tamingllms/_build/.doctrees/markdown/intro.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/evals.doctree
Binary file modified tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree
876 changes: 876 additions & 0 deletions tamingllms/_build/html/_images/conceptual-multi.svg
Binary file added tamingllms/_build/html/_images/conceptual.png
Binary file added tamingllms/_build/html/_images/diagram1.png
Binary file added tamingllms/_build/html/_images/emerging.png
13 changes: 12 additions & 1 deletion tamingllms/_build/html/_sources/markdown/intro.md
@@ -1,3 +1,14 @@
---
jupytext:
text_representation:
extension: .md
format_name: myst
kernelspec:
display_name: Python 3
language: python
name: python3
---

(intro)=
# Introduction

@@ -109,7 +120,7 @@ pip install -r requirements.txt
1. Create a `.env` file in the root directory of the project.
2. Add your API keys and other sensitive information to the `.env` file. For example:

```
```bash
OPENAI_API_KEY=your_openai_api_key_here
```

79 changes: 78 additions & 1 deletion tamingllms/_build/html/_sources/notebooks/evals.ipynb
@@ -153,7 +153,18 @@
"\n",
"Beyond their non-deterministic nature, LLMs present another fascinating challenge: emergent abilities that spontaneously arise as models scale up in size. These abilities - from basic question answering to complex reasoning - aren't explicitly programmed but rather emerge \"naturally\" as the models grow larger and are trained on more data. This makes evaluation fundamentally different from traditional software testing, where capabilities are explicitly coded and can be tested against clear specifications.\n",
"\n",
"The relationship between model scale and emergent abilities follows a fascinating non-linear pattern. Below certain size thresholds, specific abilities may be completely absent from the model - it simply cannot perform certain tasks, no matter how much you try to coax them out. However, once the model reaches critical points in its scaling journey, these abilities can suddenly manifest in what researchers call a phase transition - a dramatic shift from inability to capability. This unpredictable emergence of capabilities stands in stark contrast to traditional software development, where features are deliberately implemented and can be systematically tested.\n",
"```{figure} ../_static/evals/emerging.png\n",
"---\n",
"name: emerging-properties\n",
"alt: Emerging Properties\n",
"class: bg-primary mb-1\n",
"scale: 60%\n",
"align: center\n",
"---\n",
"Emergent abilities of large language models and the scale {cite}`wei2022emergentabilitieslargelanguage`.\n",
"```\n",
"\n",
" {numref}`emerging-properties` provides a list of emergent abilities of large language models and the scale. The relationship between model scale and emergent abilities follows a fascinating non-linear pattern. Below certain size thresholds, specific abilities may be completely absent from the model - it simply cannot perform certain tasks, no matter how much you try to coax them out. However, once the model reaches critical points in its scaling journey, these abilities can suddenly manifest in what researchers call a phase transition - a dramatic shift from inability to capability. This unpredictable emergence of capabilities stands in stark contrast to traditional software development, where features are deliberately implemented and can be systematically tested.\n",
"\n",
"The implications for evaluation are profound. While conventional software testing relies on stable test suites and well-defined acceptance criteria, LLM evaluation must contend with a constantly shifting landscape of capabilities. What worked to evaluate a 7B parameter model may be completely inadequate for a 70B parameter model that has developed new emergent abilities. This dynamic nature of LLM capabilities forces us to fundamentally rethink our approach to testing and evaluation.\n",
"\n",
@@ -189,6 +200,72 @@
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Solutions\n",
"\n",
"### Approaches\n",
"\n",
"```{figure} ../_static/evals/conceptual.png\n",
"---\n",
"name: conceptual\n",
"alt: Conceptual Overview\n",
"scale: 40%\n",
"align: center\n",
"---\n",
"Conceptual overview of LLM-based application evaluation.\n",
"```\n",
"\n",
"{numref}`conceptual`\n",
"\n",
"\n",
"```{figure} ../_static/evals/conceptual-multi.svg\n",
"---\n",
"name: conceptual-multi\n",
"alt: Conceptual Overview\n",
"scale: 40%\n",
"align: center\n",
"---\n",
"Conceptual overview of Multiple LLM-based applications evaluation.\n",
"```\n",
"\n",
"{numref}`conceptual-multi`\n",
"\n",
"### Evals Design\n",
"\n",
"### Human-Based Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Metrics-Based Evaluation\n",
"\n",
"A metrics-based approach enables automated benchmarking for evaluating LLM performance on specific tasks and capabilities. It provides a quantifiable and repeatable way to measure progress and identify areas for improvement. This is particularly useful for well-defined tasks, such as spam classification, data extraction or translation, where clear and objective evaluation criteria can be established. \n",
"\n",
"The core approach involves using pre-existing datasets (golden datasets) and establishing objective metrics to evaluate model performance. The process typically involves the following steps:\n",
"\n",
"1. **Selecting a relevant benchmark dataset:** The choice of dataset depends on the specific task or capability being evaluated. For example, the HumanEval dataset is used to evaluate code generation capabilities, while ChartQA focuses on chart understanding.\n",
"2. **Providing input to the LLM:** The LLM is given input from the selected dataset, prompting it to perform the specific task, such as answering questions, generating text, or translating languages.\n",
"3. **Comparing outputs to expected answers:** The LLM's outputs are compared to the expected or correct answers provided in the benchmark dataset.\n",
"4. **Quantifying the comparison using metrics:** The comparison is quantified using pre-defined metrics relevant to the task, providing a numerical score that reflects the LLM's performance. For instance, accuracy, precision, and recall are common metrics for classification tasks.\n",
"\n",
" The LLM is given input from the dataset, and its outputs are compared to expected or correct answers. The comparison is quantified using specific metrics relevant to the task. This approach enables efficient and automated evaluation, allowing for large-scale comparisons and tracking of progress over time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"```{bibliography}\n",
":filter: docname in docnames\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
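
The metrics-based workflow described in the evals notebook above reduces to a small harness: feed benchmark inputs to the model, compare outputs against expected answers, and report a metric. The sketch below is illustrative only; `llm_app`, the toy golden dataset, and the exact-match accuracy metric are stand-ins for a real model call, a benchmark such as HumanEval or ChartQA, and task-appropriate metrics.

```python
# Minimal sketch of a metrics-based evaluation loop (illustrative only).
# `llm_app`, the toy dataset, and exact-match accuracy are placeholders for a
# real model call, a benchmark dataset, and task-specific metrics.
from typing import Callable

# 1. Select a benchmark dataset: pairs of (input, expected answer).
golden_dataset = [
    {"input": "Is 'WIN A FREE PRIZE NOW!!!' spam? Answer yes or no.", "expected": "yes"},
    {"input": "Is 'Lunch at noon tomorrow?' spam? Answer yes or no.", "expected": "no"},
]

def evaluate(llm_app: Callable[[str], str]) -> float:
    """Run the app on every example and return exact-match accuracy."""
    correct = 0
    for example in golden_dataset:
        output = llm_app(example["input"])  # 2. Provide input to the LLM.
        if output.strip().lower() == example["expected"]:  # 3. Compare to the expected answer.
            correct += 1
    return correct / len(golden_dataset)  # 4. Quantify the comparison (accuracy).

if __name__ == "__main__":
    # Placeholder app; replace with a real LLM call.
    dummy_app = lambda prompt: "yes" if "PRIZE" in prompt else "no"
    print(f"accuracy = {evaluate(dummy_app):.2f}")
```
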
14 changes: 14 additions & 0 deletions tamingllms/_build/html/_sources/notebooks/output_size_limit.ipynb
@@ -63,6 +63,20 @@
"- Goal: Generate a long-form report analyzing a company's financial statement.\n",
"- Input: A company's 10K SEC filing.\n",
"\n",
"```{figure} ../_static/structured_output/diagram1.png\n",
"---\n",
"name: content-chunking-with-contextual-linking\n",
"alt: Content Chunking with Contextual Linking\n",
"scale: 50%\n",
"align: center\n",
"---\n",
"Content Chunking with Contextual Linking Schematic Representation.\n",
"```\n",
"\n",
"The diagram in {numref}`content-chunking-with-contextual-linking` illustrates the process we will follow for handling long-form content generation with Large Language Models through \"Content Chunking with Contextual Linking.\" It shows how input content is first split into manageable chunks using a chunking function (e.g. `CharacterTextSplitter` with `tiktoken` tokenizer), then each chunk is processed sequentially while maintaining context from previous chunks. For each chunk, the system updates the context, generates a dynamic prompt with specific parameters, makes a call to the LLM chain, and stores the response. After all chunks are processed, the individual responses are combined with newlines to create the final report, effectively working around the token limit constraints of LLMs while maintaining coherence across the generated content.\n",
"\n",
"\n",
"\n",
"#### Step 1: Chunking the Content\n",
"\n",
"There are different methods for chunking, and each of them might be appropriate for different situations. However, we can broadly group chunking strategies in two types:\n",
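
The "Content Chunking with Contextual Linking" loop described in the notebook above can be sketched in a few lines of Python. This is a simplified stand-in: a naive paragraph splitter and a placeholder `llm_call` replace the notebook's `CharacterTextSplitter`/`tiktoken` setup and its LLM chain.

```python
# Illustrative sketch of Content Chunking with Contextual Linking.
# split_into_chunks and llm_call are placeholders for the notebook's
# CharacterTextSplitter (tiktoken-based) and its LLM chain call.

def split_into_chunks(text: str, max_chars: int = 4000) -> list[str]:
    """Greedily pack paragraphs into chunks under a character budget."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current)
            current = ""
        current += paragraph + "\n\n"
    if current:
        chunks.append(current)
    return chunks

def llm_call(prompt: str) -> str:
    """Placeholder for the notebook's LLM chain call."""
    return f"[analysis generated from a {len(prompt)}-character prompt]"

def generate_report(filing_text: str) -> str:
    """Generate a long-form report chunk by chunk, linking context across chunks."""
    responses, context = [], ""
    for i, chunk in enumerate(split_into_chunks(filing_text), start=1):
        # Dynamic prompt: carry forward the previously generated section as context.
        prompt = (
            f"Context from the previous section:\n{context}\n\n"
            f"Analyze part {i} of the 10-K filing:\n{chunk}"
        )
        response = llm_call(prompt)
        responses.append(response)
        context = response  # Update the running context for the next chunk.
    # Combine the individual responses into the final report.
    return "\n".join(responses)
```
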
143 changes: 143 additions & 0 deletions tamingllms/_build/html/_static/evals/conceptual-multi.d2
@@ -0,0 +1,143 @@
# Define the main container
container: {
  label: Multiple LLM Applications Evaluation {
    style.font-color: "#dd0000"
    style.font-size: 20
  }
  examples: Examples {
    shape: document
    style.multiple: true
  }

  # Multiple Applications section
  apps: {
    app1: {
      run: Run LLM_1 {
        shape: sequence_diagram
        style.font-color: "#dd0000"
      }
    }
    app2: {
      run: Run LLM_2 {
        shape: sequence_diagram
        style.font-color: "#dd0000"
      }
    }
    dots: |md
      ...
    |
    appN: {
      run: Run LLM_N {
        shape: sequence_diagram
        style.font-color: "#dd0000"
      }
    }
    label: LLM Apps {
      style.font-color: "#dd0000"
      style.font-size: 20
    }
  }

  # Evaluator section
  evaluator: {
    output: Output {
      shape: rectangle
      style.stroke: "#00aa00"
      style.fill: "#e6ffe6"
      style.font-size: 16
    }

    label: Evaluator {
      style.font-color: "#00aa00"
      style.font-size: 20
    }
  }

  # Score outputs
  scores: {
    score1: Score 1 {
      style.font-color: "#ff8800"
      style.font-size: 18
    }
    score2: Score 2 {
      style.font-color: "#ff8800"
      style.font-size: 18
    }
    dots: |md
      ...
    |
    scoreN: Score N {
      style.font-color: "#ff8800"
      style.font-size: 18
    }
  }

  # Ranking section
  ranking: {
    label: Leaderboard {
      style.font-color: "#6600cc"
      style.font-size: 20
    }

    board: {
      shape: page
      style.stroke: "#6600cc"
      style.fill: "#f5f0ff"

      label: "1. App 2 (0.95)\n2. App 1 (0.92)\n...\nN. App k (0.88)"
    }
  }

  # Connections
  examples -> evaluator.output: "(Optional)" {
    style.stroke: "#0066cc"
    style.stroke-dash: 5
  }

  # Input connections
  examples -> apps.app1.run: Input {
    style.stroke: "#dd0000"
  }
  examples -> apps.app2.run: Input {
    style.stroke: "#dd0000"
  }
  examples -> apps.appN.run: Input {
    style.stroke: "#dd0000"
  }

  # Output connections
  apps.app1.run -> evaluator.output: Output {
    style.stroke: "#dd0000"
  }
  apps.app2.run -> evaluator.output: Output {
    style.stroke: "#dd0000"
  }
  apps.appN.run -> evaluator.output: Output {
    style.stroke: "#dd0000"
  }

  # Score connections
  evaluator.output -> scores.score1
  evaluator.output -> scores.score2
  evaluator.output -> scores.scoreN

  # Ranking connections
  scores.score1 -> ranking.board: {
    style.stroke: "#6600cc"
  }
  scores.score2 -> ranking.board: {
    style.stroke: "#6600cc"
  }
  scores.scoreN -> ranking.board: {
    style.stroke: "#6600cc"
  }
}

# Container styling
container.style: {
  stroke-width: 2
  border-radius: 10
}

# Global styling
direction: right
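
A rough Python rendering of the pipeline this diagram describes (examples fanned out to several LLM apps, a shared evaluator producing one score per app, and a leaderboard ranking the apps) is sketched below. The app callables and the exact-match scoring rule are placeholders, not part of the repository.

```python
# Rough sketch of the multi-application evaluation pipeline in the diagram:
# examples -> N LLM apps -> evaluator -> per-app scores -> leaderboard.
# The apps and the exact-match scoring rule are placeholders.
from typing import Callable

def evaluator(outputs: list[str], examples: list[dict]) -> float:
    """Score one application's outputs against the expected answers (exact match)."""
    matches = sum(
        out.strip().lower() == ex["expected"].lower()
        for out, ex in zip(outputs, examples)
    )
    return matches / len(examples)

def build_leaderboard(
    apps: dict[str, Callable[[str], str]], examples: list[dict]
) -> list[tuple[str, float]]:
    """Run every app on the shared examples, score each one, and rank the apps."""
    scores = {}
    for name, app in apps.items():
        outputs = [app(ex["input"]) for ex in examples]  # Run LLM_k on every example.
        scores[name] = evaluator(outputs, examples)      # One score per application.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)  # Leaderboard.

if __name__ == "__main__":
    examples = [{"input": "2 + 2 = ?", "expected": "4"}]
    apps = {"App 1": lambda q: "4", "App 2": lambda q: "five"}
    for rank, (name, score) in enumerate(build_leaderboard(apps, examples), start=1):
        print(f"{rank}. {name} ({score:.2f})")
```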
